I have about 3000 training examples in JSON format, and I train using Rasa X. Training completes when I set DIETClassifier’s epochs to 70. But when I set it to 80, the Rasa X UI says training has failed, yet when I check the logs I don’t see any errors in my worker and production containers.
Training log at 70 epochs: (training ends here and UI says successful)
Epochs: 100%|██████████| 70/70 [46:41<00:00, 40.03s/it, t_loss=3.49, i_loss=0.0592, i_acc=0.998]
2021-05-07 08:59:50 INFO rasa.nlu.model - Finished training component.
2021-05-07 08:59:50 INFO rasa.nlu.model - Starting to train component ResponseSelector
2021-05-07 08:59:50 INFO rasa.nlu.selectors.response_selector - Retrieval intent parameter was left to its default value. This response selector will be trained on training examples combining all retrieval intents.
2021-05-07 08:59:50 DEBUG rasa.nlu.classifiers.diet_classifier - Cannot train 'ResponseSelector'. No data was provided. Skipping training of the classifier.
2021-05-07 08:59:50 INFO rasa.nlu.model - Finished training component.
2021-05-07 08:59:50 INFO rasa.nlu.model - Starting to train component FallbackClassifier
2021-05-07 08:59:50 INFO rasa.nlu.model - Finished training component.
2021-05-07 08:59:51 INFO rasa.nlu.model - Successfully saved model into '/tmp/tmpmyqvl35y/nlu'
Training log at 80 epochs: (training continues on into some augmentation rounds)
Epochs: 100%|██████████| 80/80 [53:39<00:00, 40.25s/it, t_loss=3.62, i_loss=0.0696, i_acc=0.996]
2021-05-07 09:56:03 INFO rasa.nlu.model - Finished training component.
2021-05-07 09:56:03 INFO rasa.nlu.model - Starting to train component ResponseSelector
2021-05-07 09:56:03 INFO rasa.nlu.selectors.response_selector - Retrieval intent parameter was left to its default value. This response selector will be trained on training examples combining all retrieval intents.
2021-05-07 09:56:03 DEBUG rasa.nlu.classifiers.diet_classifier - Cannot train 'ResponseSelector'. No data was provided. Skipping training of the classifier.
2021-05-07 09:56:03 INFO rasa.nlu.model - Finished training component.
2021-05-07 09:56:03 INFO rasa.nlu.model - Starting to train component FallbackClassifier
2021-05-07 09:56:03 INFO rasa.nlu.model - Finished training component.
2021-05-07 09:56:04 INFO rasa.nlu.model - Successfully saved model into '/tmp/tmple8mt6he/nlu'
2021-05-07 09:56:04 DEBUG rasa.utils.tensorflow.models - Loading the model from /tmp/tmple8mt6he/nlu/component_7_DIETClassifier.tf_model with finetune_mode=False...
2021-05-07 09:56:05 DEBUG rasa.nlu.classifiers.diet_classifier - Following metrics will be logged during training:
2021-05-07 09:56:05 DEBUG rasa.nlu.classifiers.diet_classifier - t_loss (total loss)
2021-05-07 09:56:05 DEBUG rasa.nlu.classifiers.diet_classifier - i_acc (intent acc)
2021-05-07 09:56:05 DEBUG rasa.nlu.classifiers.diet_classifier - i_loss (intent loss)
2021-05-07 09:56:07 DEBUG rasa.core.agent - Requesting model from server http://rasa-x:5002/api/projects/default/models/tags/production...
2021-05-07 09:56:07 DEBUG rasa.core.agent - Model server returned 204 status code, indicating that no new model is available. Current fingerprint: fea546c4fcffabc8716d3ae89f4619ca
2021-05-07 09:56:07 DEBUG rasa.core.agent - No new model found at URL http://rasa-x:5002/api/projects/default/models/tags/production
2021-05-07 09:56:10 DEBUG rasa.utils.tensorflow.models - Finished loading the model.
2021-05-07 09:56:10 DEBUG rasa.nlu.classifiers.diet_classifier - Failed to load model for 'ResponseSelector'. Maybe you did not provide enough training data and no model was trained or the path '/tmp/tmple8mt6he/nlu' doesn't exist?
2021-05-07 09:56:10 DEBUG rasa.telemetry - Could not read telemetry settings from configuration file: Configuration 'metrics' key not found.
2021-05-07 09:56:10 WARNING rasa.utils.common - Failed to write global config. Error: [Errno 13] Permission denied: '/.config/rasa'. Skipping.
/opt/venv/lib/python3.8/site-packages/rasa/core/policies/form_policy.py:51: FutureWarning: 'FormPolicy' is deprecated and will be removed in in the future. It is recommended to use the 'RulePolicy' instead. (will be removed in 3.0.0)
rasa.shared.utils.io.raise_deprecation_warning(
/opt/venv/lib/python3.8/site-packages/rasa/shared/utils/io.py:97: UserWarning: It is not recommended to use the 'RulePolicy' with other policies which implement rule-like behavior. It is highly recommended to migrate all deprecated policies to use the 'RulePolicy'. Note that the 'RulePolicy' will supersede the predictions of the deprecated policies if the confidence levels of the predictions are equal. More info at https://rasa.com/docs/rasa/migration-guide
2021-05-07 09:56:10 DEBUG rasa.core.nlg.generator - Instantiated NLG to 'TemplatedNaturalLanguageGenerator'.
2021-05-07 09:56:10 DEBUG rasa.shared.core.generator - Number of augmentation rounds is 3
2021-05-07 09:56:10 DEBUG rasa.shared.core.generator - Starting data generation round 0 ... (with 1 trackers)
Processed story blocks: 100%|██████████| 661/661 [00:01<00:00, 586.09it/s, # trackers=1]
2021-05-07 09:56:11 DEBUG rasa.shared.core.generator - Finished phase (674 training samples found).
2021-05-07 09:56:11 DEBUG rasa.shared.core.generator - Data generation rounds finished.
2021-05-07 09:56:11 DEBUG rasa.shared.core.generator - Found 0 unused checkpoints
2021-05-07 09:56:11 DEBUG rasa.shared.core.generator - Starting augmentation round 0 ... (with 50 trackers)
Processed story blocks: 15%|█▍ | 98/661 [00:05<00:59, 9.53it/s, # trackers=50]
2021-05-07 09:56:17 DEBUG rasa.core.agent - Requesting model from server http://rasa-x:5002/api/projects/default/models/tags/production...
After that, it goes on to train the TED policy. Somewhere below 50%, the Rasa X UI reports that training has failed. When I check the worker log, though, training still continues until it finishes, but no model is saved afterwards.
I used the docker-compose manual method. My setup:
Rasa X: 0.39.0
Rasa Version: 2.5
Does this have something to do with my training data being JSON? But then why is training successful at 70 epochs and not at 80? It’s weird.
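One thing I can measure from the two logs is how long the DIET step ran in each case. Below is a rough sketch for pulling that out of the progress lines. The regex and the helper name `diet_elapsed_seconds` are my own and not part of Rasa, and it assumes the elapsed field of the progress bar is in MM:SS form, as it is in the logs above. I don’t know whether the extra time has anything to do with the failure.

```python
import re

# Matches the elapsed-time field of a tqdm-style progress line,
# e.g. the "46:41" in "[46:41<00:00, 40.03s/it".
# Assumption: elapsed time is MM:SS (runs under one hour).
ELAPSED_RE = re.compile(r"\[(\d+):(\d+)<")


def diet_elapsed_seconds(line: str) -> int:
    """Return the elapsed time of a progress line in seconds."""
    match = ELAPSED_RE.search(line)
    if match is None:
        raise ValueError("no progress bar found in line")
    minutes, seconds = int(match.group(1)), int(match.group(2))
    return minutes * 60 + seconds


# Progress lines copied from the two training logs above.
run_70 = "Epochs: 100%|██████████| 70/70 [46:41<00:00, 40.03s/it, t_loss=3.49, i_loss=0.0592, i_acc=0.998]"
run_80 = "Epochs: 100%|██████████| 80/80 [53:39<00:00, 40.25s/it, t_loss=3.62, i_loss=0.0696, i_acc=0.996]"

t70 = diet_elapsed_seconds(run_70)
t80 = diet_elapsed_seconds(run_80)
print(f"70 epochs: {t70} s, 80 epochs: {t80} s, difference: {t80 - t70} s")
```

So the 80-epoch run spends roughly seven more minutes in the DIET step (about 53.5 minutes instead of about 46.5) before the Core training starts.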