Rasa X training fails at a higher epoch count with DIETClassifier

I have about 3000 training examples in JSON format, and I train using Rasa X. Training completes when I set DIETClassifier’s epochs to 70. But when I set it to 80, the Rasa X UI says training has failed, even though I don’t see any errors in the logs of my worker and production containers.

Training log at 70 epochs: (training ends here and UI says successful)

Epochs: 100%|██████████| 70/70 [46:41<00:00, 40.03s/it, t_loss=3.49, i_loss=0.0592, i_acc=0.998]
2021-05-07 08:59:50 INFO     rasa.nlu.model  - Finished training component.
2021-05-07 08:59:50 INFO     rasa.nlu.model  - Starting to train component ResponseSelector
2021-05-07 08:59:50 INFO     rasa.nlu.selectors.response_selector  - Retrieval intent parameter was left to its default value. This response selector will be trained on training examples combining all retrieval intents.
2021-05-07 08:59:50 DEBUG    rasa.nlu.classifiers.diet_classifier  - Cannot train 'ResponseSelector'. No data was provided. Skipping training of the classifier.
2021-05-07 08:59:50 INFO     rasa.nlu.model  - Finished training component.
2021-05-07 08:59:50 INFO     rasa.nlu.model  - Starting to train component FallbackClassifier
2021-05-07 08:59:50 INFO     rasa.nlu.model  - Finished training component.
2021-05-07 08:59:51 INFO     rasa.nlu.model  - Successfully saved model into '/tmp/tmpmyqvl35y/nlu'

Training log at 80 epochs: (training continues into some augmentation rounds)

Epochs: 100%|██████████| 80/80 [53:39<00:00, 40.25s/it, t_loss=3.62, i_loss=0.0696, i_acc=0.996]
2021-05-07 09:56:03 INFO     rasa.nlu.model  - Finished training component.
2021-05-07 09:56:03 INFO     rasa.nlu.model  - Starting to train component ResponseSelector
2021-05-07 09:56:03 INFO     rasa.nlu.selectors.response_selector  - Retrieval intent parameter was left to its default value. This response selector will be trained on training examples combining all retrieval intents.
2021-05-07 09:56:03 DEBUG    rasa.nlu.classifiers.diet_classifier  - Cannot train 'ResponseSelector'. No data was provided. Skipping training of the classifier.
2021-05-07 09:56:03 INFO     rasa.nlu.model  - Finished training component.
2021-05-07 09:56:03 INFO     rasa.nlu.model  - Starting to train component FallbackClassifier
2021-05-07 09:56:03 INFO     rasa.nlu.model  - Finished training component.
2021-05-07 09:56:04 INFO     rasa.nlu.model  - Successfully saved model into '/tmp/tmple8mt6he/nlu'
2021-05-07 09:56:04 DEBUG    rasa.utils.tensorflow.models  - Loading the model from /tmp/tmple8mt6he/nlu/component_7_DIETClassifier.tf_model with finetune_mode=False...
2021-05-07 09:56:05 DEBUG    rasa.nlu.classifiers.diet_classifier  - Following metrics will be logged during training: 
2021-05-07 09:56:05 DEBUG    rasa.nlu.classifiers.diet_classifier  -   t_loss (total loss)
2021-05-07 09:56:05 DEBUG    rasa.nlu.classifiers.diet_classifier  -   i_acc (intent acc)
2021-05-07 09:56:05 DEBUG    rasa.nlu.classifiers.diet_classifier  -   i_loss (intent loss)
2021-05-07 09:56:07 DEBUG    rasa.core.agent  - Requesting model from server http://rasa-x:5002/api/projects/default/models/tags/production...
2021-05-07 09:56:07 DEBUG    rasa.core.agent  - Model server returned 204 status code, indicating that no new model is available. Current fingerprint: fea546c4fcffabc8716d3ae89f4619ca
2021-05-07 09:56:07 DEBUG    rasa.core.agent  - No new model found at URL http://rasa-x:5002/api/projects/default/models/tags/production
2021-05-07 09:56:10 DEBUG    rasa.utils.tensorflow.models  - Finished loading the model.
2021-05-07 09:56:10 DEBUG    rasa.nlu.classifiers.diet_classifier  - Failed to load model for 'ResponseSelector'. Maybe you did not provide enough training data and no model was trained or the path '/tmp/tmple8mt6he/nlu' doesn't exist?
2021-05-07 09:56:10 DEBUG    rasa.telemetry  - Could not read telemetry settings from configuration file: Configuration 'metrics' key not found.
2021-05-07 09:56:10 WARNING  rasa.utils.common  - Failed to write global config. Error: [Errno 13] Permission denied: '/.config/rasa'. Skipping.
/opt/venv/lib/python3.8/site-packages/rasa/core/policies/form_policy.py:51: FutureWarning: 'FormPolicy' is deprecated and will be removed in in the future. It is recommended to use the 'RulePolicy' instead. (will be removed in 3.0.0)
  rasa.shared.utils.io.raise_deprecation_warning(
/opt/venv/lib/python3.8/site-packages/rasa/shared/utils/io.py:97: UserWarning: It is not recommended to use the 'RulePolicy' with other policies which implement rule-like behavior. It is highly recommended to migrate all deprecated policies to use the 'RulePolicy'. Note that the 'RulePolicy' will supersede the predictions of the deprecated policies if the confidence levels of the predictions are equal.
  More info at https://rasa.com/docs/rasa/migration-guide
2021-05-07 09:56:10 DEBUG    rasa.core.nlg.generator  - Instantiated NLG to 'TemplatedNaturalLanguageGenerator'.
2021-05-07 09:56:10 DEBUG    rasa.shared.core.generator  - Number of augmentation rounds is 3
2021-05-07 09:56:10 DEBUG    rasa.shared.core.generator  - Starting data generation round 0 ... (with 1 trackers)
Processed story blocks: 100%|██████████| 661/661 [00:01<00:00, 586.09it/s, # trackers=1] 
2021-05-07 09:56:11 DEBUG    rasa.shared.core.generator  - Finished phase (674 training samples found).
2021-05-07 09:56:11 DEBUG    rasa.shared.core.generator  - Data generation rounds finished.
2021-05-07 09:56:11 DEBUG    rasa.shared.core.generator  - Found 0 unused checkpoints
2021-05-07 09:56:11 DEBUG    rasa.shared.core.generator  - Starting augmentation round 0 ... (with 50 trackers)
Processed story blocks:  15%|█▍        | 98/661 [00:05<00:59,  9.53it/s, # trackers=50]
2021-05-07 09:56:17 DEBUG    rasa.core.agent  - Requesting model from server http://rasa-x:5002/api/projects/default/models/tags/production...

After that, it goes on to train the TED policy. Somewhere below 50%, the Rasa X UI reports that training has failed. When I check the worker log, though, training continues until it finishes, but no model is saved afterwards.

I deployed with the manual docker-compose method. My setup: Rasa X 0.39.0, Rasa 2.5.

Does this have something to do with my training data being JSON? But then why is training successful at 70 epochs and not at 80? It’s weird.

Have you checked that it is not caused by a request timeout? The default timeout is 1 hour.

Hi @Anne.van.der.Bom, thanks for the reply. Are you referring to the SANIC_RESPONSE_TIMEOUT environment variable? It’s already set to 43200.

I’m just working from memory here…

Where did you set SANIC_RESPONSE_TIMEOUT: in your Rasa X instance or in your Rasa worker instance? You should set it in the latter.

Having said this, I’m not sure that is the right method for increasing the response timeout. Rasa has a command line option for setting it: --response-timeout.

Afaik there is or was a bug (don’t know if it has been fixed yet) where the --response-timeout arg is ignored if you start the Rasa Stack instance with the ‘rasa x’ command; it is only honored when the instance is started with ‘rasa run’.

What caught my attention is that the duration of your training seems to be very close to 1 hour, and that the increase in epochs takes it over the 1 hour limit.
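To make that concrete, here is a rough back-of-the-envelope check in Python. It uses only the two DIETClassifier durations from the progress bars in the original post. The 3600 second limit is the 1 hour default mentioned above, and treating the start of DIET training as the start of the request is a simplifying assumption, so take the numbers as approximate.

```python
# DIETClassifier durations (mm:ss) copied from the two progress bars in the logs.
DIET_DURATIONS = {70: "46:41", 80: "53:39"}

# Assumption: the 1 hour default timeout applies to the whole training request.
TIMEOUT_S = 3600


def to_seconds(mm_ss: str) -> int:
    """Convert a 'mm:ss' string into seconds."""
    minutes, seconds = mm_ss.split(":")
    return int(minutes) * 60 + int(seconds)


def remaining_budget(epochs: int) -> int:
    """Seconds left before the assumed timeout once DIET has finished."""
    return TIMEOUT_S - to_seconds(DIET_DURATIONS[epochs])


for epochs in sorted(DIET_DURATIONS):
    diet_s = to_seconds(DIET_DURATIONS[epochs])
    print(f"{epochs} epochs: DIET took {diet_s} s, "
          f"{remaining_budget(epochs)} s left before the timeout")
```

If the assumption roughly holds, everything after DIET (mostly the Core/TED training) has about 799 s at 70 epochs but only about 381 s at 80 epochs. A Core training time somewhere between roughly 6 and 13 minutes would then explain why one run passes and the other does not.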

Edit to add: you could try initiating a training request directly on the train endpoint of the Rasa HTTP API. If it runs beyond the 1 hour mark, then the timeout is not the problem.
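A minimal sketch of such a direct request in Python, standard library only. It assumes the worker is reachable at a URL like http://localhost:5005 with its HTTP API enabled, that POST /model/train accepts a YAML body (which is how I understand the Rasa 2.x HTTP API), and that the token query parameter matches the auth token the server was started with. The host, file name and token are placeholders, not values from this thread.

```python
import time
import urllib.parse
import urllib.request


def build_train_request(base_url, training_yaml, token=""):
    """Build (but do not send) a POST request to the assumed /model/train endpoint."""
    url = base_url.rstrip("/") + "/model/train"
    if token:
        url += "?" + urllib.parse.urlencode({"token": token})
    return urllib.request.Request(
        url,
        data=training_yaml.encode("utf-8"),
        headers={"Content-Type": "application/yaml"},
        method="POST",
    )


def train_and_time(base_url, training_yaml, token="", client_timeout_s=3 * 3600):
    """Send the request and return (HTTP status, elapsed seconds).

    The client-side socket timeout is set well above 1 hour so that the
    client is not the side that gives up first.
    """
    request = build_train_request(base_url, training_yaml, token)
    started = time.monotonic()
    with urllib.request.urlopen(request, timeout=client_timeout_s) as response:
        response.read()  # the body is expected to be the trained model archive
        return response.status, time.monotonic() - started


# Example call (placeholders; uncomment to run against a real server):
# with open("training_payload.yml", encoding="utf-8") as f:
#     status, elapsed = train_and_time("http://localhost:5005", f.read(), token="my-token")
# print(status, round(elapsed))
```

If the call returns successfully after more than 3600 seconds, the 1 hour timeout is probably not what kills the training. If the connection drops at around the 1 hour mark, urlopen raises an exception instead, which points back at the timeout.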

I set SANIC_RESPONSE_TIMEOUT in the docker-compose file, so I think it applies to both the rasa-x and rasa-worker instances.

Looking at the --response-timeout option, I see that it defaults to 1 hour. This argument is not included in the command for rasa-x in the docker-compose file. I’ll try it first in local mode, then in the docker-compose deployment.

Then I’ll try the train endpoint api. Thank you for your help. I’ll let you know how it goes. Have a nice day!

Note that you should set the timeout in the worker instance, not in Rasa X (it is the worker that hangs up on Rasa X, not Rasa X that gets tired of waiting, if you understand me). I concede that it is confusing, but with the ‘rasa x’ command you actually start a Rasa Stack instance, not the Rasa X UI.