Rasa X is not generating a new model when the training process takes more than 1 hour

Rasa X version: 0.32.2
Rasa Core version: 1.10.12
Environment: Kubernetes

The Rasa X UI does not generate a new model when training takes more than 1 hour in a Kubernetes environment. Rasa X calls the Rasa worker node to train the model and the process runs fine, but we don’t see the model in the Rasa X UI or in the models folder, and we don’t see any errors in the log (attached: rasaworker_log.txt).

We increased SANIC_RESPONSE_TIMEOUT to 2 hours and tested again, but it still didn’t generate the model. The Rasaworker service log (attached: rasaworker_log2.txt) shows “Starting Rasa X in production mode” after 88% of the training had completed, and training continued after this message.

Does this mean the Rasa X server restarted while the Rasaworker node was generating the model? What happens if the Rasa X server restarts while the Rasaworker node is still training, which can easily happen in a Kubernetes environment? Will the model still be copied from the Rasaworker service to the Rasa X service once the Rasaworker finishes?

How is the model copied from the Rasaworker service to the Rasa X service? I assume it is copied through an API call from Rasa X to Rasaworker, but I couldn’t find any documentation on this process.

9/17/2020 5:01:31 PM Epochs: 87%|████████▋ | 87/100 [52:24<06:47, 31.31s/it, t_loss=1.030, i_loss=0.021, entity_loss=0.023, i_acc=1.000, entity_f1=0.996]
Epochs: 88%|████████▊ | 88/100 [52:24<06:17, 31.43s/it, t_loss=1.030, i_loss=0.021, entity_loss=0.023, i_acc=1.000, entity_f1=0.996]
2020-09-17 21:01:31 DEBUG rasa.server - Traceback (most recent call last):
9/17/2020 5:01:31 PM File "/opt/venv/lib/python3.7/site-packages/rasa/server.py", line 810, in train
9/17/2020 5:01:31 PM None, functools.partial(train_model, **info)
9/17/2020 5:01:31 PM concurrent.futures._base.CancelledError
9/17/2020 5:01:31 PM
9/17/2020 5:01:31 PM Starting Rasa X in production mode... :rocket:
9/17/2020 5:01:58 PM 2020-09-17 21:01:58 DEBUG rasa.core.agent - No new model found at URL http://aloha-combined-botrasaxserver:80/api/projects/default/models/tags/production
9/17/2020 5:02:08 PM Epochs: 88%|████████▊ | 88/100 [52:56<06:17, 31.43s/it, t_loss=1.058, i_loss=0.015, entity_loss=0.013, i_acc=1.000, entity_f1=0.996]
Epochs: 89%|████████▉ | 89/100 [52:56<05:47, 31.55s/it, t_loss=1.058, i_loss=0.015, entity_loss=0.013, i_acc=1.000, entity_f1=0.996]
2020-09-17 21:02:08 DEBUG rasa.core.agent - Requesting model from server http://aloha-combined-botrasaxserver:80/api/projects/default/models/tags/production

When we tested with a smaller NLU data file, training took just 15 minutes and everything worked fine. We could see the new model in the Rasa X UI and in the models folder.

We have configured the following environment variables:

SANIC_RESPONSE_TIMEOUT=7200

SANIC_REQUEST_MAX_SIZE_IN_BYTES=800000000

SANIC_ACCESS_CONTROL_MAX_AGE=1800
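
As a quick sanity check on the values above, here is a minimal Python sketch that converts them into more readable units. It assumes the two time values are in seconds (consistent with 7200 being the "2 hours" mentioned earlier) and that the size is in bytes, as the variable name suggests. The helper name `describe_settings` is hypothetical and not part of Rasa X.

```python
def describe_settings(env):
    """Convert the raw settings into human-readable units.

    `env` is any mapping of variable name to string value, for example
    os.environ inside the container. Assumes seconds and bytes as units.
    """
    timeout_s = int(env["SANIC_RESPONSE_TIMEOUT"])
    max_bytes = int(env["SANIC_REQUEST_MAX_SIZE_IN_BYTES"])
    max_age_s = int(env["SANIC_ACCESS_CONTROL_MAX_AGE"])
    return {
        "response_timeout_hours": timeout_s / 3600,
        "request_max_size_mb": max_bytes / 1_000_000,
        "access_control_max_age_minutes": max_age_s / 60,
    }


# Values copied from the list above.
settings = describe_settings({
    "SANIC_RESPONSE_TIMEOUT": "7200",
    "SANIC_REQUEST_MAX_SIZE_IN_BYTES": "800000000",
    "SANIC_ACCESS_CONTROL_MAX_AGE": "1800",
})
print(settings)
```

Running the same function against `os.environ` in each pod would show whether the variables are actually set in that container.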

Are there any other environment variables that we should configure? Can someone please advise?

Thanks

Hari

Re. where models are loaded from: models are always persisted to the same model storage and are not copied over. Rasa X can tag one as active/production, in which case rasa-x, rasa-worker and rasa-prod will use that model as the reference for the active model (the files still sit in the same place on the model server). In the first rasa-worker log, it looks like the whole training process completes, based on these lines:

9/17/2020 1:48:23 PM 2020-09-17 17:48:23 DEBUG rasa.nlu.classifiers.diet_classifier - Cannot train 'ResponseSelector'. No data was provided. Skipping training of the classifier.
9/17/2020 1:48:23 PM 2020-09-17 17:48:23 INFO rasa.nlu.model - Finished training component.
9/17/2020 1:48:24 PM 2020-09-17 17:48:24 INFO rasa.nlu.model - Successfully saved model into '/tmp/tmp5z9zfu42/nlu'

That makes me think it’s not a timeout issue.

In the second case, it does look like something went wrong:

9/17/2020 5:01:31 PM File "/opt/venv/lib/python3.7/site-packages/rasa/server.py", line 810, in train
9/17/2020 5:01:31 PM None, functools.partial(train_model, **info)
9/17/2020 5:01:31 PM concurrent.futures._base.CancelledError
9/17/2020 5:01:31 PM
9/17/2020 5:01:31 PM Starting Rasa X in production mode... :rocket:
9/17/2020 5:01:31 PM Training Core model...
9/17/2020 5:01:31 PM Core model training completed.
9/17/2020 5:01:31 PM Training NLU model...
9/17/2020 5:01:31 PM [2020-09-17 21:01:31 +0000] [1] [ERROR] Exception occurred while handling uri: 'http://aloha-combined-botrasaworker:5005/model/train?token=7EAOpaQOX%2BDKy1WcOxq0NA%3D%3D'
9/17/2020 5:01:31 PM NoneType: None

This looks like something went wrong with the model server itself. Is this behaviour consistent, or did it only happen once?
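
One possible reading of the `CancelledError` in that traceback, and of the epochs continuing afterwards, is that the coroutine waiting on the training was cancelled while the training itself kept running in a background thread. The sketch below reproduces that pattern with plain `asyncio`. It is an assumption about the mechanism, not a confirmed description of `rasa/server.py`, and `slow_training` and `handle_train_request` are hypothetical stand-ins rather than Rasa functions.

```python
import asyncio
import threading
import time

# Set by the worker thread once the (simulated) training has finished.
finished = threading.Event()


def slow_training():
    """Stand-in for a long, blocking training job."""
    time.sleep(0.3)
    finished.set()
    return "model.tar.gz"


async def handle_train_request():
    """Stand-in for a request handler that awaits training in an executor."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, slow_training)


async def main():
    task = asyncio.ensure_future(handle_train_request())
    await asyncio.sleep(0.1)
    # Simulate the request being dropped, e.g. a caller that went away.
    task.cancel()
    try:
        await task
        outcome = "completed"
    except asyncio.CancelledError:
        outcome = "cancelled"
    # The thread cannot be interrupted, so the job still runs to the end.
    loop = asyncio.get_running_loop()
    await loop.run_in_executor(None, finished.wait, 2.0)
    return outcome, finished.is_set()


outcome, training_finished = asyncio.run(main())
print(outcome, training_finished)
```

On Python 3.7, as in the logs above, `asyncio.CancelledError` is the same class as `concurrent.futures.CancelledError`. If something like this is happening, the training would finish but its result would have no request left to be returned to, which would match a model never showing up in Rasa X.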