Rasa X version: 0.32.2
Rasa Core version: 1.10.12
Environment: Kubernetes
Rasa X does not produce a new model when training takes more than 1 hour in our Kubernetes environment. Rasa X calls the Rasa worker node to train the model, and the training process appears to run fine, but the model never shows up in the Rasa X UI or in the models folder, and we don't see any errors in the attached log: rasaworker_log.txt (670.4 KB)
We increased SANIC_RESPONSE_TIMEOUT to 2 hours and tested again, but the model still was not generated. The Rasa worker service log, rasaworker_log2.txt (9.5 KB), shows "Starting Rasa X in production mode" after 88% of the training epochs had completed, and training continued after this message.
Does this mean the Rasa X server restarted while the Rasa worker node was generating the model? What happens if the Rasa X server restarts while the Rasa worker is still generating a model, which is common in a Kubernetes environment? Will the model file still be copied from the Rasa worker service to the Rasa X service once the Rasa worker completes training?
How is the model copied from the Rasa worker service to the Rasa X service? I assume it is transferred through an API call from Rasa X to the Rasa worker, but I couldn't find any documentation on this process.
9/17/2020 5:01:31 PM Epochs: 87%|████████▋ | 87/100 [52:24<06:47, 31.31s/it, t_loss=1.030, i_loss=0.021, entity_loss=0.023, i_acc=1.000, entity_f1=0.996] Epochs: 88%|████████▊ | 88/100 [52:24<06:17, 31.43s/it, t_loss=1.030, i_loss=0.021, entity_loss=0.023, i_acc=1.000, entity_f1=0.996]2020-09-17 21:01:31 DEBUG rasa.server - Traceback (most recent call last):
9/17/2020 5:01:31 PM File "/opt/venv/lib/python3.7/site-packages/rasa/server.py", line 810, in train
9/17/2020 5:01:31 PM None, functools.partial(train_model, **info)
9/17/2020 5:01:31 PM concurrent.futures._base.CancelledError
9/17/2020 5:01:31 PM
9/17/2020 5:01:31 PM Starting Rasa X in production mode…
9/17/2020 5:01:58 PM 2020-09-17 21:01:58 DEBUG rasa.core.agent - No new model found at URL http://aloha-combined-botrasaxserver:80/api/projects/default/models/tags/production
9/17/2020 5:02:08 PM Epochs: 88%|████████▊ | 88/100 [52:56<06:17, 31.43s/it, t_loss=1.058, i_loss=0.015, entity_loss=0.013, i_acc=1.000, entity_f1=0.996] Epochs: 89%|████████▉ | 89/100 [52:56<05:47, 31.55s/it, t_loss=1.058, i_loss=0.015, entity_loss=0.013, i_acc=1.000, entity_f1=0.996]2020-09-17 21:02:08 DEBUG rasa.core.agent - Requesting model from server http://aloha-combined-botrasaxserver:80/api/projects/default/models/tags/production
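To check the full worker log for the same pattern, a minimal sketch like the following could be used. It only scans for the two marker strings copied from the excerpt above and records the last epoch percentage seen before each one. The function and constant names are hypothetical, and the script assumes the log has been saved locally as plain text.

```python
import re

# Marker strings are copied from the log excerpt; all names here are hypothetical.
RESTART_MARKER = "Starting Rasa X in production mode"
CANCEL_MARKER = "concurrent.futures._base.CancelledError"
EPOCH_PATTERN = re.compile(r"Epochs:\s+(\d+)%")


def summarize_log(lines):
    """Return (event, last_epoch_percent) for each cancellation or restart marker."""
    events = []
    last_percent = None
    for line in lines:
        # One log line can carry several progress updates; keep the last one.
        matches = EPOCH_PATTERN.findall(line)
        if matches:
            last_percent = int(matches[-1])
        if CANCEL_MARKER in line:
            events.append(("cancelled", last_percent))
        if RESTART_MARKER in line:
            events.append(("restart", last_percent))
    return events


def summarize_file(path):
    """Convenience wrapper that reads a saved log file, e.g. rasaworker_log2.txt."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return summarize_log(handle)
```

Calling `summarize_file("rasaworker_log2.txt")` would list every cancellation and restart together with the training progress at that point, which makes it easier to see whether the restart always coincides with the CancelledError.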
When we tested with a smaller NLU data file, training took just 15 minutes and everything worked fine: the new model appeared in the Rasa X UI and in the models folder.
We have configured the following environment variables:
SANIC_RESPONSE_TIMEOUT=7200
SANIC_REQUEST_MAX_SIZE_IN_BYTES=800000000
SANIC_ACCESS_CONTROL_MAX_AGE=1800
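To rule out a deployment problem, a small check like the one below could be run inside each container to confirm that the values above are actually visible to the process. This is only a sketch: the helper name and the `EXPECTED` dictionary are hypothetical, and it says nothing about which service actually reads each variable.

```python
import os

# Expected values are the ones listed above; the names in this script are hypothetical.
EXPECTED = {
    "SANIC_RESPONSE_TIMEOUT": 7200,
    "SANIC_REQUEST_MAX_SIZE_IN_BYTES": 800000000,
    "SANIC_ACCESS_CONTROL_MAX_AGE": 1800,
}


def check_env(environ=os.environ):
    """Compare the environment against EXPECTED and return a list of problems."""
    problems = []
    for name, expected in EXPECTED.items():
        raw = environ.get(name)
        if raw is None:
            problems.append(f"{name} is not set")
            continue
        try:
            value = int(raw.strip())
        except ValueError:
            problems.append(f"{name} is not an integer: {raw!r}")
            continue
        if value != expected:
            problems.append(f"{name} is {value}, expected {expected}")
    return problems


if __name__ == "__main__":
    # Print each problem, or a single confirmation line if there are none.
    for message in check_env() or ["all three variables match"]:
        print(message)
```

An empty result from `check_env()` means all three variables are set to the intended values in that container.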
Are there any other environment variables that we should configure? Can someone please advise?
Thanks
Hari