Rasa Training not completing with large NLU data

Hello guys

I hope this hasn't been asked already, but I searched and didn't find this type of error here. I wanted to ask about strange behavior when training a model, depending on the size of the NLU data.

Our setup is a Kubernetes cluster with 1 node (8 vCPUs, 32 GB memory) on Google Cloud. Rasa is installed via Helm chart (helm.sh/chart: rasa-x-2.3.0), using the image rasa/rasa-x:0.42.6.

We observed strange behavior when training models:

  • Small NLU data samples train fast
  • NLU data of up to ~3,000 examples trains within ~21 min
  • NLU data of 6,300 examples takes ~1 h 3 min, but the model is never uploaded/visible in the Rasa X UI

When we trigger the training via "Training > Update model > Train model", it starts, and we can see it working on Google Cloud.

We also see it finish there in the graph, and in the logs:

  0%|          | 0/1580 [00:00<?, ?it/s]
  1%|          | 18/1580 [00:00<00:09, 171.11it/s]
  2%|▏         | 36/1580 [00:00<00:10, 150.30it/s]
  3%|▎         | 52/1580 [00:00<00:10, 152.19it/s]
[... progress continues steadily at roughly 150-170 it/s ...]
 99%|█████████▊| 1557/1580 [00:09<00:00, 164.74it/s]
100%|█████████▉| 1576/1580 [00:10<00:00, 169.45it/s]
100%|██████████| 1580/1580 [00:10<00:00, 156.32it/s]

with a good number of iterations per second (~160 it/s on average).

But why is it not finishing and uploading the model? After ~5 minutes there is always an error visible in the UI, where the server responds with a 502 Bad Gateway:

but this doesn't stop the smaller data samples from being trained and uploaded:

We know that Rasa is able to train NLU data of ~6k examples, because we used to do it on our own local machines (laptops). But we want to centralize the training on a Google Cloud cluster, since we intend to use it frequently.

The training itself seems to finish without errors. The only warning/error messages we see are some TensorFlow messages saying no GPU is available, which is correct, since we didn't enable one for our cluster. We also tried to implement a workflow in our NodeJS application that sends a training request to this Rasa X server instance on Google Cloud, but we get a 502 error there as well, with the request being cut off after 1 h, without the model data.
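For reference, this is roughly how we build the training request on the NodeJS side. It is a minimal sketch, not our production code: the host, port, token handling, and the `POST /model/train` endpoint are assumptions based on the Rasa Open Source HTTP API, and the timeout value is the knob we have been experimenting with (any proxy in between, such as the nginx in the Rasa X chart, enforces its own limits regardless of what the client sets).

```javascript
// Sketch of a long-running training request from Node.js.
// Endpoint and token handling are assumptions based on the
// Rasa Open Source HTTP API (`POST /model/train`); adjust for
// your own deployment and auth setup.
function buildTrainRequest(host, port, token, timeoutMs) {
  // Return the request options instead of firing the request,
  // so the timeout budget is visible and tunable in one place.
  return {
    host,
    port,
    path: `/model/train?token=${encodeURIComponent(token)}`,
    method: "POST",
    headers: { "Content-Type": "application/x-yaml" },
    // Client-side socket timeout. Note: this does NOT raise any
    // reverse-proxy timeout sitting between the client and Rasa.
    timeout: timeoutMs,
  };
}

const opts = buildTrainRequest(
  "rasa.example.internal", // hypothetical internal hostname
  5005,
  "my-token",
  2 * 60 * 60 * 1000 // 2 hours in milliseconds
);
console.log(opts.timeout); // 7200000
// To actually send: require("http").request(opts, cb).end(trainingDataYaml);
```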

What could be the reason (and solution!) for the large NLU data samples not being uploaded properly?

Big thanks in advance!


@nik202 sorry to tag you, but I've seen you are a knowledgeable regular here (and you mentioned the possibility of tagging you). Do you have any ideas about this issue? Thank you

Hello @axral, well, I am not sure this is related to the training data, but as you mentioned, training takes more than 1 h 3 min (63 min) for 6.3k examples and your model is trained but not showing. A possible issue can be session expiry, as by default it is set to:

session_config:
  session_expiration_time: 60  # value in minutes, 0 means infinitely long
  carry_over_slots_to_new_session: true  # set to false to forget slots between sessions

By the time the model finishes training, the session has expired, and that may be the issue. I could be totally wrong on this, as I never got the chance to work on such an issue.

Suggestions:

  1. Try increasing the session time to more than 60 minutes
  2. Train the model locally with Rasa Open Source (2.3, which I guess you are using) and then upload it manually
  3. Clear the browser history
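Also, since the 502 shows up while training is still running, another thing worth checking is the reverse-proxy timeout in front of Rasa X, not only the conversation session. I am not sure of the exact key names for your chart version, so treat this values.yaml fragment as a hypothetical example and verify it against the chart's documented values:

```yaml
# Hypothetical override for the rasa-x-helm chart -- the key names
# below are an assumption; check your chart version's values.yaml.
nginx:
  # Allow long-running requests (e.g. a 1h+ training job) to
  # complete before the proxy gives up and returns 502/504.
  proxyReadTimeout: 7200     # seconds
  proxyConnectTimeout: 60    # seconds
```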

Also, could you please share how you installed Rasa X on Google Cloud?

I hope this will help you, and thanks for tagging me :slight_smile:

Hey nik202, thanks for your response!

Before your response, I had already tried setting the session expiration time to something like 60,000, but the model still wasn't uploaded. I will try the carry_over_slots_to_new_session parameter next and see if it makes a difference.

As for your suggestions, the second point is what we used to do for such large data sets: we trained them manually/locally on our dev machines and uploaded the models. But this is exactly what we want to move away from, as it blocks our dev machines' computing resources, we can't really keep working while it runs in the background, and it takes ages to finish. I'm not sure how clearing the browser history could help, but I will try it nonetheless.

As for the installation of Rasa X, I'm not 100% sure since I didn't install it myself, but from what I know, we set up a Kubernetes cluster and followed the installation instructions from the Rasa Helm chart installation guide on GitHub.

Thanks for the help so far. I'm curious to hear more from you or others on this issue.