Hello everyone,
I hope this hasn't been asked already, but I searched and didn't find this type of error here. I wanted to ask about some strange behavior when training a model, depending on the size of the NLU data.
Our setup is a Kubernetes cluster with 1 node (8 vCPUs, 32 GB memory) on Google Cloud. Rasa is installed via Helm chart (helm.sh/chart: rasa-x-2.3.0), using the image rasa/rasa-x:0.42.6.
We observed strange behavior when training models:
- Small NLU data samples are trained fast
- NLU data of up to ~3000 examples is trained within ~21 min
- NLU data of 6300 examples takes ~1 hr 3 min, but the model is never uploaded/visible in the Rasa X UI
When we trigger the training via "Training > Update model > Train model", it is started, and we can see it working on Google Cloud.
We also see it finish there in the graph, and also in the logs. The training progress output (condensed here) runs all the way to completion:

0%|          | 0/1580 [00:00<?, ?it/s]
...
100%|██████████| 1580/1580 [00:10<00:00, 156.32it/s]

with a good number of iterations per second throughout (~160 it/s on average).
But why is it not finishing and "uploading" the model? There is always an error shown in the UI after ~5 min, where the server responds with a 502 Bad Gateway.
This doesn't seem to stop the smaller sample sizes from being trained and uploaded, though. We know that Rasa is able to train NLU data of ~6k examples, because we used to do it on our own local machines (laptops), but we wanted to centralize the training on a Google Cloud cluster, as we want to use it frequently.
The training itself seems to complete without errors. The only warning/error messages we see are some TensorFlow messages saying no GPU is available, which is correct, since we didn't enable one for our cluster. We also tried to implement a workflow in our NodeJS application to send a training request to this Rasa X server instance on Google Cloud, but we also get a 502 error there, with the request being cut off after 1 hr and no model data returned.
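Roughly, that NodeJS workflow boils down to a single long-running HTTP request like the following (a minimal TypeScript sketch, not our exact code; it assumes the Rasa Open Source training endpoint POST /model/train is what ends up being called, and the URL, token handling, and file paths are placeholders):

```typescript
// Sketch: trigger a training run from a NodeJS service and keep the HTTP
// connection open long enough for a ~1 hr training job.
// The host, token, and file paths are placeholders for illustration only.
import axios from "axios";
import { readFileSync, createWriteStream } from "fs";

async function trainModel(): Promise<void> {
  // Training data in the YAML format expected by the /model/train endpoint.
  const trainingData = readFileSync("data/train.yml", "utf-8");

  const response = await axios.post(
    "http://rasa.example.internal:5005/model/train",
    trainingData,
    {
      headers: { "Content-Type": "application/yaml" },
      params: { token: process.env.RASA_TOKEN },
      // Client-side timeout set well above the observed ~63 min training
      // time; any proxy in between must also allow a request this long.
      timeout: 2 * 60 * 60 * 1000, // 2 hours in ms
      responseType: "stream",
      maxContentLength: Infinity,
      maxBodyLength: Infinity,
    }
  );

  // The trained model comes back as a tar.gz archive in the response body.
  const out = createWriteStream("models/latest.tar.gz");
  response.data.pipe(out);
  await new Promise<void>((resolve, reject) => {
    out.on("finish", resolve);
    out.on("error", reject);
  });
}

trainModel().catch((err) => console.error("Training request failed:", err));
```

The idea is simply to keep the client connection open longer than the ~1 hr 3 min training run, but even so the request gets cut off with the 502 after about an hour.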
What could be the reason (and solution!) for the large NLU data samples not being uploaded properly?
Big thanks in advance!