Performance issue with rasa >=2.5.0

We were using Rasa 2.3.0 and faced no issues. Since migrating to Rasa 2.5.0 we are facing performance issues both while training the bot and while running it.

We use GitHub Actions for CI/CD to train our bot, and we are facing the following issue:

Epochs: 93%|█████████▎| 93/100 [1:17:52<09:53, 84.76s/it, t_loss=3.17, i_acc=0.993, e_f1=1]
/home/runner/work/_temp/4d06fe35-724b-4801-9dfb-d678b029642c.sh: line 1: 2336 Killed rasa train --augmentation 0
Error: Process completed with exit code 137.

As you can see, the training process is killed due to a memory issue (exit code 137) just as it is almost done.

We are facing something similar in our deployment. We use AWS ECS, and in a single task definition we run two bots (different languages, but otherwise identical). We used to be able to run these bots with 2 GB of memory; now, unless I increase the memory to 3 GB, the task cannot run both bots and keeps restarting due to memory issues.

There are two major changes that could have caused this: one was the upgrade from 2.3.0 to 2.5.0, the other was the config changes made to support Rasa 2.5.0.

The config for 2.3 was the following:

language: es
pipeline:
  - name: WhitespaceTokenizer
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: char_wb
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 100
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 100
  - name: FallbackClassifier
    threshold: 0.3
    ambiguity_threshold: 0.1
policies:
  - name: MemoizationPolicy
  - name: TEDPolicy
    max_history: 5
    epochs: 551
  - name: RulePolicy

And for 2.5 it was the following:

language: es
pipeline:
  - name: WhitespaceTokenizer
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: char_wb
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 100
    constrain_similarities: true
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 100
    constrain_similarities: true
  - name: FallbackClassifier
    threshold: 0.3
    ambiguity_threshold: 0.1
policies:
  - name: MemoizationPolicy
  - name: TEDPolicy
    max_history: 5
    epochs: 100
    constrain_similarities: true
  - name: RulePolicy

I have also noticed that the model size has grown from about 120 MB to roughly 150 MB. I have no clue how to proceed with this; any hint or guidance is much appreciated.

Today I tried to run the training on my local machine, using the following command:

docker run -v $(pwd):/app rasa/rasa:2.6.1-full train --augmentation 0

The training process exited prematurely. I tried many times, and I had used the same system to train the model just a few weeks before, when I was still on 2.3.0.

My system configuration is the following:

[Screenshot of system configuration, 2021-05-18]

This was happening with 2 GB of swap, which was the default on the system. When I increased it to 8 GB I was able to train the model, but training then failed due to a validation issue, which I have raised here.
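In case it helps anyone hitting the same thing, increasing swap is just the standard Linux swapfile procedure. This is a sketch assuming a Linux host with root access; the path and size below are placeholders:

# create and enable an 8 GB swapfile (path and size are examples)
sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
# confirm the new swap is active
swapon --show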

Any suggestion as to why this is happening after the upgrade, or is this the expected behaviour now?

If this is expected, then we will have to update our CI pipeline, which uses the free GitHub Actions runners (I think they provide 1 CPU and 3.5 GB of RAM, not too sure though). Please let me know either way.

Hi @madanmeena, this doesn't seem too far out of expectations. We recommend providing at least 4 CPUs and 4 GiB of memory for the rasa-worker pod in our Helm chart that does the model training, but it really depends on your pipeline and the size of your data.
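As a rough illustration, that recommendation corresponds to a standard Kubernetes resources stanza like the one below. This is a generic sketch only; the exact key path for it in the Helm chart values may differ:

# generic Kubernetes resource requests/limits for the training workload;
# where this block lives in the Helm chart values is chart-specific
resources:
  requests:
    cpu: "4"
    memory: 4Gi
  limits:
    cpu: "4"
    memory: 4Gi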

Setting constrain_similarities to true is also not a requirement but a recommendation, so if you want to keep the model at the same size, you can revert it.
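For example, dropping the flag again from the components where it was added (a sketch of your 2.5 config with only that change) would look like:

# same 2.5 pipeline/policies, with constrain_similarities removed
# (it defaults to false in Rasa 2.x, so this matches the old behaviour)
- name: DIETClassifier
  epochs: 100
- name: ResponseSelector
  epochs: 100
- name: TEDPolicy
  max_history: 5
  epochs: 100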

It's just that we have been training the model through a GitHub Actions CI pipeline for the last year and never faced any issue. What you say about the Rasa X configuration is true, but most of the time we do not use it for training, as the Git-based CI workflow seems more appropriate to us: Rasa X can only be configured with one branch at a time, and we need to train the model on multiple branches, so it's easier to use CI for this.

We have been able to run our CI pipeline free of cost until now, so I just wanted to double-check before using a dedicated machine for training, as that would add to the cost.

Regarding size, my main concern is performance, so if I am not getting any benefit from reducing the size then I can leave it; but if the size is related to the memory issue we are facing, then I can try without it.

We are facing issues not only during model training but also while running the bot.

Does training work with your old pipeline on the newer version? Given your constraints, I would first evaluate whether the updated config really gives you that much of a performance boost. If the results aren't drastically better, then stay with your old pipeline, which can run with fewer resources.
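For example, you could keep both configs in the repo and train each one on the newer Rasa version to compare. The file names below are placeholders, and it's worth checking rasa train --help / rasa test nlu --help for the exact flags on your version:

# train with the old 2.3-style pipeline on the current Rasa version
rasa train --config config_old.yml --augmentation 0 --out models/old_pipeline
# train with the updated 2.5 pipeline for comparison
rasa train --config config_new.yml --augmentation 0 --out models/new_pipeline
# optionally cross-validate the NLU part of each config to compare accuracy/F1
rasa test nlu --config config_old.yml --cross-validation
rasa test nlu --config config_new.yml --cross-validation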