We were using Rasa 2.3.0 and faced no issues. Since then we have migrated to Rasa 2.5.0 and are facing performance issues both while training the bot and while running it.
We are using a GitHub Actions CI/CD pipeline to train our bot, and we are facing the following issue:
Epochs: 93%|█████████▎| 93/100 [1:17:52<09:53, 84.76s/it, t_loss=3.17, i_acc=0.993, e_f1=1] /home/runner/work/_temp/4d06fe35-724b-4801-9dfb-d678b029642c.sh: line 1: 2336 Killed rasa train --augmentation 0
Error: Process completed with exit code 137.
As you can see, when the training process is almost done, it is killed due to a memory issue (exit code 137).
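For context, the training step in our workflow is essentially just the plain rasa train call. The snippet below is a simplified sketch of what the workflow roughly looks like; the job name, the checkout/setup steps, and the pinned versions are illustrative rather than our exact file:

name: train-bot
on: [push]
jobs:
  train:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
        with:
          python-version: 3.8
      - name: Install Rasa
        run: pip install rasa==2.5.0
      - name: Train the model
        run: rasa train --augmentation 0

Exit code 137 means the process received SIGKILL, which on a hosted runner is typically the out-of-memory killer.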
We are facing something similar in our deployment. We use AWS ECS, where a single task definition runs two bots (different languages, but otherwise identical). We were able to run these bots with 2 GB of memory; now, unless I increase the memory to 3 GB, the task cannot run both bots and keeps restarting due to memory issues.
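For reference, the relevant part of the task definition looks roughly like this. The family, container names, image placeholder, and the memory split between containers are illustrative; only the memory-related settings are shown:

{
  "family": "chatbot",
  "memory": "3072",
  "containerDefinitions": [
    {
      "name": "bot-language-1",
      "image": "<our-rasa-bot-image>",
      "memoryReservation": 1536
    },
    {
      "name": "bot-language-2",
      "image": "<our-rasa-bot-image>",
      "memoryReservation": 1536
    }
  ]
}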
There are two major changes that could have caused this: one was the upgrade from 2.3.0 to 2.5.0, the other was the config changes made to support Rasa 2.5.0.
The config for 2.3 was the following:
language: es
pipeline:
- name: WhitespaceTokenizer
- name: RegexFeaturizer
- name: LexicalSyntacticFeaturizer
- name: CountVectorsFeaturizer
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: DIETClassifier
  epochs: 100
- name: EntitySynonymMapper
- name: ResponseSelector
  epochs: 100
- name: FallbackClassifier
  threshold: 0.3
  ambiguity_threshold: 0.1
policies:
- name: MemoizationPolicy
- name: TEDPolicy
  max_history: 5
  epochs: 551
- name: RulePolicy
And for 2.5 it was the following:
language: es
pipeline:
- name: WhitespaceTokenizer
- name: RegexFeaturizer
- name: LexicalSyntacticFeaturizer
- name: CountVectorsFeaturizer
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: DIETClassifier
  epochs: 100
  constrain_similarities: true
- name: EntitySynonymMapper
- name: ResponseSelector
  epochs: 100
  constrain_similarities: true
- name: FallbackClassifier
  threshold: 0.3
  ambiguity_threshold: 0.1
policies:
- name: MemoizationPolicy
- name: TEDPolicy
  max_history: 5
  epochs: 100
  constrain_similarities: true
- name: RulePolicy
I have also noticed that the model size was previously 120 MB and is now ~150 MB.
I have no clue how to proceed with this; any hint or guidance is much appreciated.
Today I tried to do the training on my local machine, using the following command:
docker run -v $(pwd):/app rasa/rasa:2.6.1-full train --augmentation 0
The training process exited prematurely. I tried many times; I had used the same system to train the model just a few weeks earlier, when I was using 2.3.0.
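(For anyone trying to reproduce this, the memory available to the container can also be capped explicitly with Docker's standard resource flags; the 3g/6g values below are just an example, not what I actually set:

docker run --memory=3g --memory-swap=6g -v $(pwd):/app rasa/rasa:2.6.1-full train --augmentation 0
)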
My system configuration is as follows:
This was happening with 2 GB of swap, which was the default on the system. When I increased it to 8 GB, I was able to train the model, but it then failed due to a validation issue, which I have raised here.
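For reference, increasing the swap is just the standard Linux swap-file setup; the 8 GB size and the /swapfile path are specific to my machine:

sudo fallocate -l 8G /swapfile   # allocate an 8 GB swap file
sudo chmod 600 /swapfile         # restrict access to root
sudo mkswap /swapfile            # format it as swap space
sudo swapon /swapfile            # enable it immediately
swapon --show                    # verify the new swap is active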
Any suggestion as to why this is happening after the upgrade, or is this the expected behaviour now?
If this is expected, then we will have to update our CI pipeline, which uses the free GitHub Actions service (I think they provide 1 CPU and 3.5 GB of RAM, though I'm not sure). Please let me know either way.
Hi @madanmeena, this doesn't seem too far out of expectations. We recommend providing at least 4 CPUs and 4 GiB of memory for the rasa-worker pod in our Helm chart, which does the model training, but it really depends on your pipeline and the size of your data.
Setting constrain_similarities to true is also not a requirement but a recommendation, so if you want to keep the model at its previous size, you can revert it.
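In plain Kubernetes terms, that recommendation corresponds to a resources request like the one below on the rasa-worker deployment; where exactly it goes in the chart's values.yaml depends on the chart version, so treat the surrounding keys as a sketch:

resources:
  requests:
    cpu: "4"
    memory: 4Gi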
It's just that we have been training the model through the GitHub Actions CI pipeline for the last year and never faced any issue. What you say about the Rasa X configuration is true, but most of the time we do not use it for training, as Git seems more appropriate to us: Rasa X can only be configured with one branch at a time, and we need to train the model on multiple branches, so it's easier to use CI for this.
We were able to run our CI pipeline free of cost until now, so I just wanted to double-check before using a dedicated machine for training, as that would add to the cost.
Regarding size, my main concern is performance, so if I am not getting any benefit from reducing the size, I can leave it as is; but if the size is related to the memory issue we are facing, then I can try training without it.
We are facing issues not only during model training but also while running the bot.
Does training work with your old pipeline on the newer version? Given your constraints, I would first evaluate whether the updated config really gives you that much of a performance boost. If the results aren't drastically better, stay with your old pipeline, which can run with fewer resources.
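For the NLU part, you can compare the two configs directly on the same data with the NLU comparison mode. The config file names and data path below are placeholders; check rasa test nlu --help for the exact flags available in your version:

rasa test nlu --nlu data/nlu.yml \
  --config config_old.yml config_new.yml \
  --runs 3 --percentages 0 25 50

That will train and evaluate both pipelines on identical splits, so you can see whether the 2.5 config is actually buying you anything.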