Reduce RASA model memory consumption or load time

Hi Team,

When we deploy a Rasa model (NLU + Core) it takes around 700MB of memory per model. Please help me reduce the model's memory consumption. I am running each model with rasa run --enable-api.

I have over 60 models to deploy, which puts a lot of load on memory. Please help me reduce this and let me know how it can be optimized.

Also, let me know if there is any way to run NLU parse calls without running an NLU model service.

Let's start with your config: which components are you using and why are they needed? It is important to understand the dimensions. Are you using any pre-trained models?

How big is your training data? How many intents/stories do you have?

You can use the pythonic way of running the service, but it doesn't come with any support. You can go through the code and implement it yourself; it's literally import rasa and go on from there. Otherwise, follow the documentation for starting a Rasa server.

On average, each model has 6 intents with 3 or 4 examples each, and 2 or 3 stories.

Below is the config for all my models.

language: en

pipeline:
  - name: WhitespaceTokenizer
  - name: RegexFeaturizer
    case_sensitive: true
    use_word_boundaries: false
  - name: CountVectorsFeaturizer
    stop_words:
      - a
      - and
      - any
      - are
      - aren't
      - because
      - being
      - by
      - can't
      - cannot
      - could
      - couldn't
      - does
      - doesn't
      - don't
      - during
      - from
      - further
      - if
      - in
      - into
      - itself
      - let's
      - more
      - of
      - or
      - other
      - ought
      - over
      - shan't
      - some
      - such
      - than
      - that
      - that's
      - them
      - themselves
      - this
      - those
      - through
      - under
      - until
      - up
      - very
      - where
      - where's
      - which
      - while
    analyzer: word
    min_ngram: 1
    max_ngram: 1
  - name: DIETClassifier
    epochs: 100
  - name: FallbackClassifier
    threshold: 0.5
    ambiguity_threshold: 0.2
  - name: EntitySynonymMapper

policies:
  - name: TEDPolicy
    max_history: 3
    epochs: 150
    batch_size: 32
    max_training_samples: 300
  - name: MemoizationPolicy
  - name: RulePolicy
    enable_fallback_prediction: 'false'
    restrict_rules: 'false'
    check_for_contradictions: 'false'

Can you give me any documentation, or point me anywhere I can get a head start on the pythonic way of NLU parsing?

I don't think there is any documentation on implementing the pythonic way with the latest Rasa. You have to do it yourself, but it is OSS, so you can simply check the code and walk through it on GitHub. Please keep in mind that I don't think this is officially supported, so fair warning.
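To give a rough head start, here is a minimal sketch of parsing a message directly in Python, assuming a Rasa 2.x install where rasa.model.get_model and rasa.nlu.model.Interpreter (the class mentioned later in this thread) are available; the model path is a placeholder:

    import os

    import rasa.model
    from rasa.nlu.model import Interpreter

    # Unpack the trained model archive (returns the path to the unpacked directory).
    unpacked = rasa.model.get_model("models/my-model.tar.gz")

    # Load only the NLU part of the model into memory.
    interpreter = Interpreter.load(os.path.join(unpacked, "nlu"))

    # Parse a message directly, without going through the HTTP server.
    result = interpreter.parse("what are my benefits?")
    print(result["intent"], result["entities"])

Since this bypasses rasa run entirely, you only keep the NLU part in memory, but you also lose everything the server gives you (Core, endpoints, tracker stores), so treat it as an unofficial workaround.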

Regarding your config: the biggest memory footprint is likely TensorFlow on your CPU. It doesn't seem that your config is using pre-trained models or anything, but I am surprised every model takes about 700MB when running the API :frowning:

I have tried deploying it in an Alpine-based Docker container, and each model is around 700MB. When I deploy it through an automated supervisord deployment it takes around 900MB, but even supervisord only runs it through the rasa run command with the --enable-api argument.

Can you tell me what the ideal memory requirement per model is?

Well, I did some tests of my own, and my model also shows about 500MB of memory usage, which includes DIET.

TensorFlow is a hard dependency of Rasa, so I think it is safe to say part of that memory footprint is TensorFlow, even when using non-TensorFlow-specific components such as spaCy.

I don't see any specific hardware requirements for Rasa OSS, but there is a hardware requirement for Rasa X, 60-70% of which I believe is needed to run the Rasa components that do the training and inference.

OK, thank you so much for the info, it was helpful. I will try the pythonic way of NLU parsing. Wish me luck :stuck_out_tongue:


Also, can you tell me the typical time taken to load a model? The alternative I am considering is loading models on demand; if the load time is low enough I can go with that approach. What I have seen is around 30 seconds. Please let me know your thoughts on it.

Yeah, that sounds about right. You can technically use an LRU cache to keep your loaded models in the app in a least-recently-used rotation, which would reduce response times for subsequent calls.
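A minimal sketch of that LRU idea, assuming the Interpreter.load() approach discussed above; the cache size of 8 and the helper names are just illustrative:

    import os
    from functools import lru_cache

    import rasa.model
    from rasa.nlu.model import Interpreter

    @lru_cache(maxsize=8)  # keep at most 8 models in memory; least recently used is evicted
    def get_interpreter(model_archive: str) -> Interpreter:
        unpacked = rasa.model.get_model(model_archive)
        return Interpreter.load(os.path.join(unpacked, "nlu"))

    def parse(model_archive: str, text: str) -> dict:
        # The first call for a given model pays the ~30s load cost;
        # later calls for the same model hit the cache.
        return get_interpreter(model_archive).parse(text)

With 60 models, the maxsize value is the knob that trades memory for latency: evicted models pay the full load time again on their next request.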

I am still facing this issue, and your understanding is correct that I am using Flask to interact with Rasa. I am caching the model returned by the Interpreter.load(model_path) method by storing it in memory using a queue. I have added the code snippet that generates the model in the issue itself. Even with the model cached, I expected memory consumption to increase by approximately 100-150MB, as the model persisted on disk is around 50MB. But in my case, it is increasing by 1.5GB on average with every training.
