How to reduce the memory footprint in Rasa Open Source?

Problem statement: I want to split pipeline processes across different servers to reduce the load.

Proposed idea: even a freshly trained Rasa model eats up ~1.5 GB of RAM per instance; apparently this happens because of how the model is implemented with TensorFlow.

  1. Does that mean there is no way to run multiple instances of Rasa on the server without each instance hogging ~1.5 GB of RAM?
  2. If I train Rasa NLU only, can I then run Rasa Core only (I couldn’t find a core-only command yet) to reduce that RAM usage?
  3. Is there some other way I can reduce the memory to, say, around 500 MB per instance? Thanks

If you want to run multiple Rasa models at once, you can try handling them via Python. For example, create a Python server in a framework of your choice and load all the models there. I think the only time memory usage would increase is while a query is being processed.

An example of how you can train and deploy Rasa models is shown here: Rasa NLU models in Python

Note that the blog is about handling NLU models, but it should give you an idea of how to load your complete model as well.
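To make the idea concrete, here is a minimal sketch (assuming Rasa 3.x and its Python `Agent` API; the client names and model paths are hypothetical) of one process that loads several models at startup and routes messages by client:

```python
# Minimal sketch: one Python process serving several Rasa models.
# Assumes Rasa 3.x; client names and model paths are hypothetical.
import asyncio
from rasa.core.agent import Agent

MODEL_PATHS = {
    "client_a": "models/client_a.tar.gz",
    "client_b": "models/client_b.tar.gz",
}

# Load every model once at startup; each Agent keeps its model in memory.
agents = {name: Agent.load(path) for name, path in MODEL_PATHS.items()}

async def handle(client: str, text: str, sender_id: str = "default"):
    # Route the incoming message to the right client's agent.
    return await agents[client].handle_text(text, sender_id=sender_id)

if __name__ == "__main__":
    # Example: send one message to client_a's bot and print the responses.
    print(asyncio.run(handle("client_a", "hello")))
```

In practice you would wrap `handle` in a small HTTP server (Sanic, FastAPI, etc.); the point is that all models share one process instead of each needing its own Rasa server.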

Thank you for the response. I have looked at the given code and will try it out. Meanwhile, I would like to know the standard hardware recommendation from Rasa for serving multiple clients from a single server (if that is recommended at all).

Use case: serving multiple clients across different domains on a single server.

How do players in the industry currently handle multiple clients?

Currently Rasa does not provide a feature to handle multiple models at once. The best way I found around this problem is to create a Python application, since the Rasa library is available only in Python and this approach is very easy to implement.

Hi Anoop, is there a difference in how Rasa Open Source vs. Pro/Enterprise consume memory? Our development team has been training a conversational bot, and even after upgrading to 1 TB of memory on our AWS infrastructure it still fails. That seems very excessive, and I am not sure if it has to do with the open source vs. “other” versions. The bot has fairly complex story blocks with checkpoints and intents, but I’m unsure. Thanks

Hi @Stepsvd ,

For this, you’ll have to connect with the Rasa Enterprise team, as they will be able to answer this much better.

Also, is it just one model that’s taking 1 TB of memory?

Thanks Anoop, yes, just one model. Is there someone on the Enterprise team you could perhaps refer me to?

@Stepsvd Just curious about the data points below.

  1. What is the volume of training data: number of intents, entities, and training examples?
  2. What is the compressed size of the model?
  3. Is it taking up 1 TB of RAM as soon as the service is started?

We have been serving multiple clients via different models, each as a separate service on the same EC2 instance. But our models are way smaller than yours.

Hi @siriusraja, are you serving multiple clients through a single EC2 server?

Exactly @siriusraja, I am curious about these points as well, as none of the models I have created ever got this big or started consuming this much memory.

I have one more question: how are you serving multiple models?

  • Different Rasa servers? But that would mean creating many ports, endpoints, and Rasa services.

  • Or handling multiple models via Python? This is what I have done in my project for handling multiple Rasa models.

Hi @anoopshrma

We start Rasa services on different ports, one for each client, each using the respective model. Of course, this approach ends up opening multiple ports, but that is manageable.

If you want, you can put an nginx-like proxy in front, which listens on a single port and internally routes to the different client services.
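As a sketch of that routing (the ports and path prefixes here are assumptions, not from the thread), the nginx config could look like:

```nginx
# Hypothetical sketch: one public port, one path prefix per client,
# each proxied to that client's Rasa service on its own internal port.
events {}

http {
    server {
        listen 80;

        # Client A's Rasa server
        location /client-a/ {
            proxy_pass http://127.0.0.1:5005/;
        }

        # Client B's Rasa server
        location /client-b/ {
            proxy_pass http://127.0.0.1:5006/;
        }
    }
}
```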

@faraz

Yes, we have multiple clients, but the concurrent traffic is not high; it’s spread across different hours of the day. So we were able to run multiple services serving different clients.

Each client model takes up 1-1.5 GB of RAM.

Hi @siriusraja

But will it be easy if you have to handle, let’s say, 100 clients at once? That would mean managing 100 Rasa services.

I’m just curious: wouldn’t it be better if you could handle all the models in one place?

I just want your thoughts on this.

@anoopshrma

It’s a trade-off between the administrative overhead of managing multiple services and flexibility.

When I run them as different services, I can start/stop the service for each client when business hours start in their time zone. This reduces the memory overhead on the server.
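For illustration, here is a minimal sketch of that start/stop pattern (the model directories, ports, and scheduling hook are assumptions):

```python
# Minimal sketch: start/stop one Rasa service per client so each model
# occupies RAM only during that client's business hours.
# Directories and ports below are hypothetical.
import subprocess

PROCS: dict[str, subprocess.Popen] = {}

def start_client(name: str, model_dir: str, port: int) -> None:
    # Launch "rasa run" for this client; the model stays in RAM only
    # while the process is alive.
    PROCS[name] = subprocess.Popen(
        ["rasa", "run", "--enable-api", "-p", str(port)],
        cwd=model_dir,
    )

def stop_client(name: str) -> None:
    # Terminate the service, releasing its ~1-1.5 GB of RAM.
    proc = PROCS.pop(name, None)
    if proc is not None:
        proc.terminate()
        proc.wait()

# In practice, a long-running scheduler (e.g. APScheduler) would call
# these at each client's business-hour boundaries:
# start_client("client_a", "/srv/bots/client_a", 5005)
# ...
# stop_client("client_a")
```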

Do you see any benefits in terms of memory/CPU savings in the single-service approach?

Hello sir @siriusraja, we also deploy our service the way you describe. However, since customers keep spam-creating new bots with their accounts, we have been struggling to keep resource consumption under control.

Is there a way to load the pre-trained language model as an internal server that provides features to the other components, instead of loading it for each and every bot instance?