We are using Rasa Community to build out a messaging system; however, we aren't able to scale it up to a decent number of concurrent users, maybe around 5,000. While reading various blogs and forum topics, a few people have suggested a Redis lock store or containerising Rasa, but I find very limited information available. Some suggestions and recommendations around achieving this scalability would be much appreciated.
What exactly is the issue you are facing with concurrent requests?
5000 concurrent users is a LOT!! Concurrent would mean that there are 5,000 active users sending messages to your bot at the same time, which would mean roughly 100k messages coming in every minute. You can always mock this using tools like locust.io to get a fair idea of what kind of scaled-out architecture you are looking at, but this is indeed quite a lot; that would technically make your bot the most successful engagement tool ever. Be careful of the fallout for the other dependencies that you must scale too, like the action server, Duckling server, etc.
For starters, you would need a load balancer in front of the Rasa servers. You can deploy 3-4 Rasa pods (on a distributed framework like k8s this is more intuitive to scale, but of course spinning up 4 VMs does the same). A lock store is also definitely needed, because with 4 Rasa servers running in parallel you don't want any of them to de-sync from the conversation. The same goes for the tracker store. Use Redis for its wonderful O(1) retrieval time; it gives you better performance than MongoDB or SQL, but of course it is more expensive, so create mechanisms to clean the tracker store after a given event, like session close for example.
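Pointing every Rasa server at the same Redis for both stores is done in `endpoints.yml`. A minimal sketch (host, ports and `db` numbers here are illustrative; check the Rasa docs for the full set of options such as passwords and TLS):

```yaml
# endpoints.yml -- values are illustrative, adjust for your deployment
lock_store:
  type: redis
  url: localhost
  port: 6379
  db: 1

tracker_store:
  type: redis
  url: localhost
  port: 6379
  db: 0
  record_exp: 86400   # expire tracker records after a day, one way to keep the store clean
```

Giving the tracker records an expiry (as with `record_exp` above) is one simple mechanism for the tracker-store cleanup mentioned earlier.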
Use a load-testing tool to see what median response time each server can sustain; simulate 2,000-3,000 requests/second. You can also monitor pod health with respect to CPU/memory and scale accordingly.
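On k8s, that CPU/memory-based scaling can be automated with a HorizontalPodAutoscaler. A sketch, assuming a Deployment named `rasa` (the name, replica counts and threshold are assumptions to tune against your load-test results):

```yaml
# Sketch of a HorizontalPodAutoscaler for a Rasa deployment named "rasa"
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: rasa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: rasa
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # add pods when average CPU exceeds 70%
```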
Can you share any references related to scaling Rasa via Docker?
I am not sure what kind of references you are asking for.
Do you mean scaling in general with Docker? Rasa is no different from any other application server.
I am more curious to understand whether we necessarily need Redis for the lock store to scale up, or whether we can sustain with the in-memory one. As I understand it, Rasa isn't distributed in nature, so how does Rasa behave then?
Usually you will have one or more servers running a Rasa web server (Sanic). Each server receives a request (a user message) with a conversation ID; ideally that's the identifier of the conversation, telling the server what has happened before and what to do next for that conversation ID.
If you have only one server processing requests, then for every request for this conversation ID it will create a ticket lock in the memory of that server, which makes sure that no other incoming message for the conversation can be processed yet. The server understands the order of the conversation, decides what to do next, and preserves the order of incoming messages.
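The single-server idea can be sketched as one lock per conversation ID, so messages for the same conversation are handled one at a time. This is purely illustrative (Rasa's actual in-memory lock store is built on async ticket locks, not plain threads):

```python
# Sketch of the single-server case: one lock per conversation ID, so two
# messages for the same conversation can never be processed in parallel.
# Illustrative only -- Rasa's real InMemoryLockStore uses async ticket locks.
import threading
from collections import defaultdict

_conversation_locks: dict = defaultdict(threading.Lock)
_registry_lock = threading.Lock()   # guards the lock registry itself

def process_message(conversation_id: str, message: str, log: list) -> None:
    with _registry_lock:
        lock = _conversation_locks[conversation_id]
    with lock:  # only one message per conversation runs at a time
        log.append((conversation_id, message))
```

Because the lock lives in this process's memory, the guarantee only holds while a single server handles all messages for the conversation, which is exactly the limitation the next posts address.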
When you scale, you have multiple such servers, and any one of them can receive a message for the same conversation ID, because that's how a load balancer works: it distributes load across them. You could create sticky sessions and ensure that, for a given conversation ID with an active session, requests go to that particular machine and not the others, but this is too complicated.
Thus, to avoid two servers processing two messages of the same conversation ID in parallel, you would have a shared lock store such as Redis that tells every server there is a lock on the conversation ID because one of them is currently processing a request; if other messages arrive at a different server, they wait until the lock is released. This preserves the integrity of the conversation. Users don't necessarily type one message and then wait for an answer, so you need to preserve the order.
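The shared-lock pattern boils down to: claim a key for the conversation ID before processing, and wait if another server already holds it. A minimal sketch, with an in-memory dict standing in for Redis so it is self-contained (against real Redis you would use `SET key value NX EX <ttl>` and `DEL`; all class and function names here are hypothetical, not Rasa's API):

```python
# Sketch of the shared conversation lock. FakeLockStore stands in for Redis;
# set_nx mimics SET ... NX EX (claim a key only if free, with a TTL).
import threading
import time
import uuid

class FakeLockStore:
    """In-memory stand-in for a shared store such as Redis."""
    def __init__(self):
        self._keys = {}                  # key -> (owner token, expiry time)
        self._mutex = threading.Lock()

    def set_nx(self, key: str, value: str, ttl: float) -> bool:
        """Claim the key if it is free or expired; return True on success."""
        with self._mutex:
            owner = self._keys.get(key)
            if owner is not None and owner[1] > time.monotonic():
                return False             # another server holds the lock
            self._keys[key] = (value, time.monotonic() + ttl)
            return True

    def delete(self, key: str, value: str) -> None:
        """Release the key, but only if we still own it."""
        with self._mutex:
            if self._keys.get(key, (None, 0))[0] == value:
                del self._keys[key]

def with_conversation_lock(store, conversation_id, handler, timeout=5.0):
    """Wait for the conversation lock, run handler, then release the lock."""
    token = uuid.uuid4().hex             # identifies this holder
    key = f"lock:{conversation_id}"
    deadline = time.monotonic() + timeout
    while not store.set_nx(key, token, ttl=10.0):
        if time.monotonic() > deadline:
            raise TimeoutError(f"could not lock {conversation_id}")
        time.sleep(0.01)                 # another server is processing; wait
    try:
        return handler()
    finally:
        store.delete(key, token)
```

The TTL matters in practice: if a server crashes mid-request, the lock expires instead of blocking that conversation forever, which is why a store with built-in key expiry like Redis fits this role well.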
Thus, a lock store is needed to preserve conversation integrity when you run Rasa in parallel on more than one server.