Recovery and Scalability for Rasa Bot Service(s)

I am new to Rasa and have just completed a POC. We are very excited to take it forward and build a full application around it. Before we do that, we have some fundamental questions about scalability and recovery. The answers will help us architect a better solution:

Scalability

  • How does the Rasa bot service scale across 100s or 1000s of parallel conversations?
      * Can we run many instances of Rasa behind a load balancer to scale?
      * Should we build a gateway service that creates a sticky session between a user and one of many bot instances?

Recovery

  • If the bot service holding a conversation with the user goes down, is there a way to recover, i.e. to continue the conversation on another bot service?
  • Is there a way to replicate or send the conversation state and slots for a conversation ID to a different bot instance, so that it can continue the conversation with the user?

Hi @sundeep_misra, welcome to the forum!

How does rasa bot service scale across 100s or 1000s of parallel conversations?

That will depend on how active your concurrent users are. We’ve measured that a single, non-replicated Rasa instance can handle around 20 messages per second.
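As a rough back-of-the-envelope sketch of what that figure means for sizing, the arithmetic could look like this. The function name, the per-user message rate and the target utilization are assumptions made for illustration; they are not Rasa settings or measured values, and only the 20 messages per second default comes from the answer above.

```python
import math


def replicas_needed(concurrent_users, messages_per_user_per_minute,
                    per_instance_rate=20.0, target_utilization=0.5):
    """Estimate how many replicas cover a given message load.

    per_instance_rate is the ~20 messages/second figure quoted above.
    target_utilization is an assumed safety margin, not a Rasa setting.
    """
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    # Convert the per-user rate to a total rate in messages per second.
    total_rate = concurrent_users * messages_per_user_per_minute / 60.0
    # Only plan to use a fraction of each instance's measured capacity.
    usable_rate = per_instance_rate * target_utilization
    return max(1, math.ceil(total_rate / usable_rate))


if __name__ == "__main__":
    # 1000 users each sending 6 messages per minute is 100 messages/second.
    print(replicas_needed(1000, 6))
```

Real throughput depends on your model and custom actions, so treat the result as a starting point for load testing rather than a guarantee.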

Can we run many instances of Rasa behind a load balancer to scale?

Yes, Rasa is built to run as a scalable service, so you can replicate the rasa-production containers behind your load balancer.

Should we build a gateway service that will play the role of creating a sticky session between user and one of many bot instances?

This isn’t necessary. We’ve recently introduced a ticket lock mechanism which ensures that conversations are locked at the time of processing and that incoming messages are dealt with in the right order, regardless of which of your replicas receives them. It’s called the RedisLockStore and you can check out the docs here.
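To make the idea of a ticket lock more concrete, here is a minimal in-memory sketch. It is not Rasa's implementation and does not use Redis: the class and function names (`TicketLock`, `InMemoryLockStore`, `handle_message`) are hypothetical, and threads stand in for replicas. Each message takes a ticket on arrival, and a worker may only process a message once its ticket number is being served.

```python
import threading


class TicketLock:
    """Serves waiters strictly in the order their tickets were issued."""

    def __init__(self):
        self._cond = threading.Condition()
        self._next_ticket = 0
        self._now_serving = 0

    def issue_ticket(self):
        with self._cond:
            ticket = self._next_ticket
            self._next_ticket += 1
            return ticket

    def wait_for(self, ticket):
        # Block until every earlier ticket has been released.
        with self._cond:
            self._cond.wait_for(lambda: self._now_serving == ticket)

    def release(self):
        with self._cond:
            self._now_serving += 1
            self._cond.notify_all()


class InMemoryLockStore:
    """Hypothetical stand-in for a shared lock store: one lock per conversation."""

    def __init__(self):
        self._guard = threading.Lock()
        self._locks = {}

    def lock_for(self, conversation_id):
        with self._guard:
            return self._locks.setdefault(conversation_id, TicketLock())


def handle_message(store, conversation_id, ticket, text, log):
    lock = store.lock_for(conversation_id)
    lock.wait_for(ticket)
    try:
        log.append(text)  # placeholder for the real message processing
    finally:
        lock.release()


def demo():
    store = InMemoryLockStore()
    log = []
    messages = ["hello", "book a table", "for two"]
    # Tickets are issued in arrival order.
    tickets = [store.lock_for("user-42").issue_ticket() for _ in messages]
    # Start the workers in reverse order to mimic replicas racing each other.
    pairs = reversed(list(zip(tickets, messages)))
    workers = [
        threading.Thread(target=handle_message,
                         args=(store, "user-42", ticket, text, log))
        for ticket, text in pairs
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return log


if __name__ == "__main__":
    print(demo())
```

Even though the workers start in reverse, the log ends up in arrival order, which is the property that makes sticky sessions unnecessary. In a multi-replica deployment the ticket counters would have to live in a shared store rather than in process memory.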

In the event the bot service holding conversation with the user goes down…

If an instance handling a user conversation goes down, your container orchestrator just won’t send any more messages to that instance. Another instance will then receive the next message and pick up the conversation where it left off. However, any message that was already being processed when your bot service failed (as opposed to being queued and waiting to be processed) will be lost.

Is there a way to replicate/send conversation state and slot for a conversation id to continue on a different bot instance to continue conversation with user?

As said in the previous answer, that won’t be necessary. The state of the conversation is persisted to a database, so you won’t have to share the conversation state between instances.
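A minimal sketch of why this works, assuming every instance reads and writes conversation state through the same store. The names `SharedTrackerStore` and `BotInstance` are hypothetical and a dict stands in for the database; this is not Rasa's tracker store API.

```python
class SharedTrackerStore:
    """Hypothetical stand-in for a database-backed tracker store."""

    def __init__(self):
        self._events = {}

    def load(self, conversation_id):
        # Return a copy so callers cannot mutate stored state by accident.
        return list(self._events.get(conversation_id, []))

    def save(self, conversation_id, events):
        self._events[conversation_id] = list(events)


class BotInstance:
    """Keeps no conversation state of its own between messages."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def handle(self, conversation_id, text):
        events = self.store.load(conversation_id)
        events.append(text)
        self.store.save(conversation_id, events)
        return len(events)  # number of messages seen so far in this conversation


if __name__ == "__main__":
    store = SharedTrackerStore()
    b1 = BotInstance("b1", store)
    b2 = BotInstance("b2", store)
    b1.handle("user-42", "hello")
    b1.handle("user-42", "book a table")
    del b1  # simulate the first instance going down
    print(b2.handle("user-42", "for two"))
    print(store.load("user-42"))
```

Because the second instance loads the full history from the shared store before handling the next message, nothing has to be copied from the failed instance.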

I hope that helps!

@ricwo,

Thanks, this is very helpful. I have one more question:

I will be in a situation with multiple domain bots, and I don’t want to deploy a separate instance of each bot on a separate server. Is there a way to host multiple domain bots on one server and invoke a bot by domain from one server URL?

Thanks Sundeep

Hey @ricwo, thank you for this thorough answer. I have two follow-up questions:

  1. If we use Redis as the TrackerStore or the LockStore, what will happen if Redis loses data? Will the store stop functioning?

  2. By experimenting, I found that failover already works with only a Redis tracker store: I started two bot instances, b1 and b2, listening on ports p1 and p2. In the middle of a conversation with b1, I killed it and tried to continue the conversation with b2, and it worked! Is this expected? If yes, what is the use of the LockStore? I am quite confused after looking only at the docs: https://rasa.com/docs/rasa/api/lock-stores

  1. Can you specify what you mean? If Redis doesn’t work, neither the RedisTrackerStore nor the RedisLockStore will work.

  2. Yes, that is expected. It doesn’t matter where you continue your conversation. You can send message 1 to instance A, message 2 to instance B and so on, and the lock store ensures they’re processed in order. As I mentioned, the ticket lock is a mechanism which ensures that conversations are locked at the time of processing and that incoming messages are dealt with in the right order, regardless of which of your replicas receives them. The LockStore just holds and manages these ticket locks across multiple instances, using Redis as a persistence layer.

@sundeep_misra No such routing is possible out of the box within one server at the moment.

  1. Yes. Redis can lose data even if you try to persist it. I just wanted to check whether Rasa handles this. It seems it does not.
  2. I see. Now I understand what the LockStore is for. Thanks!