Need details on multi-thread architecture of rasa server

Hi, I was reading about the capacity of the rasa server to handle requests per second, and this post (Is rasa server running in multi-threaded way?) says that the rasa server can handle 20 requests per second. So I have the following questions:

  1. Is there any official architecture documentation or diagram that I can refer to for insight into how it can support 20 requests in parallel?
  2. Which tools can I use to generate progressive load to find out its capacity to handle requests per second, so that I can determine its scaling needs? Any details on how to hook such a tool up to the rasa server are appreciated. We are planning to deploy a rasa bot within my team as a starting point before rolling it out to a larger user base. Thanks, Rahul Srivastava

We also ran into this problem; it is quite surprising that rasa has such low throughput. We get around it by using Docker and Kubernetes with lots of pods. As for load tools, if you have a SIP interface you can use something open source like sipp, or something commercial such as StarTrinity SipTester. We test up to 1000 sessions using audio recordings of a happy path. If you're text-only, I think you will need to build something.

  1. Rasa runs using python's asyncio framework (asyncio — Asynchronous I/O — Python 3.9.2 documentation), which allows it to context switch while waiting for IO.
  2. The easiest option is if you have a channel that is accessed over HTTP; then you can use one of the multitude of HTTP load testing tools. Here's an example of a benchmark that I set up: turn-rasa-connector/benchmark at develop · praekeltfoundation/turn-rasa-connector · GitHub
  1. We used Locust-based HTTP load testing (see the example here: GitHub - nprovotorov/rasa-rest-api-loadtest: Basic tool (based on Locust) to perform load tests of a Rasa-based chatbot via the REST API) - maybe it will help you

Hello @rssrivast ! How many workers are you using to start your server? I was having performance problems with the web server used by rasa (Sanic). It was handling 225 requests from 45 threads, each asking the bot 5 questions, and the worst response time was close to 5s.

After reading a lot about Sanic and the rasa lock_store, I started a server with 2 workers (instead of the default 1) and the performance surprised me. I was able to handle more than 5000 requests from 1000 threads with response times below 4s.

In order to configure the lock_store you need to have a redis server instance and configure it in your endpoints.yml file; here are the docs.

Here is my endpoints.yml:

    lock_store:
      type: redis
      url: localhost
      port: 6379
      db: 0
      use_ssl: False
      socket_timeout: 10

Here is what Sanic's documentation says about workers.

To have more workers on rasa using Sanic, you should set an env variable called SANIC_WORKERS (not in the rasa documentation) to the number of workers you want. Please limit the number of workers based on the available cores, as described in the documentation (link above).

Hope this helps you!

P.S.: This only worked using Linux; on a Windows machine it raises an error.


Thanks for all the details. This is very useful. I will update with my findings after I implement it. Rahul Srivastava

I have one more question: even if I start multiple rasa server instances, the bottleneck is still the single-threaded call between the “rasa server” and the “rasa action server”. Our “rasa action server” makes a blocking call to external APIs and does not handle any other incoming calls until the request succeeds or finally times out. All the other requests have to wait until then. How can we make the “rasa action server” non-blocking to other incoming calls even if a previous call is still active?


You may also start the action server with multiple workers via the env variable ACTION_SERVER_SANIC_WORKERS.

@rssrivast ,

A common root cause of poor performance is an action server that uses a synchronous method to call external services, for example with the requests package.

These types of calls block the asyncio logic, and all other calls will wait.

Make sure to use async logic everywhere in your custom actions; for example, replace the requests package with the aiohttp package.

Thanks for suggestions. I will try it out.

@Arjaan Quick question: if we have a custom NLU component in the pipeline which calls an external service using the requests package, should we also use the aiohttp package?

@gagangupt16 , yes, definitely use the aiohttp package to make it async.

Else your external service calls will block everything else.


@Arjaan I wonder if you guys have any example code that shows this. I’ve tried to do this using aiohttp from within an action and the await command still blocks.

@Randywreed ,

Here is a code snippet that compares using the requests module with the aiohttp module:

import os
import json
import logging
from typing import Any, Text, Dict, List, Union

import requests
import aiohttp
from rasa_sdk import Action, Tracker
from rasa_sdk.forms import FormAction
from rasa_sdk.executor import CollectingDispatcher

logger = logging.getLogger(__name__)

rasa_x_host = os.environ.get("RASA_X_HOST", "rasa-x:5002")

# Flag to switch between the two implementations (defined here so the
# snippet is self-contained)
USE_AIOHTTP = os.environ.get("USE_AIOHTTP", "true").lower() == "true"

async def tag_convo(tracker: Tracker, label: Text) -> None:
    """Tag a conversation in Rasa X with a given label"""
    endpoint = f"http://{rasa_x_host}/api/conversations/{tracker.sender_id}/tags"

    logger.debug("Tagging a conversation at url=%s with data=%s", endpoint, label)

    if not USE_AIOHTTP:
        # Synchronous: blocks the event loop until the response arrives"using requests module")
        response =, data=label)"Response status code: %s", response.status_code)
    else:
        # Asynchronous: the event loop can serve other coroutines while waiting"using aiohttp module")
        async with aiohttp.ClientSession() as session:
            async with, data=label) as response:
      "Response status code %s", response.status)

@Arjaan That’s interesting. Does that work in an action? When I use this code, which I believe is the same as yours, I get an error "AttributeError: __aexit__" (full traceback below).
Here’s the code:

async def run(
        dispatcher: CollectingDispatcher,
        tracker: Tracker,
        domain: Dict[Text, Any],
        ) -> List[Dict[Text, Any]]:    
        import asyncio
        import aiohttp
        async with aiohttp.ClientSession as session:
            import requests
            import json
            headers = {'Content-Type': 'application/json',}
            params = (('output_channel', 'latest'),)
            d = {"name" : "EXTERNAL_dry_plant", "entities": {"plant": "Orchid"}}

            async with,data=d,headers=headers) as resp:
                print (resp.status)

Here’s the error:

Exception occurred while handling uri: 'http://localhost:5055/webhook'
Traceback (most recent call last):
  File "/home/randy/.cache/pypoetry/virtualenvs/reminderbot-fAAFfjM8-py3.7/lib/python3.7/site-packages/sanic/", line 938, in handle_request
    response = await response
  File "/home/randy/.cache/pypoetry/virtualenvs/reminderbot-fAAFfjM8-py3.7/lib/python3.7/site-packages/rasa_sdk/", line 103, in webhook
    result = await
  File "/home/randy/.cache/pypoetry/virtualenvs/reminderbot-fAAFfjM8-py3.7/lib/python3.7/site-packages/rasa_sdk/", line 398, in run
    action(dispatcher, tracker, domain)
  File "/home/randy/.cache/pypoetry/virtualenvs/reminderbot-fAAFfjM8-py3.7/lib/python3.7/site-packages/rasa_sdk/", line 230, in call_potential_coroutine
    return await coroutine_or_return_value
  File "/home/randy/Rasa/reminderbot/actions/", line 107, in run
    async with aiohttp.ClientSession as session:
AttributeError: __aexit__

If I put it in its own async function, then it errors saying it needs an await. The only way I’ve been able to get this to work in an action is to use requests in a multiprocessor thread.

Lock stores are used to serialize message handling per conversation, so I presume you are utilizing more than just the NLU? I have a rasa server running where I only access the NLU, and I tried using the SANIC_WORKERS environment variable, but after running locust there was absolutely no change in performance. Could you share how exactly you were able to set the environment variable to get this performance?

Hey, I was trying to use this, but I cannot see this variable in the rasa code base (3.2.8). Has it been removed by any chance?

Realized that this env variable is in the rasa_sdk code base, where the custom action endpoint spins up the given number of workers. Thank you for the info on this variable.

Hi, we are facing the same problem in my team. Basically, the point is that each request is executed sequentially, so we are not getting 20 requests per second served (you can see the response time increasing over time).

We are making a call to an external data server but we tested that solution and it is distributed.

We have tried changing the number of SANIC_WORKERS for both Core and Action servers but the performance doesn’t change.

We even tried to implement the redis solution suggested by @IRBraga (we just started a redis server and configured endpoints.yml as indicated in the answer above), but the stress tests don’t show any change in response time.

Do we have to add a configuration file for the redis server or do you suggest something else? Thanks in advance!


Hi, can you help me with this? How did you get it to improve? Even with this change we are not able to handle more than 20 req/sec. Our rasa open source bot is deployed in kubernetes with no limit on resources. The CPU utilization does not go above 50%, so autoscaling does not trigger, but the requests are failing.