Hi,
I have a Rasa Open Source model deployed using Docker Compose on an AWS server with 32 cores.
Versions used:
- Rasa Open Source image: rasa/rasa:2.8.14-full
- Rasa Action Server image: rasa/rasa-sdk:2.8.4
I’ve also added a MongoDB tracker store to the docker-compose.yml, and everything works correctly with a simple custom connector.
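For reference, here is a trimmed-down sketch of the relevant parts of my setup. Service names, volumes, and the Mongo settings are illustrative, not my exact file; the tracker-store block lives in endpoints.yml as per the Rasa docs:

```yaml
# docker-compose.yml (sketch)
version: "3.4"
services:
  rasa:
    image: rasa/rasa:2.8.14-full
    ports:
      - "5005:5005"
    volumes:
      - ./:/app
    command: ["run", "--enable-api", "--endpoints", "endpoints.yml"]
  action-server:
    image: rasa/rasa-sdk:2.8.4
    volumes:
      - ./actions:/app/actions
  mongo:
    image: mongo:4.4

# endpoints.yml (sketch)
# tracker_store:
#   type: mongod
#   url: mongodb://mongo:27017
#   db: rasa
```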
I want to know if there is a way to run Rasa so that it utilizes all the cores. In my testing with a large user count, each user sending 20~30 messages, the average response time of my model increases significantly.
Testing scenario: 100 users simulated in parallel, each sending around 30 messages. The average response time is almost 15~16 seconds, which is far too high.
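For context, my load test is roughly the following kind of script (the URL assumes Rasa's REST channel on the default port 5005; `simulate_user` and `summarize` are just names I picked here, and the real test uses my custom connector instead):

```python
import json
import statistics
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumption: Rasa's REST input channel is enabled and exposed on port 5005.
RASA_URL = "http://localhost:5005/webhooks/rest/webhook"
NUM_USERS = 100
MSGS_PER_USER = 30


def simulate_user(user_id: int) -> list:
    """Send MSGS_PER_USER messages as one sender and record each round-trip time."""
    latencies = []
    for i in range(MSGS_PER_USER):
        payload = json.dumps(
            {"sender": f"user-{user_id}", "message": f"test message {i}"}
        ).encode("utf-8")
        req = urllib.request.Request(
            RASA_URL, data=payload, headers={"Content-Type": "application/json"}
        )
        start = time.monotonic()
        urllib.request.urlopen(req).read()
        latencies.append(time.monotonic() - start)
    return latencies


def summarize(latencies) -> float:
    """Average response time in seconds across all messages."""
    return statistics.mean(latencies)


if __name__ == "__main__":
    # One thread per simulated user, all sending their messages concurrently.
    with ThreadPoolExecutor(max_workers=NUM_USERS) as pool:
        per_user = list(pool.map(simulate_user, range(NUM_USERS)))
    all_latencies = [t for user in per_user for t in user]
    print(f"avg response time: {summarize(all_latencies):.2f}s")
```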
Monitoring the CPU usage of the 32 cores and the Docker container stats, I see that the bottleneck is the “rasa” container, which runs at ~100% CPU usage, i.e. only a single core is being utilized (both the action-server and MongoDB containers stay below 5% CPU usage).
Any suggestions on how I should run this so that it utilizes all the CPU cores and brings the response time down to 1~1.5 seconds? Is this not the correct deployment method? And what is the recommended Rasa Open Source-only deployment option for high-traffic chatbots?
Any help is much appreciated.