While doing some benchmarking with our bot, we are only able to process 10 msgs/s per instance, with that instance at 100% CPU usage. CPU usage on the action server, PostgreSQL, and Redis remains low during the benchmark.
Is this the kind of performance to expect, or are there ways we can improve the performance of the bot? It seems very slow, and it will be very expensive to scale this bot up to production levels if we’re only getting 10 msgs/s per CPU.
Doing some profiling, it seems like ~45% of the CPU time is spent in the tracker store (dropping to around 40% when using a local SQLite file), ~35% is spent inside TensorFlow, and the remainder in miscellaneous Python work, like the featurizer.
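For anyone asking how to reproduce this kind of breakdown: a minimal sketch using Python’s built-in cProfile. The `handle_message` and `featurize` functions here are just stand-ins for however you invoke your agent, not actual Rasa internals:

```python
import cProfile
import io
import pstats

def featurize(text):
    # Stand-in for CPU-heavy NLU featurizer work
    return [ord(c) for c in text] * 100

def handle_message(text):
    # Stand-in for the bot's message-handling path
    feats = featurize(text)
    return sum(feats)

profiler = cProfile.Profile()
profiler.enable()
for _ in range(1000):
    handle_message("hello world")
profiler.disable()

# Print the functions that accumulated the most time
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats(10)
print(stream.getvalue())
```

For a running process you can’t restart, a sampling profiler like py-spy (mentioned further down in this thread) gives a similar per-function breakdown without code changes.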
But I am getting similar enough performance on my local machine compared to running on the VMs that I think it’s a fair comparison.
What I’m wondering is: is this the kind of per-core performance we should be expecting, or are there ways to increase the efficiency of the bot by tweaking some things? And if so, what should we look at tweaking?
Hi! I came here also looking for performance benchmarks and other kinds of metrics. Our Rasa instance is also experiencing some latency. How are you measuring the CPU consumption and other metrics per Rasa component/function?
It’s strange that it’s only for some utterances; I would expect the NLU, etc., to have consistent processing times.
If it’s only happening for certain users, it could be the tracker store. Most tracker stores store and retrieve the entire conversation history to process any message. The reason we chose the SQL tracker store is that it only retrieves the history for the current session (see the “Dialogue tracker gets too big” thread).
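The session-scoped retrieval described above can be illustrated with a toy event list. The event names mirror Rasa’s, but this is only a sketch of the idea, not the actual tracker-store code:

```python
def events_since_last_session(events):
    """Return only the events belonging to the most recent session,
    i.e. everything from the last 'session_started' event onwards."""
    last_session = 0
    for i, event in enumerate(events):
        if event["event"] == "session_started":
            last_session = i
    return events[last_session:]

history = [
    {"event": "session_started"},
    {"event": "user", "text": "hi"},
    {"event": "bot", "text": "hello"},
    {"event": "session_started"},
    {"event": "user", "text": "order pizza"},
]

# Only the latest session is loaded, no matter how long the user's
# total history is -- per-message cost stays roughly constant.
current = events_since_last_session(history)
print(len(current))
```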
If it’s happening for certain actions, especially custom actions, then I would look at what those are doing, and maybe benchmark and profile those actions to see where you can make improvements.
Running the test multiple times without resetting the tracker store, we start seeing performance degradation after 10 interactions; by 20 interactions we’re down to 10 req/s (a further 22% reduction).
Hello everyone. I am also having trouble scaling our production bots. I have not profiled the code as thoroughly as rudi, but I’ve used Locust to benchmark our custom HTTP channel. Our bots are real-time, so we need sub-500 ms latencies, and each process isn’t able to handle much traffic. The latencies start going up linearly after 4–5 concurrent conversations; with 50 users we see around 4-second delays. We are using the Redis tracker store and Redis lock store, and the NLU pipeline also involves Duckling.
Though I have rewritten the Duckling component and the Interpreter classes to be async, I think the culprit could be the Redis tracker store and lock store, since they are not async-capable, and whenever the Agent tries to fetch or store the tracker it blocks the event loop. As concurrency increases, the latencies pile up and we see them in our conversations. We are planning to rewrite the Redis tracker store and lock store using aioredis. What are your thoughts @Tobias_Wochinger?
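Before a full rewrite, a cheaper way to stop blocking calls from stalling the loop is to push them onto a thread pool with `run_in_executor`. A minimal sketch using only the standard library, where `time.sleep` stands in for a synchronous Redis round trip:

```python
import asyncio
import time

def fetch_tracker_blocking(sender_id):
    # Stand-in for a synchronous Redis tracker fetch
    time.sleep(0.1)
    return {"sender_id": sender_id, "events": []}

async def handle_message(sender_id):
    loop = asyncio.get_running_loop()
    # Run the blocking call in the default thread pool so other
    # coroutines keep making progress in the meantime.
    return await loop.run_in_executor(None, fetch_tracker_blocking, sender_id)

async def main():
    start = time.perf_counter()
    # Ten concurrent "conversations": with a blocked loop this would
    # take ~1s; with the executor it takes a small multiple of 0.1s
    # (depending on thread-pool size).
    await asyncio.gather(*(handle_message(f"user-{i}") for i in range(10)))
    return time.perf_counter() - start

elapsed = asyncio.run(main())
print(f"{elapsed:.2f}s")
```

This only helps if the loop is actually blocked on I/O; if the time is going to CPU-bound prediction (as it turns out later in the thread), threads won’t buy you anything.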
I would look at profiling it, to see where the bottleneck is, so you know where to focus your effort.
You could look at making Redis access async, but I would be surprised if Redis is taking long to reply, unless you have very high latency to your Redis instance.
What’s more likely, and the reason we went with the PostgreSQL tracker store, is that the Redis tracker store keeps the full history for every user and never truncates it. So the more your users use the bot, the slower it becomes. It also means that for every message, you have to serialise and deserialise a massive amount of history to and from Redis.
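That serialisation cost is easy to see in isolation. A toy measurement of how the per-message round trip grows with stored history (plain json here rather than the real store’s serialisation, but the scaling behaviour is the point):

```python
import json
import time

def round_trip(history):
    # What a full-history tracker store does on *every* message:
    # deserialise the whole history, append one event, serialise it back.
    events = json.loads(history)
    events.append({"event": "user", "text": "hello"})
    return json.dumps(events)

timings = {}
for n_events in (100, 10_000):
    history = json.dumps([{"event": "user", "text": "hi"}] * n_events)
    start = time.perf_counter()
    for _ in range(100):
        round_trip(history)
    timings[n_events] = time.perf_counter() - start
    print(n_events, f"{timings[n_events]:.4f}s")
```

The per-message cost scales with total history size, which matches the degradation-after-N-interactions numbers reported earlier in the thread.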
An easy way to test this is to look at your CPU usage once the delays start going up linearly. If you’re close to 100% usage, changing to async won’t help: you’re CPU bound. If CPU usage is low, then something is likely blocking the loop.
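You can also run that check from inside the process by comparing CPU time against wall-clock time; a sketch using only the standard library (the two lambdas are stand-in workloads):

```python
import time

def cpu_ratio(workload):
    """Ratio of CPU time to wall time for a callable.
    Close to 1.0 -> CPU bound, async won't help.
    Close to 0.0 -> waiting on I/O, or the loop is blocked elsewhere."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    workload()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return cpu / wall

cpu_bound = cpu_ratio(lambda: sum(i * i for i in range(2_000_000)))
io_bound = cpu_ratio(lambda: time.sleep(0.2))
print(f"cpu-bound ratio: {cpu_bound:.2f}, io-bound ratio: {io_bound:.2f}")
```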
Also check your Duckling server’s response times; that could also be a bottleneck, since all processes will be simultaneously trying to access the same Duckling server.
You are right, rudi. I profiled the process using py-spy, and a chunk of the time is spent on policy and NLU prediction, and that’s straight-up CPU time. The time also goes up linearly with the number of concurrent users. I have profiled Duckling too, and get low latencies in our cluster even with 500 concurrent calls. I looked at my CPU with 1 worker and 4–5 concurrent users: it shoots up to 100%, and py-spy shows a significant amount of time spent in the NLU pipeline (Duckling excluded from the pipeline) and TED policy prediction, which are purely CPU-bound tasks, so the event loop gets blocked.
Ah I see. Yes, there’s not much you can do in that case, besides increasing the number of processes/workers you’re running and load balancing between them.