Performance of a Production bot

I’ve got a Rasa bot running 1.10.2 (source here: GitHub - praekeltfoundation/healthcheckbot) that I’m having some difficulty scaling.

We’re using PostgreSQL for our tracker store, Redis for our lock store, and a custom connector (GitHub - praekeltfoundation/turn-rasa-connector: A Rasa Connector for https://www.turn.io/), although we get the same performance using the REST connector, so I don’t think that’s where the performance issue is.

While benchmarking, our bot is only able to process 10 msg/s per instance, with that instance at 100% CPU usage. Usage on the action server, PostgreSQL, and Redis remains low during the benchmark.

Is this the kind of performance to expect, or are there ways we can improve the performance of the bot? It seems very slow, and it will be very expensive to scale this bot up to production levels if we’re only getting 10 msg/s per CPU.

Doing some profiling, it seems like ~45% of the CPU time is being spent in the tracker store, which drops to around 40% when using a local SQLite file. ~35% of the CPU time is being spent inside TensorFlow, and the remainder inside miscellaneous Python things, like the featurizer.


Thanks for sharing your profiling results. Super interesting :pray:

In which setting are you running these tests? Is it a single machine or a K8s setup, and what are the specs of the machines?

To me it makes sense that inference is computationally intensive, but the tracker store definitely occupies too much time in there.

The profiling was run on my local machine, a single instance on a 2014 i5 MacBook Pro.

The benchmarking is done in a k8s setup using Azure D8s v3 VMs: Dv3 and Dsv3-series - Azure Virtual Machines | Microsoft Docs

But I am getting similar enough performance results on my local machine compared to running on the VMs that I think it’s a good enough comparison.

What I’m wondering is: is this the kind of per-core performance we should be expecting, or are there ways to increase the efficiency of the bot by tweaking some things, and if so, what should we look at tweaking?

@rudi I think you’ve been in touch with Mady from our team to set up a call about this before - are you still up for discussing this in a call?

@akelad The call wasn’t specifically for this, but happy to add this to the list of things. I’ll contact Mady to set up a time for a call.


Hi! I also came here looking for performance benchmarks and other kinds of metrics. Our Rasa instance is also experiencing some latency. How are you measuring the CPU consumption and other metrics per Rasa component/function?

@MStanwood In terms of setting up the benchmark, here is some information about how I did that: turn-rasa-connector/benchmark at develop · praekeltfoundation/turn-rasa-connector · GitHub . It will differ depending on the connector you use. You can probably simplify it by using the REST connector, if you don’t want to benchmark the performance of your connector.
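If you just want a rough throughput number without a custom connector, a small load script against the REST channel is enough. The sketch below is illustrative, not the benchmark linked above: the /webhooks/rest/webhook endpoint and the {"sender", "message"} payload are the standard REST channel API, while the URL, concurrency, and message text are placeholders you’d adjust for your bot.

```python
import asyncio
import time

import aiohttp

# Placeholders: adjust the URL, concurrency, and message to your setup.
RASA_URL = "http://localhost:5005/webhooks/rest/webhook"
CONCURRENCY = 10
MESSAGES_PER_USER = 20


async def run_user(session: aiohttp.ClientSession, sender: str) -> None:
    """Send a fixed number of messages as a single conversation."""
    for _ in range(MESSAGES_PER_USER):
        payload = {"sender": sender, "message": "hi"}
        async with session.post(RASA_URL, json=payload) as resp:
            await resp.json()  # wait for the bot's replies before sending the next message


async def main() -> None:
    start = time.monotonic()
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(
            *(run_user(session, "user-{}".format(i)) for i in range(CONCURRENCY))
        )
    elapsed = time.monotonic() - start
    total = CONCURRENCY * MESSAGES_PER_USER
    print("{} messages in {:.1f}s -> {:.1f} msg/s".format(total, elapsed, total / elapsed))


if __name__ == "__main__":
    asyncio.get_event_loop().run_until_complete(main())
```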

In terms of profiling, I used Python’s cProfile, so I ran my bot using something like:

python -m cProfile -o out.profile /path/to/venv/lib/python3.6/site-packages/rasa/__main__.py run

And then I used SnakeViz to plot the resulting profile and investigate the results:

snakeviz out.profile

Thanks… will definitely try the benchmark. Our Rasa doesn’t feel super slow… but for some utterances it does, so we really need to investigate.

It’s strange that it’s only for some utterances; I would expect the NLU, etc., to have consistent processing times.

If it’s only happening for certain users, it could be the tracker store. Most tracker stores store and retrieve the entire history to process any message. The reason we chose the SQL tracker store is that it only retrieves the history for the current session (see: Dialogue tracker gets too big).
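Conceptually, that session-only retrieval boils down to something like the sketch below. It is just the idea, not Rasa’s actual implementation; it assumes the usual Rasa event dicts with an `event` key and `session_started` markers.

```python
from typing import Dict, List


def events_for_current_session(events: List[Dict]) -> List[Dict]:
    """Keep only the events belonging to the most recent session.

    Rasa marks a new session with a `session_started` event, so everything
    before the last one belongs to earlier sessions and can be skipped.
    """
    last_session_start = None
    for i, event in enumerate(events):
        if event.get("event") == "session_started":
            last_session_start = i
    if last_session_start is None:
        return events  # no session boundary recorded: fall back to the full history
    return events[last_session_start:]
```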

If it’s happening for certain actions, especially custom actions, then I would look at what those are doing, and maybe benchmark and profile those actions to see where you can make improvements.


That’s definitely the next step… taking a look at the custom actions and getting some metrics.

Hey rudi! Thanks for the snakeviz tip… we used it today and it gave us a lot of visibility. :slight_smile: Owe you a beer!


Doing some further testing, enabling each external dependency one at a time, and also comparing an empty vs a non-empty tracker store:

Mid-2014 MacBook Pro, Redis and PostgreSQL running locally, stores cleared before every test run:

  • REST channel, in-memory lock, in-memory tracker store: 23 req/s
  • REST channel, Redis lock, in-memory tracker store: 23 req/s
  • REST channel, Redis lock, PostgreSQL tracker store: 17 req/s (26% reduction)
  • custom channel, Redis lock, PostgreSQL tracker store: 15 req/s (9% further reduction)

Running the test multiple times without resetting the tracker store, we start seeing performance degradation after 10 interactions; by 20 interactions we’re down to 10 req/s (a further 22% reduction).

I am also facing this issue. I found that the issue is with the POST call made to the action server.

Hello everyone. I am also having trouble scaling our production bots. I have not profiled the code as thoroughly as rudi, but I’ve used Locust to benchmark our custom HTTP channel. Our bots are real-time, so we need sub-500ms latencies, and each process isn’t able to handle much traffic. The latencies start going up linearly after 4-5 concurrent conversations. With 50 users we see around 4-second delays. We are using the Redis tracker store and Redis lock store. The NLU pipeline also involves Duckling.

Though I have rewritten the Duckling component and the Interpreter classes to be async, I think the culprit could be the Redis tracker store and the lock store, since they are not async-capable, and whenever the Agent tries to fetch or store the tracker it blocks the event loop. As concurrency increases, the latencies pile up and we see those latencies in our conversations. We are planning to rewrite the Redis tracker store and the Redis lock store using aioredis. What are your thoughts @Tobias_Wochinger?
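For the tracker store side, the plan is roughly to swap the blocking Redis calls for awaitable ones. A minimal sketch of that part, assuming an aioredis 2.x-style client; it only covers the get/set of serialized tracker state, not Rasa’s actual TrackerStore interface (which is synchronous in 1.10):

```python
import json
from typing import Optional

import aioredis  # assumes aioredis 2.x, whose client mirrors redis-py


class AsyncRedisTrackerStore:
    """Sketch of the Redis I/O half of an async tracker store.

    This only shows non-blocking get/set of serialized tracker state; it is
    not Rasa's TrackerStore interface.
    """

    def __init__(self, url: str = "redis://localhost:6379", ttl: Optional[int] = None) -> None:
        self._redis = aioredis.from_url(url, decode_responses=True)
        self._ttl = ttl

    async def retrieve(self, sender_id: str) -> Optional[dict]:
        # awaiting here yields to the event loop instead of blocking it
        raw = await self._redis.get(sender_id)
        return json.loads(raw) if raw else None

    async def save(self, sender_id: str, tracker_state: dict) -> None:
        await self._redis.set(sender_id, json.dumps(tracker_state), ex=self._ttl)
```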

I would look at profiling it, to see where the bottleneck is, so you know where to focus your effort.

You could look at making Redis access async, but I would be surprised if Redis is taking a long time to reply, unless you have very high latency to your Redis instance.

What’s more likely, and the reason we went with the PostgreSQL tracker store, is that the Redis tracker store stores all history for all users, and never truncates that history. So the more your users use the bot, the slower it becomes. It also means that for every message, you have to serialise and deserialise a massive amount of history to and from Redis.

An easy way to test this is to look at your CPU usage once the delays start going up linearly. If you’re close to 100% usage, that means changing to async won’t help; you’re CPU-bound. If CPU usage is low, then something is likely blocking the loop.
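If you want to check the loop-blocking case from inside the process, a rough sketch of an event-loop lag monitor is below (the interval and threshold are arbitrary). You’d have to schedule it yourself, e.g. with asyncio.create_task(monitor_event_loop_lag()) somewhere inside the running app.

```python
import asyncio
import logging
import time

logger = logging.getLogger(__name__)


async def monitor_event_loop_lag(interval: float = 1.0, threshold: float = 0.1) -> None:
    """Warn whenever the event loop wakes up noticeably late.

    Sleeps for `interval` seconds in a loop; if the wake-up is delayed by more
    than `threshold`, something blocked the loop in the meantime.
    """
    while True:
        start = time.monotonic()
        await asyncio.sleep(interval)
        lag = time.monotonic() - start - interval
        if lag > threshold:
            logger.warning("Event loop blocked for ~%.2fs", lag)
```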

Also check your Duckling server response times; that could also be a bottleneck, since all processes will be simultaneously trying to access the same Duckling server.
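A quick way to sanity-check Duckling on its own is to time its /parse endpoint directly. Sketch below: /parse with form-encoded locale and text is Duckling’s standard HTTP API, while the URL, locale, and sample text are placeholders.

```python
import statistics
import time

import requests

DUCKLING_URL = "http://localhost:8000/parse"  # placeholder: point at your duckling instance

latencies = []
for _ in range(50):
    start = time.monotonic()
    # duckling's /parse endpoint takes form-encoded locale and text
    requests.post(DUCKLING_URL, data={"locale": "en_US", "text": "tomorrow at 8pm"})
    latencies.append(time.monotonic() - start)

print("p50={:.0f}ms  max={:.0f}ms".format(
    statistics.median(latencies) * 1000, max(latencies) * 1000
))
```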


You are right, rudi. I profiled the process using py-spy, and a chunk of the time is spent on policy and NLU prediction. And that’s straight-up CPU time. The time also goes up linearly with the number of concurrent users. I have profiled Duckling too: we get low latencies in our cluster even with 500 concurrent calls. I looked at my CPU with 1 worker and 4-5 concurrent users. It shoots up to 100%, and py-spy shows a significant amount of time spent in the NLU pipeline (Duckling excluded from the pipeline) and TED policy prediction, which are purely CPU-bound tasks, and hence the event loop gets blocked.

Ah I see, yes, there’s not much you can do in that case besides increasing the number of processes/workers that you’re running, and load balancing between them.