Tracker Store Selection

We are looking to set up a Rasa installation, and I was wondering if there is any guidance on how and when you would choose the different options for the Tracker Store. For example, given the choice of PostgreSQL, MongoDB, and Redis, why would you choose one over another? I can use any of them.

My guess is that the recommendation is PostgreSQL, because that is what I see in the architecture diagrams and in the Helm chart. My second guess is that it’s so you can run easy queries against the data.

So why would you choose MongoDB or Redis (or Dynamo for that matter) over PostgreSQL for just Rasa? If we already had one of them, I can see why, but when starting from scratch…

Thanks!

This might be a bit late, but we’ve recently done an investigation into the different backends, so I thought I would share the results.

Keep in mind that this investigation was very specific to the needs of our bot, so your conclusions may differ depending on your requirements. Something that was a con for us might be a pro for you, and vice versa.

SQL:

This stores each event as a row in a table. Each row has an integer id; char columns for the sender id, event type, intent, and action; a float timestamp; and a data text field containing a JSON-serialised version of the event.
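
For illustration, this is roughly what a single row holds, written out as a Python dict. The column names here are approximations based on what we observed, not necessarily the exact names Rasa uses:

```python
# Rough shape of one row in the events table; column names are
# approximate, not necessarily the exact ones Rasa uses.
event_row = {
    "id": 1042,                     # integer primary key
    "sender_id": "user-123",        # char: conversation / user identifier
    "type_name": "user",            # char: event type
    "intent_name": "greet",         # char: detected intent, if any
    "action_name": None,            # char: action name, if any
    "timestamp": 1618387200.55,     # float: unix timestamp of the event
    "data": '{"event": "user", "text": "hi"}',  # JSON-serialised event
}
```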

Pros:

  • Supports PostgreSQL, which we have the operational knowledge to run
  • Only fetches the list of events from the last session start

Cons:

  • Keeps data forever, so the database will grow continuously. We’ll have to create a tool to archive old data. This shouldn’t be too difficult, since each event is stored as a separate row.
  • There are only two indexes, on the integer id and on the sender id, which will cause slowdowns for users that interact a lot with the bot. It shouldn’t be too difficult to add indexes, but it’s something we’ll need to remember to do (see the sketch after this list). We could also look at making a PR to Rasa so that the tracker store code creates the extra indexes itself.
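
For example, adding an extra composite index by hand might look something like this, assuming SQLAlchemy for the connection. The table name (“events”) and column names are assumptions based on what we observed, and the connection URL is a placeholder:

```python
# A hedged sketch of adding an extra index to the tracker store table
# using SQLAlchemy. Table and column names are assumptions, and the
# connection URL is a placeholder.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://rasa:password@localhost:5432/rasa")

with engine.begin() as conn:
    conn.execute(text(
        "CREATE INDEX IF NOT EXISTS ix_events_sender_ts "
        "ON events (sender_id, timestamp)"
    ))
```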

In Memory:

I wouldn’t consider this one for production, since it won’t allow scaling past a single Rasa instance.

Redis:

This stores events as a simple key: value pair, where the key is the user ID and the value is a JSON-serialised string of the list of events.
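
As a rough sketch of that storage pattern (not Rasa’s actual implementation, just the general shape of it):

```python
# Sketch of the key -> JSON-list pattern described above; not Rasa's
# actual RedisTrackerStore code, just the general shape.
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def save_events(sender_id, events, record_exp=None):
    # The whole (ever-growing) list of events is serialised as one value.
    r.set(sender_id, json.dumps(events), ex=record_exp)

def fetch_events(sender_id):
    raw = r.get(sender_id)
    return json.loads(raw) if raw else []
```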

Pros:

  • It allows setting an expiry, so old data is cleaned up automatically instead of being kept forever

Cons:

  • It always just appends events, so the more the user interacts, the larger the value becomes. This is a problem if users keep interacting before our expiry kicks in: the keys never expire and the list of events keeps growing, which looks like exactly our use case. Cleaning this up is really difficult, because we’d have to go through each key, deserialise the JSON, remove old entries, and write the key back, all while somehow avoiding race conditions (a rough sketch of what that cleanup would involve is after this list).
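
To give an idea of why this is awkward, here’s a hypothetical cleanup pass using redis-py’s WATCH/MULTI to guard against concurrent writes. The key layout and the timestamp field inside each event are assumptions for illustration:

```python
# Hypothetical cleanup pass: trim each user's event list to a cutoff,
# using optimistic locking (WATCH/MULTI) to avoid clobbering a write
# that happens mid-cleanup. Key layout and event fields are assumed.
import json
import time
import redis

r = redis.Redis(host="localhost", port=6379)
CUTOFF = time.time() - 30 * 24 * 3600  # keep roughly 30 days

for key in r.scan_iter():
    with r.pipeline() as pipe:
        try:
            pipe.watch(key)                      # fail if the key changes
            events = json.loads(pipe.get(key) or "[]")
            kept = [e for e in events if e.get("timestamp", 0) >= CUTOFF]
            if len(kept) < len(events):
                pipe.multi()
                # keepttl needs Redis 6+, otherwise the expiry gets reset
                pipe.set(key, json.dumps(kept), keepttl=True)
                pipe.execute()
        except redis.WatchError:
            pass  # a concurrent write happened; skip and retry next run
```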

Dynamo:

The key is composed of the sender id and the session_date. The session_date is a bit misleading, as it is not the same for the whole session; it is just the timestamp at which the row is inserted.

It adds a new document for every message.

I didn’t look at testing this locally, since it’s not something you can host on your local machine, but looking at the source code, it looks like it stores the entire event history.

This means that for every message, it fetches the list of events, processes and adds events to that list, and then inserts a new row containing the entire updated list of events.
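
Sketched with boto3, that write pattern looks roughly like this. The table name, key names, and attribute layout are assumptions for illustration, not necessarily what the Dynamo tracker store actually uses:

```python
# Rough shape of the read-then-insert pattern described above, using
# boto3. Table name, key names, and attributes are illustrative only.
import time
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("conversations")

def append_events(sender_id, new_events):
    # Fetch the most recent item for this sender (largest session_date).
    resp = table.query(
        KeyConditionExpression=Key("sender_id").eq(sender_id),
        ScanIndexForward=False,
        Limit=1,
    )
    history = resp["Items"][0]["events"] if resp["Items"] else []
    # Insert a brand-new item holding the whole history plus new events.
    table.put_item(Item={
        "sender_id": sender_id,
        "session_date": int(time.time()),
        "events": history + new_events,
    })
```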

Pros:

  • Something we don’t have to manage
  • Cheap
  • Can keep the full conversation history for analysis, without having to worry about the performance impact of such a large database

Cons:

  • Stores the entire history for each user, so as a user uses the service more, it will slow down. Removing history will be easier here: we fetch the latest item for the user, clean up the history, and insert a new item with the cleaned-up history. Any race condition issues will just mean that the user’s history doesn’t get shortened, and it will hopefully get shortened on the next run.
  • It never updates existing items, only inserts new ones, so storage size will grow quite quickly. We could include this in the cleanup logic, deleting any older items that we no longer require.
  • This isn’t an open-source tool that we can host ourselves; we have to rely on Amazon.

MongoDB:

There’s an example custom tracker store for MongoDB that allows limiting the amount of history that we store: Using a custom tracker store to manage max event history in RASA. | by Simran Kaur Kahlon | Gray Matrix | Medium

It creates a single document per sender ID, with an index on the sender ID. For every new message it updates that document: the tracker data is overwritten and the new events are appended to the events array.
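
With pymongo, that update pattern looks roughly like this. The collection and field names are assumptions for illustration, not taken from the article’s code:

```python
# Sketch of the one-document-per-sender pattern described above;
# collection and field names are illustrative, not from the article.
from pymongo import ASCENDING, MongoClient

coll = MongoClient("mongodb://localhost:27017")["rasa"]["conversations"]
coll.create_index([("sender_id", ASCENDING)])

def save_tracker(sender_id, tracker_state, new_events):
    # One document per sender: overwrite the tracker state, but append
    # the new events instead of rewriting the whole event list.
    coll.update_one(
        {"sender_id": sender_id},
        {
            "$set": {"state": tracker_state},
            "$push": {"events": {"$each": new_events}},
        },
        upsert=True,
    )
```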

Pros:

  • It only deals with the events from the current session
  • It appends only new events, instead of overwriting the document every time

Cons:

  • We don’t have any organisational experience in running MongoDB
  • Although it only deals with events from the current session, the way it does this is to load all of the user’s events into memory and then filter them in Python (roughly the logic sketched after this list), which will lead to slowdowns over time for users that use the service often.
  • It stores all events for all time. We could create something to archive older events, and we could avoid race conditions by not touching anything in the latest session.
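
The in-Python filtering mentioned in the second con is roughly this (a sketch, not the actual code from the article):

```python
# Rough sketch of filtering a user's full event history down to the
# current session in Python; not the article's actual implementation.
def events_since_last_session(all_events):
    last_session_start = 0
    for i, event in enumerate(all_events):
        if event.get("event") == "session_started":
            last_session_start = i
    return all_events[last_session_start:]
```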

As an addition to this, we have created an archiver service for PostgreSQL, which will archive older data from postgres into S3: GitHub - praekeltfoundation/rasa-postgres-archiver: Archives older data from the Rasa PostgreSQL tracker store into S3