Purpose of Embedding Model Training in Rasa Pro

I recently came across the Rasa Pro training process at the following link.

From my understanding, the defined flows are converted into documents and then embedded to train the embedding model. I noticed that by default, Rasa Pro utilizes the text-embedding-ada-002 model from OpenAI. However, I’m curious about the purpose of training the text-embedding-ada-002 model with these documents. Specifically, if we are embedding these docs, what is the intended outcome of this process? Additionally, is there a specific loss function or labels associated with this data during training?

From my understanding, the defined flows are converted into documents and then embedded to train the embedding model.

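Not exactly. The embedding model itself is not trained or fine-tuned, so there is no loss function and there are no labels; it is only used to compute vectors. The flow embedding and vector search are used to cap the number of flows presented to the LLM in the inference step.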

If a bot has a large number of flows, we cannot send the description of every flow to the command generator and let the LLM choose one: the prompt would be too big. The fix is to reduce the number of flows presented to the LLM to a manageable number, but how do we choose which flows to send and which to ignore?

We choose by creating a semantic embedding of each flow in the training step and storing it in a vector database. When the LLM is consulted, only the most semantically relevant flows are retrieved and sent for consideration.
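To make the idea concrete, here is a rough sketch of "embed once at training time, retrieve the top matches at inference time". This is not Rasa's actual code: the flow names and descriptions are made up, `embed` is a toy bag-of-words stand-in for a real embedding model such as text-embedding-ada-002, and a plain dict stands in for the vector store.

```python
import math
from collections import Counter

# Hypothetical flow descriptions, invented for this illustration.
FLOWS = {
    "transfer_money": "send money to another account",
    "check_balance": "show the current account balance",
    "block_card": "block a lost or stolen card",
}


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# "Training" step: embed every flow once and keep the vectors.
# Nothing is learned here; the vectors are simply computed and stored.
INDEX = {name: embed(description) for name, description in FLOWS.items()}


def retrieve_flows(message, k=2):
    """Inference step: return the k flows most similar to the user message."""
    query = embed(message)
    ranked = sorted(INDEX, key=lambda name: cosine(query, INDEX[name]), reverse=True)
    return ranked[:k]
```

Calling `retrieve_flows("I lost my card", k=1)` picks `block_card` here, and only the retrieved flows would then be put into the LLM prompt. With a real embedding model the match is semantic rather than word overlap, but the shape of the process is the same.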

(I know your screenshot is from this document, but here’s a link in case anyone wants to read more: https://rasa.com/docs/rasa-pro/concepts/components/llm-command-generators/#retrieving-relevant-flows)

Could you please clarify whether the trained model (e.g., 20240703-105442-massive-grappa.tar.gz) consists of the semantic embeddings for each flow, or if it represents the embedding model itself, such as text-embedding-ada-002?

The trained model does include the semantic embeddings for each flow. I recommend unpacking the model file as an exercise to look at its components. You can see the FAISS data that represents the semantic encoding of the flows, as well as the trained representation of the flows themselves.
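If you want to peek inside without extracting anything, something like the following should work. It only uses Python's standard `tarfile` module; `list_model_files` is a helper name I made up, and the exact file names you will see inside the archive depend on your Rasa version and configuration.

```python
import tarfile


def list_model_files(model_path):
    """Return the sorted names of the regular files inside a .tar.gz model archive."""
    with tarfile.open(model_path, "r:gz") as archive:
        return sorted(member.name for member in archive.getmembers() if member.isfile())
```

For example, `list_model_files("20240703-105442-massive-grappa.tar.gz")` (adjust the path to wherever your model file lives) gives you the list of files to look through for the FAISS-related entries.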
