I was hoping you could share any details about the model architecture behind the end-to-end feature and, more specifically, how that interacts with the standard intent classification models?
Since the Rasa approach isn’t fully end-to-end ML (you’re still allowing for intent classification), I’m guessing the intent classification runs first, and if that fails (the intent match falls below a confidence threshold), the end-to-end model kicks in, taking the conversation history plus the latest utterance as input. I’m guessing it might be more complicated than that, but I couldn’t find any more explicit details.
Hi Anca, thanks for the response. The documents you posted look like they’re focused on how the feature is configured. I’m more interested in the guts of how the feature works - the specific ML model architecture, and how Rasa determines whether to do straight intent matching or use the context of the entire conversation. Do you have any of those architecture details you can share?
I am realising that I should start making algorithm whiteboard content specifically on the implementation details of e2e. One thing I can confirm is that it’s just an adaptation of TED under the hood. If you’re not familiar with TED, you may enjoy these two algorithm whiteboard videos:
The main thing that happens in the end-to-end situation is that we send more data to TED. It’s no longer just the predicted intents/slots/entities; the featurized text utterance is sent along as well. These are the same sparse/dense features that are generated in your NLU pipeline.
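To make that concrete, here's a toy sketch of the difference between the two featurization modes. All names here are illustrative, not Rasa's actual API: in the classic setup a user turn reaches the policy only as its predicted intent, while in the end-to-end setup the sparse/dense text features from the NLU pipeline ride along instead.

```python
# Hypothetical sketch (names are illustrative, NOT Rasa's actual code) of
# what "sending more data to TED" means for a single user turn.

INTENTS = ["greet", "goodbye", "ask_help"]

def featurize_turn(intent=None, text_features=None):
    """Build the per-turn feature dict a TED-like model would consume.

    Exactly one of `intent` / `text_features` is expected, mirroring the
    fact that training data contains either the text or the intent.
    """
    features = {}
    if intent is not None:
        # one-hot encode the predicted intent label
        features["intent"] = [1.0 if i == intent else 0.0 for i in INTENTS]
    if text_features is not None:
        # e.g. a dense sentence embedding produced by the NLU pipeline
        features["text"] = list(text_features)
    return features

# classic turn: featurized by the intent label only
classic = featurize_turn(intent="greet")
# end-to-end turn: featurized by the raw text's features instead
e2e = featurize_turn(text_features=[0.1, 0.4, 0.2])

print(sorted(classic))  # ['intent']
print(sorted(e2e))      # ['text']
```

In the real pipeline the text features are the sparse (e.g. count-vector) and dense (e.g. transformer embedding) outputs of your configured featurizers, but the shape of the idea is the same: the turn carries one representation or the other.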
From my current understanding this is the main difference, but there may be details that I am omitting here since I’ve not looked at the codebase in detail.
At training time, the data contains either user text or user intent, never both, so TED learns to make predictions from either one. To ensure that the test distribution matches the training distribution, we run TED with a batch of 2 at inference time when a new user message comes in: one batch example where the last user message is featurized by the intent label from the NLU pipeline, and one where it is featurized by its plain text (as in the picture).

The text-based prediction is chosen if and only if its confidence is above some threshold and its maximum similarity score is higher than that of the intent-based prediction. See here. Comparing the two is valid because the similarities come from exactly the same model. We then store which choice was made, so at the next dialogue step the dialogue history is featurized according to these decisions (intent label or text for each turn), even though both would be available at inference time.