Hi @cmills
We don’t have good material on this yet, unfortunately, as the feature is still experimental. At a high level, this is how TED handles plain text:
At training time, each example contains either the user text or the user intent, never both, so TED learns to make predictions from either one. To ensure the inference-time distribution matches the training distribution, we run TED with a batch of 2 whenever a new user message comes in:

- one batch example where the last user message is featurized by the intent label from the NLU pipeline, and
- one where it is featurized by its plain text (as in the picture).

The text-based prediction is chosen if and only if its confidence is above some threshold *and* its maximum similarity score is higher than that of the intent-based prediction. See here. Comparing the two directly is fine because both similarities come from exactly the same model. We then store which choice was made, so at the next dialogue step the dialogue history is featurized according to those earlier decisions (intent label or text for each turn), even though both would be available at inference time.
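To make the selection rule concrete, here is a minimal sketch in Python. The function and variable names are purely illustrative, not actual Rasa internals; it only shows the two conditions that must both hold for the text-based prediction to win.

```python
def choose_featurization(text_confidence: float,
                         text_similarity: float,
                         intent_similarity: float,
                         confidence_threshold: float = 0.9) -> str:
    """Hypothetical sketch of TED's inference-time choice between the
    text-based and intent-based batch examples.

    The text-based prediction is used only if BOTH:
      1. its confidence exceeds the threshold, and
      2. its max similarity beats the intent-based max similarity.
    Otherwise we fall back to the intent-based prediction.
    """
    if text_confidence > confidence_threshold and text_similarity > intent_similarity:
        return "text"
    return "intent"


# Text wins: high confidence and higher similarity.
print(choose_featurization(0.95, 0.80, 0.60))  # -> text
# Intent wins: confidence below the threshold.
print(choose_featurization(0.50, 0.80, 0.60))  # -> intent
```

Whichever branch is taken would then be recorded per turn, so that the dialogue history is featurized consistently at later steps.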