First, I want to thank Rasa for explaining their underlying model. I am using the TensorFlow pipeline (i.e., supervised_embeddings) in my bot implementation. I went through the docs to understand how the algorithm works, with the aim of improving my bot's accuracy. I read the following article:
However, I found the embedding and classification of the intent confusing. In particular, two things were not clear to me:
1- What does the intent label count vector mean?
2- Will the user's query be embedded using the features of the training set utterances?
Finally, I read the “StarSpace: Embed All The Things!” paper, but I did not find answers to my questions.
@varton I'm not sure I understand your questions correctly, but I'll answer them as I interpret them.
A count vector is a bag-of-words featurization in which the vector contains the number of times each word in the vocabulary appears in the text. Since the text here is the intent label, the count vector is the corresponding vector for the intent label.
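To make that concrete, here is a minimal sketch in plain Python. It is not Rasa's actual implementation; the helper names `build_vocab` and `count_vector`, the whitespace tokenization, and the toy training data are all assumptions for illustration.

```python
from collections import Counter


def build_vocab(texts):
    """Assign a column index to every unique whitespace token."""
    tokens = sorted({tok for text in texts for tok in text.lower().split()})
    return {tok: idx for idx, tok in enumerate(tokens)}


def count_vector(text, vocab):
    """Count how often each vocabulary token occurs in `text`."""
    counts = Counter(text.lower().split())
    return [counts[tok] for tok in vocab]


# Hypothetical training data, for illustration only.
utterances = ["hello there", "bye"]
intent_labels = ["greet", "goodbye"]

utterance_vocab = build_vocab(utterances)  # columns: bye, hello, there
label_vocab = build_vocab(intent_labels)   # columns: goodbye, greet

utterance_vector = count_vector("hello hello there", utterance_vocab)
label_vector = count_vector("greet", label_vocab)
print(utterance_vector)  # [0, 2, 1]
print(label_vector)      # [0, 1]
```

Note that when each intent label is a single token, as here, its count vector is effectively a one-hot vector.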
If you mean the user's query at inference time, then no: the user's query will have its own computed vector. Only the vocabulary that was built during training is shared.
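A rough sketch of what that means in practice, again in plain Python rather than Rasa's real code (the helper names and the example sentences are made up): the vocabulary is fixed after training, and the query's vector is computed from the query's own words against that vocabulary. In this sketch, words that were never seen in training simply have no column.

```python
from collections import Counter


def build_vocab(texts):
    """Assign a column index to every unique whitespace token."""
    tokens = sorted({tok for text in texts for tok in text.lower().split()})
    return {tok: idx for idx, tok in enumerate(tokens)}


def count_vector(text, vocab):
    """Count vocabulary tokens in `text`; unknown tokens are ignored."""
    counts = Counter(text.lower().split())
    return [counts[tok] for tok in vocab]


# Training time: the vocabulary is built once from the training utterances.
train_utterances = ["what is the weather", "hello there"]
vocab = build_vocab(train_utterances)  # hello, is, the, there, weather, what

# Inference time: the query gets its own vector over the same columns.
query = "what is the forecast"
query_vector = count_vector(query, vocab)
unseen = [tok for tok in query.lower().split() if tok not in vocab]

print(query_vector)  # [0, 1, 1, 0, 0, 1]
print(unseen)        # ['forecast']
```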
@varton Count vectors for user utterances and intent labels can be built from a shared or an independent vocabulary. This is configurable in CountVectorsFeaturizer. If it's independent, then for user utterances the vocabulary is constructed from the unique words across all utterances in the training set, and for intent labels the vocabulary is constructed from the unique words across all intent labels in the training set.
If the vocabulary is shared, a common vocabulary is constructed from the unique words across all user utterances and intent labels in the training set.
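The difference between the two setups can be sketched like this. This is an illustration in plain Python, not the featurizer's code; the helper names and data are hypothetical. As far as I know, the corresponding Rasa option is `use_shared_vocab`, but please check the docs for your version.

```python
from collections import Counter


def build_vocab(texts):
    """Assign a column index to every unique whitespace token."""
    tokens = sorted({tok for text in texts for tok in text.lower().split()})
    return {tok: idx for idx, tok in enumerate(tokens)}


def count_vector(text, vocab):
    """Count how often each vocabulary token occurs in `text`."""
    counts = Counter(text.lower().split())
    return [counts[tok] for tok in vocab]


# Hypothetical training data, for illustration only.
utterances = ["hello there", "what is the weather"]
intent_labels = ["greet", "weather"]

# Independent: one vocabulary per input type.
utterance_vocab = build_vocab(utterances)  # 6 columns
label_vocab = build_vocab(intent_labels)   # 2 columns

# Shared: a single vocabulary over utterances and labels together.
shared_vocab = build_vocab(utterances + intent_labels)  # 7 columns

# With a shared vocabulary, the token "weather" lands in the same column
# whether it comes from an utterance or from an intent label.
shared_utterance = count_vector("what is the weather", shared_vocab)
shared_label = count_vector("weather", shared_vocab)
print(shared_utterance)  # [0, 0, 1, 1, 0, 1, 1]
print(shared_label)      # [0, 0, 0, 0, 0, 1, 0]
```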
What is the purpose of creating a count vector for the intent label? Is it enough to create the count vector for the user utterances associated with that intent?
@varton The intent label can also contain useful tokens that help in learning an embedding for the intent. Also, in the case of multiple intents, a count vectorizer is a good way to handle multiple tokens in the intent label.
Thanks, @dakshvar22, for clarifying things. I really wish there were a solid example that describes the models in detail to benefit other Rasa users.
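Here is a small sketch of the multi-intent case, assuming labels are joined with a `+` symbol and split on it before counting. The function names and labels are hypothetical, and this is not Rasa's code; I believe the relevant Rasa settings are `intent_tokenization_flag` and `intent_split_symbol`, but treat that as something to verify for your version.

```python
from collections import Counter


def tokenize_label(label, split_symbol="+"):
    """Split a (possibly combined) intent label into its parts."""
    return label.split(split_symbol)


def build_label_vocab(labels, split_symbol="+"):
    """Assign a column index to every unique intent token."""
    tokens = sorted({tok for lab in labels for tok in tokenize_label(lab, split_symbol)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def label_count_vector(label, vocab, split_symbol="+"):
    """Count how often each intent token occurs in the label."""
    counts = Counter(tokenize_label(label, split_symbol))
    return [counts[tok] for tok in vocab]


# Hypothetical intent labels, one of them a combined intent.
labels = ["greet", "ask_weather", "greet+ask_weather"]
vocab = build_label_vocab(labels)  # columns: ask_weather, greet

single = label_count_vector("greet", vocab)
combined = label_count_vector("greet+ask_weather", vocab)
print(single)    # [0, 1]
print(combined)  # [1, 1]
```

The combined label shares its tokens with the two single intents, so its vector overlaps with theirs instead of being an unrelated one-hot entry.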