RASA Word Embeddings Confusion

Hi all,

First, I want to thank Rasa for explaining their underlying model. I am using the TensorFlow pipeline (i.e., supervised_embeddings) in my bot implementation. I went through the docs to understand how the algorithm works, with the aim of improving my bot's accuracy. I read the following article:

However, I found the embedding and intent classification steps confusing. In particular, two things were not clear to me: 1- What does the intent label count vector mean? 2- Is the user’s query embedded using the features of the training set utterances?

Finally, I read the “StarSpace: Embed All The Things!” paper, but I did not find answers to my questions.

Thanks in advance :slight_smile:

@varton Not sure if I understand your questions correctly, but I’ll answer them as I interpret them.

  1. A count vector is a bag-of-words featurization: the vector holds the number of times each word in the vocabulary appears in the text. Since the text here is the intent label, the count vector is the bag-of-words vector of the intent label.

  2. If you mean the user’s query at inference time, then no, the user’s query will have its own computed vector. Only the vocabulary that was used during training is shared.
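To make both points concrete, here is a minimal sketch in plain Python. It is not Rasa's actual code: the tokenizer is a simplification, and the utterances, intent labels, and helper names (`tokenize`, `build_vocabulary`, `count_vector`) are made up for illustration. It shows a count vector for an intent label, and a query at inference time getting its own vector over the training vocabulary, with unseen words simply dropped.

```python
from collections import Counter


def tokenize(text):
    # Simplified tokenizer (an assumption): lowercase, split on whitespace,
    # strip basic punctuation from the ends of each token.
    tokens = (tok.strip("?.!,").lower() for tok in text.split())
    return [tok for tok in tokens if tok]


def build_vocabulary(texts):
    # Map every unique token seen in training to a column index.
    unique_tokens = sorted({tok for text in texts for tok in tokenize(text)})
    return {tok: idx for idx, tok in enumerate(unique_tokens)}


def count_vector(text, vocabulary):
    # One slot per vocabulary word; the value is how often it occurs in `text`.
    vector = [0] * len(vocabulary)
    for tok, n in Counter(tokenize(text)).items():
        if tok in vocabulary:  # words never seen in training are ignored
            vector[vocabulary[tok]] = n
    return vector


# Hypothetical training data.
training_utterances = ["What is the weather?", "Show me the weather forecast"]
intent_labels = ["greet", "weather"]

utterance_vocab = build_vocabulary(training_utterances)
label_vocab = build_vocabulary(intent_labels)

# Point 1: the count vector of the intent label "weather".
label_vector = count_vector("weather", label_vocab)

# Point 2: a new query gets its own vector, built over the training vocabulary.
query_vector = count_vector("Is the weather the same in Paris?", utterance_vocab)

print(utterance_vocab)
print(label_vector)
print(query_vector)
```

Here "same", "in", and "paris" were never seen in training, so they contribute nothing to the query vector, while "the" is counted twice.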

Let me know if you have any more questions.

Thanks @dakshvar22 for your reply.

Regarding the first point, for example, if I have a “Weather” intent that includes the following utterances:

  • What is the weather?
  • Show me the weather
  • Display the weather forecast

Then, will the count vector of the “Weather” intent be the number of times the utterances’ words appear in the vocabulary?

To make sure I understand correctly: is the vocabulary constructed from the unique words in the entire training set?

For the second point, you have answered my question :slight_smile:

Thanks

@varton Count vectors for user utterances and intent labels can be built from a shared or an independent vocabulary. This is configurable in CountVectorsFeaturizer. If it’s independent, then for user utterances the vocabulary is constructed from the unique words across all utterances in the training set, and for intent labels the vocabulary is constructed from the unique words across all intent labels in the training set. If the vocabulary is shared, a common vocabulary is constructed from the unique words across all user utterances and intent labels in the training set.
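As a rough illustration of the two modes, here is a small sketch in plain Python (not Rasa's implementation). It reuses the three weather utterances from above and adds a made-up "greet" intent with one utterance; the helper names are hypothetical. In Rasa itself I believe the switch is the featurizer's `use_shared_vocab` option, but please check the docs for your version.

```python
def tokenize(text):
    # Simplified tokenizer (an assumption): lowercase, split on whitespace,
    # strip basic punctuation from the ends of each token.
    tokens = (tok.strip("?.!,").lower() for tok in text.split())
    return [tok for tok in tokens if tok]


def build_vocabulary(texts):
    # Sorted unique tokens across all given texts, mapped to column indices.
    unique_tokens = sorted({tok for text in texts for tok in tokenize(text)})
    return {tok: idx for idx, tok in enumerate(unique_tokens)}


# Training data: intent label -> utterances ("greet" is a made-up second intent).
training_data = {
    "weather": [
        "What is the weather?",
        "Show me the weather",
        "Display the weather forecast",
    ],
    "greet": ["Hello there"],
}

all_utterances = [u for utterances in training_data.values() for u in utterances]
all_labels = list(training_data)

# Independent vocabularies: one for utterances, a separate one for labels.
utterance_vocab = build_vocabulary(all_utterances)
label_vocab = build_vocabulary(all_labels)

# Shared vocabulary: a single one over utterances and labels together.
shared_vocab = build_vocabulary(all_utterances + all_labels)

print(len(utterance_vocab), sorted(utterance_vocab))
print(len(label_vocab), sorted(label_vocab))
print(len(shared_vocab), sorted(shared_vocab))
```

With independent vocabularies, the "weather" label is a 2-dimensional vector over {greet, weather}. With the shared vocabulary, the label and the utterances live in the same 11-dimensional space, and the word "weather" occupies the same column in both.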

Thanks @dakshvar22 for your quick reply.

I have one final question :slightly_smiling_face:

What is the purpose of creating a count vector for the intent label? Isn’t it enough to create count vectors for the user utterances associated with that intent?

@varton The intent label can also contain useful tokens that help in learning an embedding for the intent. Also, in the case of multi-intents, a count vectorizer is a good way to handle the multiple tokens in the intent label.
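A small sketch of the multi-intent case, again in plain Python with made-up labels and helper names. It assumes the label is split on a "+" symbol; in Rasa that split is a tokenizer setting, so treat the symbol here as an assumption.

```python
def tokenize_label(label, split_symbol="+"):
    # Split a (possibly multi-) intent label into its component tokens.
    # The "+" separator is an assumption for this example.
    return [tok for tok in label.split(split_symbol) if tok]


def build_label_vocabulary(labels):
    # Unique label tokens across all intent labels, mapped to column indices.
    unique_tokens = sorted({tok for label in labels for tok in tokenize_label(label)})
    return {tok: idx for idx, tok in enumerate(unique_tokens)}


def label_count_vector(label, vocabulary):
    # Count how often each vocabulary token appears in the label.
    vector = [0] * len(vocabulary)
    for tok in tokenize_label(label):
        vector[vocabulary[tok]] += 1
    return vector


# Hypothetical intent labels, one of them a multi-intent.
intent_labels = ["greet", "ask_weather", "greet+ask_weather"]
label_vocab = build_label_vocabulary(intent_labels)

label_vectors = {
    label: label_count_vector(label, label_vocab) for label in intent_labels
}
for label, vector in label_vectors.items():
    print(label, vector)
```

The multi-intent "greet+ask_weather" ends up with a vector that overlaps with both "greet" and "ask_weather", so its features are related to those of the single intents instead of being an unrelated one-hot ID.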

Thanks @dakshvar22 for clarifying things. I really wish there were a solid example that describes the models in detail, to benefit other Rasa users.

Hi, I’m trying to understand this but still don’t get it. Could you help correct me? I’m afraid my misunderstanding has only gotten deeper.

  • shared vocabulary : pretrained embedding

  • independent vocabulary : no pretrained or word vector source used in pipeline

  • user utterance = user message

  • “vocabulary is constructed with unique words across all utterances in training set” = whole nlu.md file

  • “intent labels the vocabulary is constructed with unique words across all intent labels in the training set” = also whole nlu.md file?

Thanks :slight_smile: