Hi Patrick.
I've been working on content that explains the overview of the pipeline better, so let me try to connect some dots.
In Rasa, the NLU pipeline is trying to predict intents and entities.
The pipeline starts with text on one end, which is processed by multiple steps before we have our predictions. One of the important parts is to take tokens (extracted from the text) and to attach features to them.
What features we attach depends on the steps in the pipeline but generally we generate two types of features:
- Sparse Features: usually generated by the CountVectorsFeaturizer. Note that these counts may represent subwords as well. We also have a LexicalSyntacticFeaturizer that generates window-based features useful for entity recognition. When combined with spaCy it can be configured to also include part-of-speech features.
- Dense Features: these are typically pre-trained embeddings, commonly from the SpacyFeaturizer or from Hugging Face via the LanguageModelFeaturizer. For these to work, you should also include the appropriate tokenizer in your pipeline. More details are in the documentation.
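To give an intuition for what sparse token features look like, here's a toy sketch in plain Python. This is not Rasa's actual CountVectorsFeaturizer (which builds on scikit-learn's CountVectorizer); it just illustrates how word counts plus character n-grams ("subwords") turn each token into a sparse count vector:

```python
# Toy illustration of sparse count features, NOT Rasa's actual implementation.
from collections import Counter

def char_ngrams(token, n_min=2, n_max=3):
    """Character n-grams act as 'subword' features."""
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(token[i:i + n] for i in range(len(token) - n + 1))
    return grams

def sparse_features(tokens):
    """One sparse count vector (here: a dict) per token, word + subword counts."""
    return [Counter([tok] + char_ngrams(tok)) for tok in tokens]

feats = sparse_features(["rasa", "nlu"])
# feats[0] counts the word "rasa" plus its 2-/3-grams ("ra", "as", "sa", ...)
```

In the real pipeline these counts live in sparse matrices (most entries are zero), which is why they're called sparse features.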
Besides features for tokens, we also generate features for the entire utterance. This is sometimes also referred to as the CLS token. The sparse features in this token are a sum of all the sparse features in the tokens. The dense features are either a pooled sum/mean of word vectors (in the case of spaCy) or a contextualised representation of the entire text (in the case of huggingface models).
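The two pooling strategies for the sentence-level features can be sketched in a few lines. Again a hedged toy example, with made-up vectors rather than real embeddings:

```python
# Sketch of how per-token features become one sentence-level ("CLS") feature:
# sparse features are summed, dense word vectors are mean-pooled (spaCy-style).

def sum_sparse(token_counts):
    """Sentence-level sparse features: element-wise sum of token counts."""
    total = {}
    for counts in token_counts:
        for key, value in counts.items():
            total[key] = total.get(key, 0) + value
    return total

def mean_pool(token_vectors):
    """Sentence-level dense features: mean of the token vectors."""
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / len(token_vectors)
            for i in range(dim)]

sparse = sum_sparse([{"rasa": 1, "ra": 1}, {"nlu": 1}])   # -> summed counts
dense = mean_pool([[1.0, 2.0], [3.0, 4.0]])               # -> [2.0, 3.0]
```

A Hugging Face model would instead output a contextualised vector for the whole text directly, rather than pooling per-token vectors this way.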
Note that there’s a community-maintained project called rasa-nlu-examples that contains many experimental featurizers for non-English languages. It’s not part of the main Rasa repository but can be of help to many users, as there are over 275 languages supported. That library also supports gensim and GloVe embeddings.
What I hope is becoming clear here is that you can pretty much attach embeddings as you see fit. The original video mentions GloVe because it was used in a benchmark, but you can attach any features as long as you keep them compatible with the Rasa API.
After featurization we have the DIET model. It takes both the token-level features and the sentence-level features to predict intents and entities.
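To make this concrete, the featurizers and DIET are declared as a list of components in `config.yml`. Here's a minimal sketch (component names as in Rasa 2.x; check the docs for your version, and `epochs: 100` is just an illustrative setting):

```yaml
language: en
pipeline:
  - name: WhitespaceTokenizer
  - name: CountVectorsFeaturizer        # sparse word counts
  - name: CountVectorsFeaturizer        # sparse subword counts
    analyzer: char_wb
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier                # predicts intents + entities
    epochs: 100
```

Swapping in dense features is a matter of adding, say, a SpacyTokenizer plus SpacyFeaturizer (or a LanguageModelFeaturizer) above the DIETClassifier.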
Now, just to emphasize with an example, let’s talk about how these components interact with each other.
The way the pipeline passes information along is via a Message object. A message is like a dictionary that changes as components process it. Because components keep adding/replacing information, you can easily attach extra models. Typically you’d add extra entity extractors that are specialized towards a certain task. Let’s say you’re using the RegexEntityExtractor to attach entities via a name-list; then the message expands with an extra entry in its entities list.
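As a hedged sketch (plain Python dicts, not Rasa’s actual Message class, and the intent name and confidence are made-up illustrative values), the enrichment looks roughly like this:

```python
# Illustration only: each pipeline component enriches the message in turn.
message = {"text": "i use rasa"}

# A tokenizer adds tokens...
message["text_tokens"] = ["i", "use", "rasa"]

# ...a classifier adds an intent prediction...
message["intent"] = {"name": "talk_code", "confidence": 0.91}

# ...and an extra entity extractor appends to the entities list.
message.setdefault("entities", []).append(
    {"entity": "proglang", "value": "rasa", "extractor": "RegexEntityExtractor"}
)
```

This is why components compose so easily: each one only reads the keys it needs and writes its own.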
If you’re interested in exploring what’s happening more directly, you might like to play around with the Printer object from rasa-nlu-examples. It’s documented here and it gives information about the Message. An example is shown below.
```python
{
    'text': 'rasa nlu examples',
    'intent': {'name': 'out_of_scope', 'confidence': 0.4313829839229584},
    'entities': [
        {
            'entity': 'proglang',
            'start': 0,
            'end': 4,
            'confidence_entity': 0.42326217889785767,
            'value': 'rasa',
            'extractor': 'DIETClassifier'
        }
    ],
    'text_tokens': ['rasa', 'nlu', 'examples'],
    'intent_ranking': [
        {'name': 'out_of_scope', 'confidence': 0.4313829839229584},
        {'name': 'goodbye', 'confidence': 0.2445288747549057},
        {'name': 'bot_challenge', 'confidence': 0.23958507180213928},
        {'name': 'greet', 'confidence': 0.04896979033946991},
        {'name': 'talk_code', 'confidence': 0.035533301532268524}
    ],
    'dense': {
        'sequence': {'shape': (3, 25), 'dtype': dtype('float32')},
        'sentence': {'shape': (1, 25), 'dtype': dtype('float32')}
    },
    'sparse': {
        'sequence': {'shape': (3, 1780), 'dtype': dtype('float64'), 'stored_elements': 67},
        'sentence': {'shape': (1, 1756), 'dtype': dtype('int64'), 'stored_elements': 32}
    }
}
```
Let me know if this helps or if there are still gaps in your knowledge.