After running rasa init and opening config.yml, our pipeline looks like this:
pipeline:
# # No configuration for the NLU pipeline was provided. The following default pipeline was used to train your model.
# # If you'd like to customize it, uncomment and adjust the pipeline.
# # See https://rasa.com/docs/rasa/tuning-your-model for more information.
  - name: WhitespaceTokenizer
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: char_wb
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 100
    constrain_similarities: true
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 100
Is there a reason why CountVectorsFeaturizer appears twice in our pipeline?
@BrookieHub You may have seen all of this before, but it should still help you understand how the NLP pipeline works.
- name: CountVectorsFeaturizer
Description:
Creates features for intent classification and response selection. Creates bag-of-words representation of user message, intent, and response using sklearn’s CountVectorizer. All tokens which consist only of digits (e.g. 123 and 99 but not a123d) will be assigned to the same feature.
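As a rough illustration of the bag-of-words idea described above, here is a minimal pure-Python sketch. It is not Rasa's or sklearn's actual code: the function name bag_of_words and the placeholder NUMBER_TOKEN are assumptions made for this example, and tokenization is a plain whitespace split.

```python
from collections import Counter

# Hypothetical placeholder that all digit-only tokens are mapped to in this sketch.
NUMBER_TOKEN = "__NUMBER__"


def bag_of_words(message: str) -> Counter:
    """Count whitespace-separated tokens, merging digit-only tokens into one feature."""
    tokens = message.lower().split()
    # "123" and "99" are digit-only and get merged; "a123d" is kept as its own token.
    normalized = [NUMBER_TOKEN if tok.isdigit() else tok for tok in tokens]
    return Counter(normalized)


counts = bag_of_words("transfer 123 and 99 to account a123d")
print(counts)
```

In this sketch, 123 and 99 end up in the same count while a123d stays a separate feature, which matches the behaviour the description mentions.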
This featurizer can be configured to use word or character n-grams, using the analyzer configuration parameter. By default analyzer is set to word so word token counts are used as features.
If you want to use character n-grams, set analyzer to char or char_wb.
What is an n-gram?
An n-gram is a sequence of N consecutive items, usually words. So for example, “Medium blog” is a 2-gram (a bigram), “A Medium blog post” is a 4-gram, and “Write on Medium” is a 3-gram (trigram). With analyzer set to char or char_wb, the items are characters instead of words.
The lower and upper boundaries of the n-grams can be configured via the parameters min_ngram and max_ngram. By default both of them are set to 1.
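To make the effect of min_ngram and max_ngram more concrete, here is a small pure-Python sketch that generates word n-grams and word-bounded character n-grams. The helper names word_ngrams and char_wb_ngrams are made up for this example, and the character version is only a simplified approximation of what sklearn's char_wb analyzer does (each word is padded with one space on each side and n-grams never cross word boundaries); edge cases such as very short words may be handled differently by the real implementation.

```python
def word_ngrams(text, min_ngram=1, max_ngram=1):
    """Return all word n-grams with min_ngram <= n <= max_ngram."""
    tokens = text.split()
    grams = []
    for n in range(min_ngram, max_ngram + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def char_wb_ngrams(text, min_ngram=1, max_ngram=1):
    """Return character n-grams taken inside each space-padded word."""
    grams = []
    for word in text.split():
        padded = f" {word} "  # the padding marks the word boundaries
        for n in range(min_ngram, max_ngram + 1):
            for i in range(len(padded) - n + 1):
                grams.append(padded[i:i + n])
    return grams


print(word_ngrams("A Medium blog post", min_ngram=1, max_ngram=2))
print(char_wb_ngrams("hi", min_ngram=2, max_ngram=3))
```

With the default bounds (both 1), word_ngrams simply returns the individual words, which corresponds to the first CountVectorsFeaturizer entry in the pipeline.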
Summary:
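The first CountVectorsFeaturizer works at the word level (the default analyzer, word), and the second one works at the character level (analyzer char_wb, with n-grams of 1 to 4 characters).
CountVectorsFeaturizer can also help if you have some very domain-specific words, because its vocabulary is built from your own training data. For example, “balance” could mean very different things in finance versus general English.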
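One way to see why having both a word-level and a character-level featurizer can be useful is to compare a word with a misspelled variant. The sketch below is only an illustration under my own assumptions: the helpers char_wb_ngrams and jaccard and the example typo "balanse" are invented for this post, and the n-gram generation is a simplified stand-in for the char_wb analyzer, not Rasa's code.

```python
def char_wb_ngrams(word, min_ngram=1, max_ngram=4):
    """Set of character n-grams of a single space-padded word."""
    padded = f" {word} "
    return {
        padded[i:i + n]
        for n in range(min_ngram, max_ngram + 1)
        for i in range(len(padded) - n + 1)
    }


def jaccard(a, b):
    """Share of features two sets have in common (0.0 = none, 1.0 = identical)."""
    return len(a & b) / len(a | b)


correct, typo = "balance", "balanse"

# Word level: the two spellings are simply different tokens.
word_overlap = jaccard({correct}, {typo})

# Character level: most short n-grams are still shared.
char_overlap = jaccard(char_wb_ngrams(correct), char_wb_ngrams(typo))

print(f"word-level overlap: {word_overlap:.2f}")
print(f"char-level overlap: {char_overlap:.2f}")
```

At the word level the two spellings share no features at all, while at the character level they still have many n-grams in common, so the second featurizer can still give the classifier something to work with.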
I hope this gives you a better idea and answers your question about using both of them. Good luck!