When running "rasa init", why does the config.yml file have two "CountVectorsFeaturizer" entries?

After running rasa init and opening config.yml, our pipeline looks like this:

pipeline:
# # No configuration for the NLU pipeline was provided. The following default pipeline was used to train your model.
# # If you'd like to customize it, uncomment and adjust the pipeline.
# # See https://rasa.com/docs/rasa/tuning-your-model for more information.
  - name: WhitespaceTokenizer
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: char_wb
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 100
    constrain_similarities: true
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 100

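Is there a reason why we need two CountVectorsFeaturizer entries in our pipeline?

The first entry uses the default settings, so it counts whole word tokens. The second sets analyzer: char_wb, so it counts character n-grams (here 1 to 4 characters long) taken inside word boundaries.

You can learn more about CountVectorsFeaturizer in the Rasa docs, and about the underlying CountVectorizer in the scikit-learn docs.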
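To make the difference concrete, here is a rough sketch in plain Python of what the two entries count. It is not Rasa's or scikit-learn's code: the function names are made up for illustration, and the char_wb version is simplified (it pads each word with one space, as scikit-learn describes, but ignores edge cases such as one-letter words).

```python
from collections import Counter


def word_counts(text):
    """Word-level bag of words: count lowercased, whitespace-separated tokens."""
    return Counter(text.lower().split())


def char_wb_ngrams(text, min_n=1, max_n=4):
    """Simplified char_wb: character n-grams that never cross a word boundary.

    Each word is padded with one space on both sides before slicing.
    """
    ngrams = []
    for word in text.lower().split():
        padded = " " + word + " "
        for n in range(min_n, max_n + 1):
            # Slide a window of width n over the padded word.
            for start in range(len(padded) - n + 1):
                ngrams.append(padded[start:start + n])
    return ngrams


if __name__ == "__main__":
    print(word_counts("check balance check"))
    print(Counter(char_wb_ngrams("balance")).most_common(5))
```

Because the character-level features are built from pieces of words, a misspelling like "balanse" still shares most of its n-grams with "balance", which a word-level count alone would not capture.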

In the pipeline, every component will pass its output as the input of the next component, so the order matters.
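As a toy illustration of why order matters, the sketch below chains two made-up components over a plain dict. This is not Rasa's component API; the function names and the "tokens" and "features" keys are assumptions chosen only for the example.

```python
def whitespace_tokenizer(message):
    """Hypothetical tokenizer: adds a 'tokens' list to the message."""
    message["tokens"] = message["text"].split()
    return message


def word_count_featurizer(message):
    """Hypothetical featurizer: needs 'tokens', so it must run after a tokenizer."""
    counts = {}
    for token in message["tokens"]:
        counts[token] = counts.get(token, 0) + 1
    # Each featurizer appends its own features; later components see all of them.
    message.setdefault("features", []).append(counts)
    return message


def run_pipeline(message, components):
    """Pass the message through each component in the listed order."""
    for component in components:
        message = component(message)
    return message


if __name__ == "__main__":
    result = run_pipeline(
        {"text": "check balance"},
        [whitespace_tokenizer, word_count_featurizer],
    )
    print(result["features"])
```

Swapping the two components in the list makes the featurizer fail with a KeyError, because the tokens it depends on do not exist yet.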


@BrookieHub You may have seen all of this before, but it should still help you understand how the NLP pipeline works.

- name: CountVectorsFeaturizer

Description:
Creates features for intent classification and response selection. Creates bag-of-words representation of user message, intent, and response using sklearn’s CountVectorizer. All tokens which consist only of digits (e.g. 123 and 99 but not a123d) will be assigned to the same feature.
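The digit rule in that description can be sketched in a few lines of plain Python. This only illustrates the idea and is not Rasa's implementation; the placeholder string and function name are assumptions.

```python
import re
from collections import Counter

# Hypothetical placeholder: every digits-only token is mapped to this one feature.
NUMBER_PLACEHOLDER = "__NUMBER__"


def bag_of_words(text):
    """Count word tokens, collapsing tokens made only of digits into one feature."""
    tokens = text.lower().split()
    normalized = [
        NUMBER_PLACEHOLDER if re.fullmatch(r"[0-9]+", token) else token
        for token in tokens
    ]
    return Counter(normalized)


if __name__ == "__main__":
    # "123" and "99" share a feature; "a123d" keeps its own.
    print(bag_of_words("transfer 123 and 99 to a123d"))
```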

- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4

Description:

This featurizer can be configured to use word or character n-grams, using the analyzer configuration parameter. By default analyzer is set to word so word token counts are used as features.

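If you want to use character n-grams, set analyzer to char or char_wb. With char_wb, n-grams are created only from text inside word boundaries, and n-grams at the edges of words are padded with a space.

What is an n-gram?

An n-gram is a sequence of N consecutive items. With word n-grams the items are words: “Medium blog” is a 2-gram (a bigram), “Write on Medium” is a 3-gram (a trigram), and “A Medium blog post” is a 4-gram. With analyzer set to char or char_wb the items are characters instead, so “bal” is a character 3-gram of “balance”.

The lower and upper boundaries of the n-grams can be configured via the parameters min_ngram and max_ngram. By default both of them are set to 1. In the second entry of the default pipeline they are set to 1 and 4, so character n-grams of length 1 to 4 are counted.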
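Here is a small sketch of how the two boundaries select word n-grams. The function is written for this post, not taken from Rasa; its parameter names simply mirror the config keys.

```python
def word_ngrams(text, min_ngram=1, max_ngram=1):
    """Return all word n-grams of text with min_ngram <= n <= max_ngram."""
    tokens = text.split()
    result = []
    for n in range(min_ngram, max_ngram + 1):
        # Slide a window of n consecutive tokens over the sentence.
        for start in range(len(tokens) - n + 1):
            result.append(" ".join(tokens[start:start + n]))
    return result


if __name__ == "__main__":
    # Defaults (1, 1): single words only.
    print(word_ngrams("A Medium blog post"))
    # Unigrams and bigrams.
    print(word_ngrams("Write on Medium", min_ngram=1, max_ngram=2))
```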

Summary:

The first entry works at the word level, and the second one works at the character level.

CountVectorsFeaturizer is also useful if you have very domain-specific words, because its features are built from your own training data. For example, "balance" can mean very different things in finance and in general English.

I hope this gives you a better idea and answers your question about using them :slight_smile: Good luck!
