Rasa Pipeline Doubt

Hi! I’m new to Rasa. I have the following Rasa NLU pipeline, which is also the default pipeline provided by Rasa.

Configuration for Rasa NLU.

Components

language: en
pipeline:
  - name: WhitespaceTokenizer
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: "char_wb"
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 100
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 100

My question is: why is CountVectorsFeaturizer listed twice? I know what it does, but I don’t understand why we have two CountVectorsFeaturizer entries. Also, does the output of one pipeline component serve as input to the next? Thanks in advance.

Hi and welcome to the Rasa community!

Yes, a pipeline consists of a sequence of components which are executed one after another. So the order of the components matters.
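To make the ordering point concrete, here is a minimal sketch in plain Python. It is not Rasa's actual code: the class names, the process method, and the dictionary used as a message are all hypothetical and exist only for illustration. Each component reads from and writes to the same message, so a featurizer that needs tokens only works if a tokenizer has already run.

```python
class WhitespaceTokenizerSketch:
    """Hypothetical stand-in for a tokenizer component."""

    def process(self, message):
        # Writes "tokens" so that later components can use them.
        message["tokens"] = message["text"].split()


class TokenCountFeaturizerSketch:
    """Hypothetical stand-in for a featurizer that depends on tokens."""

    def process(self, message):
        # Reads "tokens"; raises KeyError if no tokenizer ran earlier.
        message.setdefault("features", []).append(len(message["tokens"]))


def run_pipeline(components, text):
    # Components are applied in list order to one shared message.
    message = {"text": text}
    for component in components:
        component.process(message)
    return message


result = run_pipeline(
    [WhitespaceTokenizerSketch(), TokenCountFeaturizerSketch()],
    "book a table",
)
print(result["tokens"], result["features"])
```

Swapping the two components in the list makes the featurizer fail with a KeyError, because the tokens it expects have not been produced yet.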

The default pipeline indeed uses two instances of CountVectorsFeaturizer. The first one featurizes text based on words ("word" is the default value for analyzer). The second one featurizes text based on character n-grams, preserving word boundaries. We empirically found the second featurizer to be more powerful, but we decided to keep the first featurizer as well to make featurization more robust.
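As a rough illustration of the difference, the following plain-Python sketch compares word-level tokens with character n-grams that stay inside word boundaries. It is a simplified approximation of what a "char_wb" analyzer does, not the featurizer's real implementation, and the function names are made up for this example. It suggests one reason the character-level featurizer helps: a misspelled word still shares many n-grams with the correct spelling, while at word level the two have nothing in common.

```python
def word_tokens(text):
    # Word-level view: lowercase and split on whitespace.
    return text.lower().split()


def char_wb_ngrams(text, min_n=1, max_n=4):
    # Character n-grams taken inside each word. Each word is padded with one
    # space on each side, so no n-gram ever spans two words.
    # Simplification: n-grams longer than the padded word are skipped.
    grams = []
    for word in text.lower().split():
        padded = f" {word} "
        for n in range(min_n, max_n + 1):
            for i in range(len(padded) - n + 1):
                grams.append(padded[i:i + n])
    return grams


correct, typo = "booking", "bookng"

# Word level: the two spellings are entirely different tokens.
shared_words = set(word_tokens(correct)) & set(word_tokens(typo))

# Character level: many n-grams are shared despite the typo.
shared_grams = set(char_wb_ngrams(correct)) & set(char_wb_ngrams(typo))

print(len(shared_words), len(shared_grams))
```

The min_n and max_n arguments mirror the min_ngram: 1 and max_ngram: 4 settings of the second featurizer in the config above.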

To learn more about pipelines in general, have a look at our docs on Choosing a Pipeline.

Got it. Thanks.