My question is: why is CountVectorsFeaturizer listed twice? I know what it does, but I don't understand why we have two CountVectorsFeaturizer entries.
Also, does the output of one pipeline component serve as the input to the next?
Thanks in advance.
Yes, a pipeline consists of a sequence of components which are executed one after another. So the order of the components matters.
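To illustrate why order matters, here is a minimal sketch in plain Python. It is not Rasa's actual component API: the names whitespace_tokenizer, token_counter, and run_pipeline are hypothetical, and the message is assumed to be a simple dict that each component reads from and adds to.

```python
from typing import Callable, Dict, List

# Hypothetical stand-ins for a pipeline message and a pipeline component.
Message = Dict[str, object]
Component = Callable[[Message], Message]


def whitespace_tokenizer(message: Message) -> Message:
    # Adds a "tokens" entry that later components can rely on.
    message["tokens"] = str(message["text"]).lower().split()
    return message


def token_counter(message: Message) -> Message:
    # Depends on "tokens"; raises KeyError if the tokenizer has not run yet.
    counts: Dict[str, int] = {}
    for token in message["tokens"]:
        counts[token] = counts.get(token, 0) + 1
    message["features"] = counts
    return message


def run_pipeline(components: List[Component], text: str) -> Message:
    # Each component receives the message as left by the previous one.
    message: Message = {"text": text}
    for component in components:
        message = component(message)
    return message


result = run_pipeline([whitespace_tokenizer, token_counter], "book a table a")
print(result["features"])
```

Swapping the two components in the list makes token_counter fail, because the "tokens" entry it needs has not been produced yet.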
The default pipeline indeed uses two instances of CountVectorsFeaturizer. The first one featurizes text based on whole words (as you can see here, "word" is the default value for analyzer). The second one featurizes text based on character n-grams, preserving word boundaries (the "char_wb" analyzer). We empirically found the second featurizer to be more powerful, but we decided to keep the first featurizer as well to make featurization more robust.
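The difference between the two featurizers can be sketched in plain Python. This is a simplified approximation and not the featurizer's real implementation: the function names are hypothetical, the n-gram range of 1 to 4 is an assumption chosen for illustration, and edge cases such as words shorter than n are handled more simply than in a real vectorizer.

```python
from collections import Counter


def word_features(text: str) -> Counter:
    # Word-level bag of words: one feature per whole token.
    return Counter(text.lower().split())


def char_wb_features(text: str, min_n: int = 1, max_n: int = 4) -> Counter:
    # Character n-grams that never cross a word boundary: each word is
    # padded with one space on both sides and n-grams are taken inside it.
    counts: Counter = Counter()
    for word in text.lower().split():
        padded = f" {word} "
        for n in range(min_n, max_n + 1):
            for start in range(len(padded) - n + 1):
                counts[padded[start:start + n]] += 1
    return counts


correct = "restaurant"
typo = "restaurnat"

# Shared features between the correctly spelled word and the typo.
shared_words = set(word_features(correct)) & set(word_features(typo))
shared_ngrams = set(char_wb_features(correct)) & set(char_wb_features(typo))

print(len(shared_words), len(shared_ngrams))
```

With word features, a misspelled word shares nothing with its correct form, while the character n-gram features still overlap heavily. That is one plausible reason the second featurizer helps with typos and unseen word forms, and why keeping both gives a more robust representation.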
To learn more about pipelines in general, have a look at our docs on Choosing a Pipeline.