Can we use both word and character in word count featurizer in rasa

pranav.pawar · October 5, 2021, 4:35pm

Can we use both word and character analyzers in the word count featurizer in the Rasa pipeline config file? Does that work without producing errors, and what exactly are its effects?

ChrisRahme · October 5, 2021, 4:42pm

Do you mean the CountVectorsFeaturizer? If that’s the case, then yes. In fact, this is present in the default pipeline:

pipeline:
   - name: WhitespaceTokenizer
   - name: RegexFeaturizer
   - name: LexicalSyntacticFeaturizer
   - name: CountVectorsFeaturizer
   - name: CountVectorsFeaturizer
     analyzer: char_wb
     min_ngram: 1
     max_ngram: 4
   - name: DIETClassifier
     epochs: 100
     constrain_similarities: true
   - name: EntitySynonymMapper
   - name: ResponseSelector
     epochs: 100
     constrain_similarities: true
   - name: FallbackClassifier
     threshold: 0.3
     ambiguity_threshold: 0.1

The first CountVectorsFeaturizer uses the default settings, which makes it a word analyzer. The second one is changed to be a character analyzer.

From the docs:

This featurizer can be configured to use word or character n-grams, using the analyzer configuration parameter. By default analyzer is set to word so word token counts are used as features. If you want to use character n-grams, set analyzer to char or char_wb.

pranav.pawar · October 6, 2021, 4:34am

But can we also assign n-grams to the default CountVectorsFeaturizer, which is the word analyzer, alongside the char_wb analyzer? And what is the actual impact? Can you please tell me?

ChrisRahme · October 6, 2021, 1:36pm

Yes, you can use n-grams for both word and char.
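As a rough sketch of what that could look like in config.yml (the n-gram ranges here are illustrative assumptions, not recommended values, and the pipeline is trimmed to the relevant components):

```yaml
# Sketch only: the n-gram ranges are assumptions chosen for illustration.
pipeline:
  - name: WhitespaceTokenizer
  # Word-level featurizer: "word" is the default analyzer, written out for clarity.
  # With min_ngram 1 and max_ngram 2 it counts single words and two-word sequences.
  - name: CountVectorsFeaturizer
    analyzer: word
    min_ngram: 1
    max_ngram: 2
  # Character-level featurizer: n-grams of 1 to 4 characters inside word boundaries.
  - name: CountVectorsFeaturizer
    analyzer: char_wb
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 100
```

In general, word bigrams let the model pick up short phrases such as "not good" as a single feature, at the cost of a larger sparse vocabulary, while the character n-grams tend to help with typos and words that were not seen during training. The features of both featurizers are combined before they reach the classifier.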

+---------------------------+-------------------------+--------------------------------------------------------------+
| Parameter                 | Default Value           | Description                                                  |
+===========================+=========================+==============================================================+
| use_shared_vocab          | False                   | If set to 'True' a common vocabulary is used for labels      |
|                           |                         | and user message.                                            |
+---------------------------+-------------------------+--------------------------------------------------------------+
| analyzer                  | word                    | Whether the features should be made of word n-gram or        |
|                           |                         | character n-grams. Option 'char_wb' creates character        |
|                           |                         | n-grams only from text inside word boundaries;               |
|                           |                         | n-grams at the edges of words are padded with space.         |
|                           |                         | Valid values: 'word', 'char', 'char_wb'.                     |
+---------------------------+-------------------------+--------------------------------------------------------------+
| strip_accents             | None                    | Remove accents during the pre-processing step.               |
|                           |                         | Valid values: 'ascii', 'unicode', 'None'.                    |
+---------------------------+-------------------------+--------------------------------------------------------------+
| stop_words                | None                    | A list of stop words to use.                                 |
|                           |                         | Valid values: 'english' (uses an internal list of            |
|                           |                         | English stop words), a list of custom stop words, or         |
|                           |                         | 'None'.                                                      |
+---------------------------+-------------------------+--------------------------------------------------------------+
| min_df                    | 1                       | When building the vocabulary ignore terms that have a        |
|                           |                         | document frequency strictly lower than the given threshold.  |
+---------------------------+-------------------------+--------------------------------------------------------------+
| max_df                    | 1                       | When building the vocabulary ignore terms that have a        |
|                           |                         | document frequency strictly higher than the given threshold  |
|                           |                         | (corpus-specific stop words).                                |
+---------------------------+-------------------------+--------------------------------------------------------------+
| min_ngram                 | 1                       | The lower boundary of the range of n-values for different    |
|                           |                         | word n-grams or char n-grams to be extracted.                |
+---------------------------+-------------------------+--------------------------------------------------------------+
| max_ngram                 | 1                       | The upper boundary of the range of n-values for different    |
|                           |                         | word n-grams or char n-grams to be extracted.                |
+---------------------------+-------------------------+--------------------------------------------------------------+
| max_features              | None                    | If not 'None', build a vocabulary that only consider the top |
|                           |                         | max_features ordered by term frequency across the corpus.    |
+---------------------------+-------------------------+--------------------------------------------------------------+
| lowercase                 | True                    | Convert all characters to lowercase before tokenizing.       |
+---------------------------+-------------------------+--------------------------------------------------------------+
| OOV_token                 | None                    | Keyword for unseen words.                                    |
+---------------------------+-------------------------+--------------------------------------------------------------+
| OOV_words                 | []                      | List of words to be treated as 'OOV_token' during training.  |
+---------------------------+-------------------------+--------------------------------------------------------------+
| alias                     | CountVectorFeaturizer   | Alias name of featurizer.                                    |
+---------------------------+-------------------------+--------------------------------------------------------------+
| use_lemma                 | True                    | Use the lemma of words for featurization.                    |
+---------------------------+-------------------------+--------------------------------------------------------------+

Option char_wb creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space. This option can be used to create Subword Semantic Hashing.

For more technical explanations, I suggest you read online articles or watch the Rasa Algorithm Whiteboard.
