Can we use both word and character analyzers with the count vectors featurizer in the Rasa pipeline config file? Does that work without producing errors, and what exactly are its effects?
Do you mean the CountVectorsFeaturizer? If that’s the case, then yes. In fact, this is present in the default pipeline:
pipeline:
  - name: WhitespaceTokenizer
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: char_wb
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 100
    constrain_similarities: true
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 100
    constrain_similarities: true
  - name: FallbackClassifier
    threshold: 0.3
    ambiguity_threshold: 0.1
The first CountVectorsFeaturizer uses the default settings, which makes it a word analyzer. The second one is configured as a character analyzer.
From the docs:
This featurizer can be configured to use word or character n-grams, using the analyzer configuration parameter. By default analyzer is set to word so word token counts are used as features. If you want to use character n-grams, set analyzer to char or char_wb.
But can we set n-grams on the default CountVectorsFeaturizer (the word analyzer) as well as on the char_wb one, and what is the actual impact? Can you please tell me?
Yes, you can use n-grams for both word and char.
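As a minimal sketch of such a setup (the word n-gram range below is an assumption chosen for illustration, not a recommended value), the first featurizer keeps the word analyzer and also extracts word bigrams, while the second keeps the char_wb settings from the default pipeline:

```yaml
pipeline:
  - name: WhitespaceTokenizer
  # Word-level featurizer. "word" is already the default analyzer;
  # it is spelled out here only for clarity.
  # max_ngram: 2 is an illustrative assumption: it adds word bigrams
  # on top of single-word counts.
  - name: CountVectorsFeaturizer
    analyzer: word
    min_ngram: 1
    max_ngram: 2
  # Character-level featurizer, as in the default pipeline:
  # n-grams of 1 to 4 characters taken inside word boundaries.
  - name: CountVectorsFeaturizer
    analyzer: char_wb
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 100
```

Both featurizers contribute their own sparse features, so the classifier sees word-level and subword-level counts together. A wider n-gram range generally means a larger vocabulary, so training can take longer and use more memory; whether it helps accuracy depends on your training data.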
+---------------------------+-------------------------+--------------------------------------------------------------+
| Parameter                 | Default Value           | Description                                                  |
+===========================+=========================+==============================================================+
| use_shared_vocab          | False                   | If set to 'True' a common vocabulary is used for labels      |
|                           |                         | and user message.                                            |
+---------------------------+-------------------------+--------------------------------------------------------------+
| analyzer                  | word                    | Whether the features should be made of word n-gram or        |
|                           |                         | character n-grams. Option 'char_wb' creates character        |
|                           |                         | n-grams only from text inside word boundaries;               |
|                           |                         | n-grams at the edges of words are padded with space.         |
|                           |                         | Valid values: 'word', 'char', 'char_wb'.                     |
+---------------------------+-------------------------+--------------------------------------------------------------+
| strip_accents             | None                    | Remove accents during the pre-processing step.               |
|                           |                         | Valid values: 'ascii', 'unicode', 'None'.                    |
+---------------------------+-------------------------+--------------------------------------------------------------+
| stop_words                | None                    | A list of stop words to use.                                 |
|                           |                         | Valid values: 'english' (uses an internal list of            |
|                           |                         | English stop words), a list of custom stop words, or         |
|                           |                         | 'None'.                                                      |
+---------------------------+-------------------------+--------------------------------------------------------------+
| min_df                    | 1                       | When building the vocabulary ignore terms that have a        |
|                           |                         | document frequency strictly lower than the given threshold.  |
+---------------------------+-------------------------+--------------------------------------------------------------+
| max_df                    | 1                       | When building the vocabulary ignore terms that have a        |
|                           |                         | document frequency strictly higher than the given threshold  |
|                           |                         | (corpus-specific stop words).                                |
+---------------------------+-------------------------+--------------------------------------------------------------+
| min_ngram                 | 1                       | The lower boundary of the range of n-values for different    |
|                           |                         | word n-grams or char n-grams to be extracted.                |
+---------------------------+-------------------------+--------------------------------------------------------------+
| max_ngram                 | 1                       | The upper boundary of the range of n-values for different    |
|                           |                         | word n-grams or char n-grams to be extracted.                |
+---------------------------+-------------------------+--------------------------------------------------------------+
| max_features              | None                    | If not 'None', build a vocabulary that only consider the top |
|                           |                         | max_features ordered by term frequency across the corpus.    |
+---------------------------+-------------------------+--------------------------------------------------------------+
| lowercase                 | True                    | Convert all characters to lowercase before tokenizing.       |
+---------------------------+-------------------------+--------------------------------------------------------------+
| OOV_token                 | None                    | Keyword for unseen words.                                    |
+---------------------------+-------------------------+--------------------------------------------------------------+
| OOV_words                 | []                      | List of words to be treated as 'OOV_token' during training.  |
+---------------------------+-------------------------+--------------------------------------------------------------+
| alias                     | CountVectorFeaturizer   | Alias name of featurizer.                                    |
+---------------------------+-------------------------+--------------------------------------------------------------+
| use_lemma                 | True                    | Use the lemma of words for featurization.                    |
+---------------------------+-------------------------+--------------------------------------------------------------+
Option char_wb creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space. This option can be used to create Subword Semantic Hashing.
For more technical explanations, I suggest you read online articles or watch the Rasa Algorithm Whiteboard.