Custom setting in CountVectorsFeaturizer pipeline?

Hi,

I was going through the CountVectorsFeaturizer section of the Components documentation.

Could you please explain what is the use of setting

  1. use_shared_vocab

  2. lowercase: true

  3. token_pattern: r'(?u)\b\w\w+\b'

  4. max_features: None # int or None

Please provide an example if possible.

pipeline:
- name: "CountVectorsFeaturizer"
  # whether to use a shared vocab
  "use_shared_vocab": False,
  # whether to use word or character n-grams
  # 'char_wb' creates character n-grams only inside word boundaries
  # n-grams at the edges of words are padded with space.
  analyzer: 'word'  # use 'char' or 'char_wb' for character
  # the parameters are taken from
  # sklearn's CountVectorizer
  # regular expression for tokens
  token_pattern: r'(?u)\b\w\w+\b'
  # remove accents during the preprocessing step
  strip_accents: None  # {'ascii', 'unicode', None}
  # list of stop words
  stop_words: None  # string {'english'}, list, or None (default)
  # min document frequency of a word to add to vocabulary
  # float - the parameter represents a proportion of documents
  # integer - absolute counts
  min_df: 1  # float in range [0.0, 1.0] or int
  # max document frequency of a word to add to vocabulary
  # float - the parameter represents a proportion of documents
  # integer - absolute counts
  max_df: 1.0  # float in range [0.0, 1.0] or int
  # set ngram range
  min_ngram: 1  # int
  max_ngram: 1  # int
  # limit vocabulary size
  max_features: None  # int or None
  # whether to convert all characters to lowercase
  lowercase: true  # bool
  # handling Out-Of-Vocabulary (OOV) words
  # will be converted to lowercase if lowercase is true
  OOV_token: None  # string or None
  OOV_words: []  # list of strings
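Most of these parameters are forwarded to scikit-learn's CountVectorizer, so you can experiment with them outside Rasa. The snippet below is a sketch of how token_pattern, lowercase, and max_features behave in plain scikit-learn (not Rasa itself):

```python
from sklearn.feature_extraction.text import CountVectorizer

# The default token_pattern r'(?u)\b\w\w+\b' only matches tokens of two or
# more word characters, so the single-letter "I" is dropped; lowercase=True
# folds "Hello" into "hello".
vec = CountVectorizer(token_pattern=r"(?u)\b\w\w+\b", lowercase=True)
vec.fit(["Hello I am feeling well"])
print(sorted(vec.vocabulary_))        # ['am', 'feeling', 'hello', 'well']

# max_features keeps only the N most frequent tokens in the corpus;
# None (the default) keeps everything.
vec_small = CountVectorizer(max_features=2)
vec_small.fit(["hello hello well well well am feeling"])
print(sorted(vec_small.vocabulary_))  # ['hello', 'well']
```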

Hey, I can help you with the first two.

The CountVectorsFeaturizer creates a bag-of-words vector of your training-data sentences and intent labels.

use_shared_vocab determines whether those two use the same representation, i.e. whether they are mapped to the same dimensions.

Here’s an example:

  • Training data: “Hello I am feeling well”
  • Intents: “greet”, “sentiment”

Now, your vocabulary is the entirety of tokens present in your training data. For simplicity’s sake, let’s use words as our tokens.

  1. A shared vocab would look like this: (hello, I, am, feeling, well, greet, sentiment) and therefore:
  • “Hello I am feeling well” ==> (1,1,1,1,1,0,0)
  • “greet” ==> (0,0,0,0,0,1,0)
  • “sentiment” ==> (0,0,0,0,0,0,1)
  2. Different vocabs would mean you have a (hello, I, am, feeling, well) vector for sentences and a (greet, sentiment) vector for intents, meaning
  • “Hello I am feeling well” ==> (1,1,1,1,1)
  • “greet” ==> (1,0)
  • “sentiment” ==> (0,1)
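This behaviour can be mimicked with plain scikit-learn, which Rasa uses under the hood; the snippet below is a sketch of the idea, not Rasa's actual implementation:

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["Hello I am feeling well"]
intents = ["greet", "sentiment"]

pattern = r"(?u)\b\w+\b"  # keep one-letter tokens such as "I"

# Shared vocab: fit a single vectorizer on sentences AND intent labels,
# so both are projected into the same 7 dimensions
# (am, feeling, greet, hello, i, sentiment, well).
shared = CountVectorizer(token_pattern=pattern)
shared.fit(sentences + intents)
print(shared.transform(intents).toarray())
# [[0 0 1 0 0 0 0]
#  [0 0 0 0 0 1 0]]

# Separate vocabs: one vectorizer per source, each with its own dimensions;
# intents now live in a 2-dimensional space of their own.
intent_vec = CountVectorizer(token_pattern=pattern).fit(intents)
print(intent_vec.transform(intents).toarray())
# [[1 0]
#  [0 1]]
```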

Lowercase means that everything is converted to lowercase, so in your vocabulary “Hello” and “hello” are treated as the same token:

(Unrelated to the first example)

  • lowercase: false “Hello hello there” ==> (1,1,1)
  • lowercase: true “Hello hello there” ==> “hello hello there” ==> (2,1)
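The same effect can be reproduced with scikit-learn's CountVectorizer directly (a sketch, using the "Hello hello there" example above):

```python
from sklearn.feature_extraction.text import CountVectorizer

doc = ["Hello hello there"]

# lowercase=False: "Hello" and "hello" stay distinct tokens, giving three
# vocabulary entries (Hello, hello, there), each counted once.
cased = CountVectorizer(lowercase=False).fit_transform(doc).toarray()
print(cased)    # [[1 1 1]]

# lowercase=True (the default): both merge into "hello", giving two entries
# (hello, there) with "hello" counted twice.
lowered = CountVectorizer(lowercase=True).fit_transform(doc).toarray()
print(lowered)  # [[2 1]]
```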

@EliDll Thank you for explaining in detail with examples. I am a beginner in ML/AI. Can you tell me what the pros and cons of using a shared vocab versus different vocabs in Rasa are?