Custom setting in CountVectorsFeaturizer pipeline?

Hi,

I was going through the CountVectorsFeaturizer section of the Components documentation.

Could you please explain what is the use of setting

  1. use_shared_vocab

  2. lowercase: true

  3. token_pattern: r'(?u)\b\w\w+\b'

  4. max_features: None # int or None

Please provide an example if possible.

pipeline:
- name: "CountVectorsFeaturizer"
  # whether to use a shared vocab
  "use_shared_vocab": False,
  # whether to use word or character n-grams
  # 'char_wb' creates character n-grams only inside word boundaries
  # n-grams at the edges of words are padded with space.
  analyzer: 'word'  # use 'char' or 'char_wb' for character
  # the parameters are taken from
  # sklearn's CountVectorizer
  # regular expression for tokens
  token_pattern: r'(?u)\b\w\w+\b'
  # remove accents during the preprocessing step
  strip_accents: None  # {'ascii', 'unicode', None}
  # list of stop words
  stop_words: None  # string {'english'}, list, or None (default)
  # min document frequency of a word to add to vocabulary
  # float - the parameter represents a proportion of documents
  # integer - absolute counts
  min_df: 1  # float in range [0.0, 1.0] or int
  # max document frequency of a word to add to vocabulary
  # float - the parameter represents a proportion of documents
  # integer - absolute counts
  max_df: 1.0  # float in range [0.0, 1.0] or int
  # set ngram range
  min_ngram: 1  # int
  max_ngram: 1  # int
  # limit vocabulary size
  max_features: None  # int or None
  # whether to convert all characters to lowercase
  lowercase: true  # bool
  # handling Out-Of-Vocabulary (OOV) words
  # will be converted to lowercase if lowercase is true
  OOV_token: None  # string or None
  OOV_words: []  # list of strings
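Most of these parameters are forwarded to scikit-learn's CountVectorizer, so you can experiment with them outside Rasa. The snippet below is a sketch of how token_pattern, lowercase, and max_features behave in plain scikit-learn (not Rasa itself):

```python
from sklearn.feature_extraction.text import CountVectorizer

# The default token_pattern r'(?u)\b\w\w+\b' only matches tokens of two or
# more word characters, so the single-letter "I" is dropped; lowercase=True
# folds "Hello" into "hello".
vec = CountVectorizer(token_pattern=r"(?u)\b\w\w+\b", lowercase=True)
vec.fit(["Hello I am feeling well"])
print(sorted(vec.vocabulary_))        # ['am', 'feeling', 'hello', 'well']

# max_features keeps only the N most frequent tokens in the corpus;
# None (the default) keeps everything.
vec_small = CountVectorizer(max_features=2)
vec_small.fit(["hello hello well well well am feeling"])
print(sorted(vec_small.vocabulary_))  # ['hello', 'well']
```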

Hey, I can help you with the first two.

The CountVectorsFeaturizer creates a bag-of-words vector of your training-data sentences and intent labels.

use_shared_vocab determines whether those two use the same representation, i.e. whether they are mapped to the same dimensions.

Here’s an example:

  • Training data: “Hello I am feeling well”
  • Intents: “greet”, “sentiment”

Now, your vocabulary is the entirety of tokens present in your training data. For simplicity’s sake, let’s use words as our tokens.

  1. A shared vocab would look like this: (hello, I, am, feeling, well, greet, sentiment) and therefore:
  • “Hello I am feeling well” ==> (1,1,1,1,1,0,0)
  • “greet” ==> (0,0,0,0,0,1,0)
  • “sentiment” ==> (0,0,0,0,0,0,1)
  2. Different vocabs would mean you have a (hello, I, am, feeling, well) vector for sentences and a (greet, sentiment) vector for intents, meaning
  • “Hello I am feeling well” ==> (1,1,1,1,1)
  • “greet” ==> (1,0)
  • “sentiment” ==> (0,1)
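This behaviour can be mimicked with plain scikit-learn, which Rasa uses under the hood; the snippet below is a sketch of the idea, not Rasa's actual implementation:

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["Hello I am feeling well"]
intents = ["greet", "sentiment"]

pattern = r"(?u)\b\w+\b"  # keep one-letter tokens such as "I"

# Shared vocab: fit a single vectorizer on sentences AND intent labels,
# so both are projected into the same 7 dimensions
# (am, feeling, greet, hello, i, sentiment, well).
shared = CountVectorizer(token_pattern=pattern)
shared.fit(sentences + intents)
print(shared.transform(intents).toarray())
# [[0 0 1 0 0 0 0]
#  [0 0 0 0 0 1 0]]

# Separate vocabs: one vectorizer per source, each with its own dimensions;
# intents now live in a 2-dimensional space of their own.
intent_vec = CountVectorizer(token_pattern=pattern).fit(intents)
print(intent_vec.transform(intents).toarray())
# [[1 0]
#  [0 1]]
```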

Lowercase means that everything is converted to lowercase, so in your vocabulary “Hello” and “hello” are treated as the same token:

(Unrelated to the first example)

  • lowercase: false “Hello hello there” ==> (1,1,1)
  • lowercase: true “Hello hello there” ==> “hello hello there” ==> (2,1)
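The same effect can be reproduced with scikit-learn's CountVectorizer directly (a sketch, using the "Hello hello there" example above):

```python
from sklearn.feature_extraction.text import CountVectorizer

doc = ["Hello hello there"]

# lowercase=False: "Hello" and "hello" stay distinct tokens, giving three
# vocabulary entries (Hello, hello, there), each counted once.
cased = CountVectorizer(lowercase=False).fit_transform(doc).toarray()
print(cased)    # [[1 1 1]]

# lowercase=True (the default): both merge into "hello", giving two entries
# (hello, there) with "hello" counted twice.
lowered = CountVectorizer(lowercase=True).fit_transform(doc).toarray()
print(lowered)  # [[2 1]]
```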

@EliDll Thank you for explaining in detail with examples. I am a beginner in ML/AI. Can you tell me what the pros and cons of using a shared vocab versus different vocabs in Rasa are?