Add word-level N-grams instead of character-level N-grams

Rasa provides an N-gram feature component by default, which works at the character level.

But I am looking for word-level N-gram feature extraction.

How can I achieve that? Or will I need to implement my own component for word-level N-gram feature extraction? If so, does the pipeline below look correct?

    - name: "WhitespaceTokenizer"
    - name: "Tri-GramFeature"  ## own component #rasa
    - name: "CRFEntityExtractor"  
    - name: "EntitySynonymMapper"
    - name: "CountVectorsFeaturizer"
    - name: "EmbeddingIntentClassifier"

Hey @narendraprasath, and welcome to the forum!

The CountVectorsFeaturizer supports word n-grams as well. Take a look at all the available options here. In particular, you can use the (default) word analyzer and set your desired n-gram minimum and maximum lengths like this:

    - name: "CountVectorsFeaturizer"
      analyzer: "word"
      min_ngram: 1
      max_ngram: 3
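
To make the effect of `min_ngram` and `max_ngram` concrete, here is a minimal pure-Python sketch of word-level n-gram extraction. It is only an illustration, not Rasa's implementation: the function name `word_ngrams` is hypothetical, and it assumes simple lowercasing and whitespace tokenization, so the exact tokens may differ from what the featurizer produces (for example, the underlying scikit-learn vectorizer drops single-character tokens by default).

```python
def word_ngrams(text, min_n=1, max_n=3):
    """Return every word n-gram of `text` with min_n <= n <= max_n.

    Hypothetical helper for illustration only; tokens come from
    lowercasing and splitting on whitespace.
    """
    tokens = text.lower().split()
    grams = []
    # Slide a window of each allowed size across the token list.
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


if __name__ == "__main__":
    for gram in word_ngrams("book a table for two"):
        print(gram)
```

With `min_n=1` and `max_n=3`, a five-word message yields five unigrams, four bigrams, and three trigrams. The featurizer then counts how often each such n-gram occurs in a message to build its bag-of-words vector.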

Does this answer your question? Also, the upcoming Rasa summit is a cool opportunity to meet Rasa contributors, creators and users, and discuss anything Rasa-related :slight_smile:


Thanks @SamS

The answer sounds really good. It solves my problem.

Would this work for pre-trained embeddings, or just for supervised embeddings?