Prediction confusion

Hi all,

I am using the French model and I have noticed a kind of confusion between predictions for sentences with and without stop words. I don't understand the cause of this confusion. More explanation below.

I have added stop_words for "CountVectorsFeaturizer"; the stop words are ['est', 'ce', 'que', 'tu']. I then ran these tests:

  • Est-ce que tu peux m’aider → gives intent_cant_help
  • Tu peux m’aider → gives intent_bot_notice
  • aider → gives intent_bot_notice

I cannot understand this difference, since all the stop words should normally be removed in the pipeline before prediction.

For more details, below is my configuration:

language: fr

pipeline:
  - name: WhitespaceTokenizer
    token_pattern: (?u)\b\w+\b
  - name: CRFEntityExtractor
  - name: EntitySynonymMapper
  - name: CountVectorsFeaturizer
    analyzer: "word"
    stop_words: ['je', 'veux', 'souhaite', 'savoir', 'voudrais', 'il', 'elle', 'aimerai', 'aimerais', 'devrais', 'pourrais',  'vais', 'aime','alors','au','aucuns','aussi','autre','avant','avec','avoir','bon','car','ce','cela','ces','ceux','chaque','ci','comme','comment','dans','des','du','dedans','dehors','depuis','devrait','doit','donc','dos','début','elles','en','encore','essai','est','et','eu','fait','faites','fois','font','hors','ici','ils','juste','la','le','les','leurs','là','ma','maintenant','mais','mes','mien','moins','mon','même','ni','notre','nous','ou','où','par','parce','pas','peut','peu','plupart','pour','pourquoi','quand','que','quel','quelle','quels','quelles','qui','sa','sans','ses','seulement','si','sien','sont','son','sous','soyez','sur','ta','tandis','tellement','tels','tes','ton','tous','tout','trop','très','tu','voient','vont','votre','vous','vu','ça','étaient','été','être','a', 'à', 'pouvez', 'suis', '!', '?', '.', ':','au', 'aux', 'avec', 'ce', 'ces', 'dans', 'de', 'des', 'du', 'elle', 'en', 'et', 'eux', 'il', 'ils', 'je', 'la', 'le', 'les', 'leur', 'lui', 'ma', 'mais', 'me', 'même', 'mes', 'moi', 'mon', 'ne', 'nos', 'notre', 'nous', 'on', 'ou', 'par', 'pas', 'pour', 'qu', 'que', 'qui', 'sa', 'se', 'ses', 'son', 'sur', 'ta', 'te', 'tes', 'toi', 'ton', 'tu', 'un', 'une', 'vos', 'votre', 'vous', 'c', 'd', 'j', 'l', 'à', 'm', 'n', 's', 't', 'y', 'été', 'étée', 'étées', 'étés', 'étant', 'étante', 'étants', 'étantes', 'suis', 'es', 'est', 'sommes', 'êtes', 'sont', 'serai', 'seras', 'sera', 'serons', 'serez', 'seront', 'serais', 'serait', 'serions', 'seriez', 'seraient', 'étais', 'était', 'étions', 'étiez', 'étaient', 'fus', 'fut', 'fûmes', 'fûtes', 'furent', 'sois', 'soit', 'soyons', 'soyez', 'soient', 'fusse', 'fusses', 'fût', 'fussions', 'fussiez', 'fussent', 'ayant', 'ayante', 'ayantes', 'ayants', 'eu', 'eue', 'eues', 'eus', 'ai', 'as', 'avons', 'avez', 'ont', 'aurai', 'auras', 'aura', 'aurons', 'aurez', 'auront', 'aurais', 'aurait', 'aurions', 
'auriez', 'auraient', 'avais', 'avait', 'avions', 'aviez', 'avaient', 'eut', 'eûmes', 'eûtes', 'eurent', 'aie', 'aies', 'ait', 'ayons', 'ayez', 'aient', 'eusse', 'eusses', 'eût', 'eussions', 'eussiez', 'eussent', 'aimer', 'vouloir', 'quoi', 'pouvoir', 'devoir', 'chez', 'svp', 'stp', 'pense','parmi', 'dans', 'ceci', 'etant', 'parceque', 'tiens', 'celui', 'là', 'sait', 'via', 'voilà', 'sinon', 'suivant', 'pu', 'auprès', 'soi', 'même', 'etais', 'celle', 'ci', 'donc', 'alors', 'depuis', 'soit', 'soient', 'près', ]
  - name: DIETClassifier
    epochs: 200
    entity_recognition: False
    random_seed: 7777777
  - name: FallbackClassifier
    threshold: 0.8



policies:
 - name: RulePolicy
   core_fallback_threshold: 0.3
   core_fallback_action_name: 'action_default_fallback'
   enable_fallback_prediction: True

@Asmacats can I ask why you need to mention stop_words? A generic NLP pipeline will automatically remove the stop words when it creates the vocabulary (Rasa algorithms).

Please refer to the Rasa docs for more information:

There is no information regarding stop words in the pipeline docs, or am I missing something?

Hello @nik202. First of all, I would like to thank you for your response.

I have seen rasa code https://github.com/RasaHQ/rasa/blob/main/rasa/nlu/featurizers/sparse_featurizer/count_vectors_featurizer.py#L68

stop_words is None by default. We can add our own custom stop words; for example, I have added French words expressing a desire ('would', 'aim', 'wish', 'desire', ...).

As for my problem, I have found someone else who has the same issue.

For information: after I added stop words, my components\train_CountVectorsFeaturizer5/vocabularies.pkl file does not contain any stop words. This proves that the stop words have indeed been excluded. But when a user sends a message, it still does not work correctly. I can't find out how to debug Rasa so that it shows more details about the vectorization and classification of the user message.
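Here is how I checked the vocabulary, a minimal sketch using `pickle` from the standard library. The real structure of vocabularies.pkl is a Rasa-internal detail and may differ between versions, so I simulate the payload here to keep the snippet self-contained:

```python
import io
import pickle

# Simulated payload standing in for vocabularies.pkl; the actual file's
# structure is Rasa-internal and may differ between versions (assumption:
# it ultimately holds a token -> index mapping).
vocab = {"ne": 0, "peux": 1, "pas": 2, "aider": 3}
buf = io.BytesIO()
pickle.dump(vocab, buf)
buf.seek(0)

# Load it back the same way you would load the real file
loaded = pickle.load(buf)

# Confirm that none of the configured stop words made it into the vocabulary
stop_words = {"est", "ce", "que", "tu"}
print(stop_words.isdisjoint(loaded))  # True: stop words were excluded
```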

Any more information please ?

@Asmacats alright, strange.

Can you also try mentioning the n-grams in the pipeline?


min_ngram: 2
max_ngram: 2

Also, try deleting the old trained model and re-training?

Also, please see this page on pipelines from the start: Tuning Your NLU Model

@Asmacats can I ask one simple question: why do you want to remove the stop words?

@Asmacats please see this video: NLP 4 Developers: Stop Words | Rasa - YouTube

Hello @nik202, that did not change anything! But I am beginning to understand the problem.

Let stop_words = ['tu'] and user_message = "tu ne peux pas aider" → the intent is cant_help with confidence 0.8058311343193054

and let user_message_2 = "ne peux pas aider" → the intent is Brad_notice with confidence 0.5921804308891296 (wrong intent)

Let us now map "tu" to another word: user_message_3 = "aa ne peux pas aider" → the intent is cant_help with confidence 0.8058311343193054

So my conclusion is that the number of words in the sentence impacts the classification quality. I think the stop words are ignored as features, but Rasa keeps track of the fact that a word was there and was ignored. So mapping this word to another one does not change the prediction, but some word has to be there.
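This can be illustrated with a toy sketch (not Rasa's actual implementation). The assumption is that the vocabulary is built after stop-word removal, but the tokenizer still emits one token, and therefore one feature row, per word, so a stop word becomes an all-zero row rather than disappearing from the sequence:

```python
# Vocabulary built after stop-word removal ('tu' is not in it)
VOCAB = ["ne", "peux", "pas", "aider"]

def featurize(message):
    tokens = message.lower().split()
    # one row per token; stop words and unknown words get all-zero rows
    return [[1 if tok == word else 0 for word in VOCAB] for tok in tokens]

seq_with_stop = featurize("tu ne peux pas aider")
seq_without = featurize("ne peux pas aider")
seq_mapped = featurize("aa ne peux pas aider")

# Sequence lengths differ with/without the stop word: 5 vs 4 rows
print(len(seq_with_stop), len(seq_without), len(seq_mapped))  # 5 4 5

# 'tu' and 'aa' both produce an all-zero first row, so the two sequences
# are identical, which would explain the exact same confidence for both
print(seq_with_stop == seq_mapped)  # True
```

Under this model, removing the stop word entirely changes the sequence length seen by DIET, while replacing it with any out-of-vocabulary word does not, matching the behavior above.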

do you think that’s right?

@Asmacats right.

For me, if I'm using word embeddings, then I exclude the stop words, as it refines my vocabulary when the corpus size is large. But in your case I am a bit confused: the featurizer takes the n-grams into consideration based on the training examples, and it has the complete vocabulary before and after the removal of stop words. I have never applied such a use case, so I am also a bit confused. Even so, you mentioned the min and max ngram, right? Did you notice any change after that? How many training examples do you have in the NLU data per intent?

@nik202, there is no change after adding min_ngram and max_ngram.

I have around 20-30 examples per intent.

I think Rasa just keeps the sentence's word count, with stop words still counted.

I don't fully understand the mechanism and I have found no explanation. Is it reasonable to open an issue?

I have also noticed that case_sensitive does not work correctly (I have added it to my pipeline). The prediction differs when the user types in uppercase versus lowercase.
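One way this can happen: if tokens are not lowercased before the vocabulary lookup, "Aider" and "aider" are different strings, and the capitalized form falls out of the vocabulary just like a stop word. A minimal sketch of that effect (a toy model, not Rasa's code):

```python
# Vocabulary stored in lowercase, as built from lowercased training data
VOCAB = ["ne", "peux", "pas", "aider"]

def featurize(message, lowercase=True):
    # with case folding, tokens are normalized before the vocabulary lookup
    tokens = message.lower().split() if lowercase else message.split()
    return [[1 if tok == word else 0 for word in VOCAB] for tok in tokens]

# With case folding, 'Aider' matches the vocabulary entry...
print(featurize("Aider")[0])                   # [0, 0, 0, 1]
# ...without it, the capitalized token gets an all-zero row, so the
# classifier sees different features for the same word
print(featurize("Aider", lowercase=False)[0])  # [0, 0, 0, 0]
```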