Prediction confusion

Hi all,

I am using the French model and I have noticed a kind of confusion between predictions for sentences with and without stop words. I don't understand the cause of this confusion. More explanation below.

I have added stop_words for "CountVectorsFeaturizer"; the stop words are ['est', 'ce', 'que', 'tu']. I then ran these tests:

  • Est-ce que tu peux m’aider → gives intent_cant_help
  • Tu peux m’aider → gives intent_bot_notice
  • aider → gives intent_bot_notice

I cannot understand this difference, since all the stop words should normally be removed in the pipeline before prediction.

For more details, below is my configuration:

language: fr

pipeline:
  - name: WhitespaceTokenizer
    token_pattern: (?u)\b\w+\b
  - name: CRFEntityExtractor
  - name: EntitySynonymMapper
  - name: CountVectorsFeaturizer
    analyzer: "word"
    stop_words: ['je', 'veux', 'souhaite', 'savoir', 'voudrais', 'il', 'elle', 'aimerai', 'aimerais', 'devrais', 'pourrais',  'vais', 'aime','alors','au','aucuns','aussi','autre','avant','avec','avoir','bon','car','ce','cela','ces','ceux','chaque','ci','comme','comment','dans','des','du','dedans','dehors','depuis','devrait','doit','donc','dos','début','elles','en','encore','essai','est','et','eu','fait','faites','fois','font','hors','ici','ils','juste','la','le','les','leurs','là','ma','maintenant','mais','mes','mien','moins','mon','même','ni','notre','nous','ou','où','par','parce','pas','peut','peu','plupart','pour','pourquoi','quand','que','quel','quelle','quels','quelles','qui','sa','sans','ses','seulement','si','sien','sont','son','sous','soyez','sur','ta','tandis','tellement','tels','tes','ton','tous','tout','trop','très','tu','voient','vont','votre','vous','vu','ça','étaient','été','être','a', 'à', 'pouvez', 'suis', '!', '?', '.', ':','au', 'aux', 'avec', 'ce', 'ces', 'dans', 'de', 'des', 'du', 'elle', 'en', 'et', 'eux', 'il', 'ils', 'je', 'la', 'le', 'les', 'leur', 'lui', 'ma', 'mais', 'me', 'même', 'mes', 'moi', 'mon', 'ne', 'nos', 'notre', 'nous', 'on', 'ou', 'par', 'pas', 'pour', 'qu', 'que', 'qui', 'sa', 'se', 'ses', 'son', 'sur', 'ta', 'te', 'tes', 'toi', 'ton', 'tu', 'un', 'une', 'vos', 'votre', 'vous', 'c', 'd', 'j', 'l', 'à', 'm', 'n', 's', 't', 'y', 'été', 'étée', 'étées', 'étés', 'étant', 'étante', 'étants', 'étantes', 'suis', 'es', 'est', 'sommes', 'êtes', 'sont', 'serai', 'seras', 'sera', 'serons', 'serez', 'seront', 'serais', 'serait', 'serions', 'seriez', 'seraient', 'étais', 'était', 'étions', 'étiez', 'étaient', 'fus', 'fut', 'fûmes', 'fûtes', 'furent', 'sois', 'soit', 'soyons', 'soyez', 'soient', 'fusse', 'fusses', 'fût', 'fussions', 'fussiez', 'fussent', 'ayant', 'ayante', 'ayantes', 'ayants', 'eu', 'eue', 'eues', 'eus', 'ai', 'as', 'avons', 'avez', 'ont', 'aurai', 'auras', 'aura', 'aurons', 'aurez', 'auront', 'aurais', 'aurait', 'aurions', 
'auriez', 'auraient', 'avais', 'avait', 'avions', 'aviez', 'avaient', 'eut', 'eûmes', 'eûtes', 'eurent', 'aie', 'aies', 'ait', 'ayons', 'ayez', 'aient', 'eusse', 'eusses', 'eût', 'eussions', 'eussiez', 'eussent', 'aimer', 'vouloir', 'quoi', 'pouvoir', 'devoir', 'chez', 'svp', 'stp', 'pense','parmi', 'dans', 'ceci', 'etant', 'parceque', 'tiens', 'celui', 'là', 'sait', 'via', 'voilà', 'sinon', 'suivant', 'pu', 'auprès', 'soi', 'même', 'etais', 'celle', 'ci', 'donc', 'alors', 'depuis', 'soit', 'soient', 'près', ]
  - name: DIETClassifier
    epochs: 200
    entity_recognition: False
    random_seed: 7777777
  - name: FallbackClassifier
    threshold: 0.8



policies:
 - name: RulePolicy
   core_fallback_threshold: 0.3
   core_fallback_action_name: 'action_default_fallback'
   enable_fallback_prediction: True

@Asmacats can I ask why you need to mention stop_words? A generic NLP pipeline will automatically remove the stop words when it creates the vocabulary (Rasa algorithms).

Please refer to the Rasa docs for more information:

There is no information regarding stop words in the pipeline docs, or am I missing something?

Hello @nik202. First of all, I would like to thank you for your response.

I have seen rasa code https://github.com/RasaHQ/rasa/blob/main/rasa/nlu/featurizers/sparse_featurizer/count_vectors_featurizer.py#L68

stop_words is None by default. We can add our own custom stop words; for example, I have added French words expressing a desire ('would', 'aim', 'wish', 'desire', ...).

As for my problem, I have found someone else who has the same issue.

For information: after I added stop words, my components\train_CountVectorsFeaturizer5/vocabularies.pkl file does not contain any stop words. This proves that the stop words have indeed been excluded. But when a user sends a message, it still does not work correctly. I can't find out how to debug Rasa so that it shows more details about the vectorization and classification of the user message.
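Here is how I checked the vocabulary, a minimal sketch using `pickle` from the standard library. The real structure of vocabularies.pkl is a Rasa-internal detail and may differ between versions, so I simulate the payload here to keep the snippet self-contained:

```python
import io
import pickle

# Simulated payload standing in for vocabularies.pkl; the actual file's
# structure is Rasa-internal and may differ between versions (assumption:
# it ultimately holds a token -> index mapping).
vocab = {"ne": 0, "peux": 1, "pas": 2, "aider": 3}
buf = io.BytesIO()
pickle.dump(vocab, buf)
buf.seek(0)

# Load it back the same way you would load the real file
loaded = pickle.load(buf)

# Confirm that none of the configured stop words made it into the vocabulary
stop_words = {"est", "ce", "que", "tu"}
print(stop_words.isdisjoint(loaded))  # True: stop words were excluded
```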

Any more information please ?

@Asmacats alright, strange.

Can you also try mentioning the n-grams in the pipeline?


min_ngram: 2
max_ngram: 2

Also, try deleting the old trained model and re-training?

Also, please see this page on pipelines from the start: Tuning Your NLU Model

@Asmacats can I ask one simple question: why do you want to remove the stop words?

@Asmacats please see this video: NLP 4 Developers: Stop Words | Rasa - YouTube

Hello @nik202, that did not change anything! But I am beginning to understand the problem.

Let stop_words = ['tu'] and user_message = "tu ne peux pas aider" → the intent is cant_help with confidence 0.8058311343193054

and let user_message_2 = "ne peux pas aider" → the intent is Brad_notice with confidence 0.5921804308891296 (wrong intent)

Let us now map "tu" to another word: user_message_3 = "aa ne peux pas aider" → the intent is cant_help with confidence 0.8058311343193054

So my conclusion is that the number of words in the sentence impacts the classification quality. I think the stop words are ignored as features, but Rasa keeps track of the fact that a word was there and was ignored. So mapping this word to another one does not change the prediction, but some word has to be there.
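This can be illustrated with a toy sketch (not Rasa's actual implementation). The assumption is that the vocabulary is built after stop-word removal, but the tokenizer still emits one token, and therefore one feature row, per word, so a stop word becomes an all-zero row rather than disappearing from the sequence:

```python
# Vocabulary built after stop-word removal ('tu' is not in it)
VOCAB = ["ne", "peux", "pas", "aider"]

def featurize(message):
    tokens = message.lower().split()
    # one row per token; stop words and unknown words get all-zero rows
    return [[1 if tok == word else 0 for word in VOCAB] for tok in tokens]

seq_with_stop = featurize("tu ne peux pas aider")
seq_without = featurize("ne peux pas aider")
seq_mapped = featurize("aa ne peux pas aider")

# Sequence lengths differ with/without the stop word: 5 vs 4 rows
print(len(seq_with_stop), len(seq_without), len(seq_mapped))  # 5 4 5

# 'tu' and 'aa' both produce an all-zero first row, so the two sequences
# are identical, which would explain the exact same confidence for both
print(seq_with_stop == seq_mapped)  # True
```

Under this model, removing the stop word entirely changes the sequence length seen by DIET, while replacing it with any out-of-vocabulary word does not, matching the behavior above.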

do you think that’s right?

@Asmacats right.

For me, if I'm using word embeddings, then I exclude the stop words, as it refines my vocabulary when the corpus size is large. But in your case I am a bit confused: the featurizer takes the n-grams into consideration based on the training examples, and it has the complete vocabulary before and after the removal of stop words. I have never applied such a use case, so I am also a bit confused. Even so, you mentioned the min and max ngram, right? Did you notice any change after that? How many training examples do you have in the NLU data per intent?

@nik202, there is no change after adding min_ngram and max_ngram.

I have around 20-30 examples per intent.

I think Rasa just keeps the sentence's word count, with stop words still counted.

I don't fully understand the mechanism and I have found no explanation. Is it reasonable to open an issue?

I have also noticed that case_sensitive does not work correctly (I have added it to my pipeline). The prediction differs when the user types in uppercase versus lowercase.
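One way this can happen: if tokens are not lowercased before the vocabulary lookup, "Aider" and "aider" are different strings, and the capitalized form falls out of the vocabulary just like a stop word. A minimal sketch of that effect (a toy model, not Rasa's code):

```python
# Vocabulary stored in lowercase, as built from lowercased training data
VOCAB = ["ne", "peux", "pas", "aider"]

def featurize(message, lowercase=True):
    # with case folding, tokens are normalized before the vocabulary lookup
    tokens = message.lower().split() if lowercase else message.split()
    return [[1 if tok == word else 0 for word in VOCAB] for tok in tokens]

# With case folding, 'Aider' matches the vocabulary entry...
print(featurize("Aider")[0])                   # [0, 0, 0, 1]
# ...without it, the capitalized token gets an all-zero row, so the
# classifier sees different features for the same word
print(featurize("Aider", lowercase=False)[0])  # [0, 0, 0, 0]
```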