NLU detects random input with wrong intent and high confidence

When I query the server, I get:

curl 'localhost:5000/parse?q=qwerty&project=current&model=nlu' | python -m json.tool
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   165    0   165    0     0    138      0 --:--:--  0:00:01 --:--:--   138
{
    "entities": [],
    "intent": {
        "confidence": 0.0,
        "name": null
    },
    "intent_ranking": [],
    "model": "nlu",
    "project": "current",
    "text": "qwerty"
}
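
For reference, the same request from Python. This is a minimal sketch assuming the rasa_nlu HTTP server is running on localhost:5000 as in the curl call above; the helper name is just for illustration:

# Minimal sketch: query the /parse endpoint shown in the curl call above.
# Assumes the rasa_nlu HTTP server is running on localhost:5000.
import requests

def parse(text, project="current", model="nlu"):
    # Same query parameters as the curl call: q, project, model
    resp = requests.get(
        "http://localhost:5000/parse",
        params={"q": text, "project": project, "model": model},
    )
    resp.raise_for_status()
    return resp.json()

print(parse("qwerty")["intent"])  # e.g. {'confidence': 0.0, 'name': None}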

It worked in a virtual environment.

Then you are not using 0.13.1 in your main environment.

The main environment also shows the same version, but it does not work as shown above. When I created a virtual environment, it worked as shown above.

Did you try to reinstall rasa_nlu?

I just upgraded it. It seems I will now need to reinstall it.

I am facing a slightly different issue. When I input single words that have never been used in the training data, I get zero confidence. When I ask some random genuine question outside my bot’s context, it gets classified under the intent “inform” (inform here is my intent for questions about my bot’s domain) with high confidence (70-80%). I am using the tensorflow_embedding pipeline (TensorFlow 0.10.0).

I face a similar issue. If I just use single words like “at” that appear more often in one intent, then the input is classified as that intent. That is of course fine! But it comes back with a very high confidence, like 0.97. I would rather it were low, so a fallback threshold could catch it. Why is the confidence so high when you train on sentences with many words and then query single words like “at”?

So it seems the tensorflow embedding classifier misses some relative-importance aspect here (relative importance against the count of trained words)?

@Abir 70-80% is actually relatively low; you should be able to handle that with the fallback policy. @datistiquo Hm, if the word “at” is in your training data a lot, then I’m not surprised that the confidence is quite high. It doesn’t explicitly take into account the relative importance against the count of trained words.
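
For reference, a minimal client-side sketch of that kind of confidence gating on a parse result. The threshold value and the fallback intent name are only illustrative; this is not Rasa’s actual FallbackPolicy implementation:

# Illustrative sketch of confidence gating on a parse result.
# The 0.75 threshold and the fallback intent name are hypothetical examples,
# not Rasa's actual FallbackPolicy implementation.
FALLBACK_THRESHOLD = 0.75
FALLBACK_INTENT = "out_of_scope"

def intent_or_fallback(parse_result, threshold=FALLBACK_THRESHOLD):
    intent = parse_result.get("intent") or {}
    if intent.get("name") is None or intent.get("confidence", 0.0) < threshold:
        return FALLBACK_INTENT
    return intent["name"]

print(intent_or_fallback({"intent": {"name": "inform", "confidence": 0.72}}))  # out_of_scope
print(intent_or_fallback({"intent": {"name": "inform", "confidence": 0.91}}))  # inform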

Yes… but what I was facing was that many contextual questions were also predicted with a confidence of 70-80%. So you can see that if I simply set a fallback threshold of, say, 0.75, it would filter out those contextual questions as well. But no worries, I found a fix.

@akelad
I kinda solved it. What I did was add a JSON file containing all the out-of-scope questions (assigned to some intent, say out_of_bound), and I don’t use “spacy_sklearn” or “tensorflow_embedding” directly; rather, I selectively use the underlying components and build my own pipeline. I used the following and it solved the problem to some extent (not yet validated using cross-validation, but it seems to be working fine).

language: "en"

pipeline:
 - name: "nlp_spacy"
 - name: "tokenizer_spacy"
 - name: "ner_crf"
 - name: "ner_synonyms"   # you may remove this if you have not added any synonym data
 - name: "intent_featurizer_count_vectors"
 - name: "intent_classifier_tensorflow_embedding"

This nlu_config saved me for the time being. I would appreciate it if you could cross-validate it and let me know the accuracy :slight_smile:
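
For anyone who wants to try this, a rough sketch of training and querying a model with a custom pipeline like this through the rasa_nlu Python API of that era (roughly 0.13/0.14); the file paths here are hypothetical:

# Rough sketch, assuming the rasa_nlu 0.13/0.14-era Python API; file paths are hypothetical.
from rasa_nlu import config
from rasa_nlu.model import Trainer
from rasa_nlu.training_data import load_data

training_data = load_data("data/nlu.md")          # training examples, including the out_of_bound intent
trainer = Trainer(config.load("nlu_config.yml"))  # the custom pipeline shown above
interpreter = trainer.train(training_data)

print(interpreter.parse("some random out-of-scope question"))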


That’s interesting that that works better :smiley: But cool, glad you found a solution.

Could you explain this more? The config below seems like a normal one. And how do you select a proper pipeline for a given situation?

Hey there. So basically I used spaCy’s tokenizer and TensorFlow’s classifier along with ner_crf, unlike the traditional predefined pipelines. You may have a look in the docs; they aren’t entirely the same as the normal ones. The answer to your second question is just permutation and combination: try out a few pipeline configurations and evaluate your model. At least, that is how I did it. Try out mine and see if it helps you in any way, and feel free to play around with the nlu config.

Facing the same problem here. For garbage inputs like “asfdf” or “fdsgdf” it’s recognising an intent with 0.95 confidence.


We were having the same issue: nonsensical short input (most notably, a single digit) produced high-confidence (0.90+) intent matches. Since the main problem was digits, I decided to filter them out of the intent featurizer by editing the default token pattern regex ('(?u)\b\w\w+\b') to exclude numbers. The idea was that we’d sacrifice tokens containing numbers (of which we have none as of now) in exchange for better accuracy.

The results exceeded our hopes. Somehow, not only were the single-digit matches gone, but all non-alphanumeric input (like @harshitazilen described) also started firmly returning a null intent. Most (but not all) short nonsensical alpha-only strings also started resulting in "intent": { "name": null, "confidence": 0.0 }.

I realize that this is strange, given that the old expression was already filtering out non-alphanumeric tokens, and nothing has changed in regard to alpha-only strings, but the change helped us, so I thought I’d share it. Your mileage may vary.

Our NLUs run in EKS pods running Docker image rasa/rasa_nlu:0.14.4-full, but experiments with different versions (older, newer, and just Python modules on desktops) yielded similar results. The training base contains about 250 single and 250 dual intents with 10-100 sample utterances for each. This is our new pipeline with the edited token pattern:

pipeline:
 - name: "intent_featurizer_count_vectors"
   token_pattern: '(?u)\b[a-zA-Z][a-zA-Z]+\b'
 - name: "intent_classifier_tensorflow_embedding"
   intent_tokenization_flag: true
   intent_split_symbol: "+"
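
To see what the edited pattern changes at the token level, here is a plain-Python comparison of the two regexes (the sample string is made up); this only illustrates tokenization, not the featurizer itself:

# Plain-Python comparison of the default and edited token patterns.
# The sample string is made up; this only illustrates tokenization, not the featurizer.
import re

default_pattern = r"(?u)\b\w\w+\b"             # default: two or more word characters, digits included
edited_pattern = r"(?u)\b[a-zA-Z][a-zA-Z]+\b"  # edited: two or more ASCII letters only

text = "restart prod 42 riggdp"
print(re.findall(default_pattern, text))  # ['restart', 'prod', '42', 'riggdp']
print(re.findall(edited_pattern, text))   # ['restart', 'prod', 'riggdp']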

Could you please share an example?

Before:

{"q":"riggdp", "project":"test", "model":"intents-main"}
{
  "intent": {
    "name": "RigDWelcome",
    "confidence": 0.796666145324707
  }
}
{"q":"umo", "project":"test", "model":"intents-main"}
{
  "intent": {
    "name": "RigDWelcome",
    "confidence": 0.7992273569107056
  }
}

After:

{"q":"riggdp", "project":"test", "model":"intents-main"}
{
  "intent": {
    "name": null,
    "confidence": 0.0
}
}  
{"q":"umo", "project":"test", "model":"intents-main"}
{
  "intent": {
    "name": null,
    "confidence": 0.0
}  
}

I took a closer look at the training data, and these strings were apparently generated by our data augmentation task as example entity values for a different intent. After I changed the featurizer regex to filter out digits, I ran a couple of training sessions and evaluated the models with the same test file that was giving me those alpha-only false positives (among many others). My main concern was the numeric (esp. single-digit) false positives, but I noticed that the alpha false positives had also disappeared.

I now realize that the reason for that was my use of the same test file: the newly augmented training data did not contain those specific values, so they were never matched. It only appeared that the alpha-only false positives had gone away. A seemingly random example entity value from the training data could still “hijack” a completely unrelated intent, and the NLU would return a 0.83-confidence match when a single word like “prod” is entered: “Which system do you want to restart?” “Prod” “Hello and welcome! How can I help you?” :grinning:

I have since removed all entities from the TF model training data (rendering my regex hack moot), for a minor hit in recall and a major improvement in precision. The entities are extracted in a secondary, intent-specific model.

## intent: RigDWelcome
- What's up
- ahoy
- good afternoon
- goodmorning
- good evening
- hi there
- hey rigd
- Hello
- Hi
- Hey
- Howdy
- good morning
- goodafternoon
- goodevening
- hello there
- morning
- good day
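
A rough sketch of the two-stage setup described above, using the HTTP /parse endpoint from earlier in the thread; the entity-model naming scheme here is hypothetical:

# Rough sketch of the two-stage setup: one intent-only model, then an
# intent-specific model for entities. The "entities-<intent>" naming is hypothetical.
import requests

NLU_URL = "http://localhost:5000/parse"

def classify_intent(text):
    # Stage 1: intent-only model (entities removed from its training data)
    r = requests.get(NLU_URL, params={"q": text, "project": "test", "model": "intents-main"})
    return r.json()["intent"]

def extract_entities(text, intent_name):
    # Stage 2: secondary, intent-specific model used only for entity extraction
    r = requests.get(NLU_URL, params={"q": text, "project": "test", "model": "entities-" + intent_name})
    return r.json()["entities"]

intent = classify_intent("restart prod")
entities = extract_entities("restart prod", intent["name"]) if intent["name"] else []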

Thank you for the details. As far as I understand, it works now?

I am having a similar problem. A test user purposely entered the nonsense phrase: “soap on a rope.” The classifier classified it as mood_unhappy with a 99% confidence level!

The mood_unhappy intent had the word “rope” in one example, and “rope” does not appear in any other intent. “soap” does not appear in any training examples, and “on” and “a” appear in many training examples.

So, presumably it was the “rope” token that caused the classifier to classify the intent as mood_unhappy. But why 99% confidence?

I have repeated these results with other nonsense phrases where a single word appears in only one intent, and that intent is predicted with 95% plus confidence.

Any suggestions on what I should do?
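
As a toy illustration only (not the actual embedding classifier), here is how a softmax-style normalization over similarity scores can turn a single moderately discriminative token like “rope” into a near-1.0 confidence:

# Toy illustration only; this is not how the tensorflow embedding classifier
# actually computes its scores. It just shows how a softmax-style normalization
# inflates a modest margin into a near-1.0 confidence.
import math

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw similarity scores: "rope" only ever appears in mood_unhappy
# examples, so that intent scores noticeably higher than every other intent.
scores = {"mood_unhappy": 6.0, "greet": 1.0, "inform": 0.5, "goodbye": 0.2}
probs = softmax(list(scores.values()))
print({name: round(p, 3) for name, p in zip(scores, probs)})
# {'mood_unhappy': 0.986, 'greet': 0.007, 'inform': 0.004, 'goodbye': 0.003}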
