Hi community, I am confused about NLU intent classification. In my case I have three intents: small talk, task intent 1, and task intent 2.
Data for small talk is dominant in the training data, and I use de_core_news_lg as the spaCy model in my NLU pipeline.
I tested the NLU with unseen data like “switch on the light”, i.e. utterances that are totally new to the model and not related to any of the task intents. The classifier still outputs the small talk intent with a confidence of 99%.
My expectation was that the confidence would be low, because the utterance is completely unrelated to the training data.
Why is the confidence so high?
Thanks in advance!
This is a general downside of classification models: the output probability/confidence values always sum up to one.
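To make that concrete, here’s a minimal sketch of why a softmax output can never say “none of the above” (the logits below are made up for illustration):

```python
import numpy as np

# Hypothetical logits a classifier might produce for three intents,
# given an utterance unlike anything in the training data.
logits = np.array([2.1, -0.3, -1.7])

# Softmax turns logits into probabilities that sum to 1 by construction,
# so 100% of the "confidence" must be divided among the known classes.
probs = np.exp(logits) / np.exp(logits).sum()
print(probs.round(2))  # [0.9  0.08 0.02]
print(probs.sum())     # 1.0
```

However unrelated the input is, one of the known intents has to absorb most of that probability mass.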
There’s a long story about what I think is going on here (blogpost, pydata talk), but I’ll try to give the short version here.
Let’s say there are three classes that I’d like to classify. Take this artificial example:
I could train an algorithm on this dataset and it might produce a prediction like this:
The strong colors indicate that the model has high confidence (>0.8), and the weaker colors indicate less certain regions. The model is less certain in regions where the two classes overlap.
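If you want to reproduce the idea yourself, here’s a minimal sketch with scikit-learn; the blob dataset and logistic regression are stand-ins for the plots above, not the exact setup:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Three artificial classes in 2D, analogous to the example above.
X, y = make_blobs(n_samples=300, centers=3, cluster_std=1.5, random_state=42)

clf = LogisticRegression().fit(X, y)

# Near the training data: high confidence inside dense regions,
# lower confidence where two clusters overlap.
print(clf.predict_proba(X[:5]).round(2))
```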
Now here’s the issue: let’s zoom out a bit.
Notice how the algorithm still assigns a strong red color in regions where it has seen no data whatsoever.
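The same thing shows up in code. Continuing the sketch above, take a point far away from anything the model has seen; the predicted probabilities are still extreme:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

X, y = make_blobs(n_samples=300, centers=3, cluster_std=1.5, random_state=42)
clf = LogisticRegression().fit(X, y)

# A point nowhere near the training data -- the 2D equivalent of
# sending "switch on the light" to a small-talk-heavy model.
far_away = np.array([[100.0, 100.0]])

# The model has never seen anything remotely like this point,
# yet it still assigns almost all probability mass to one class.
print(clf.predict_proba(far_away).round(3))  # e.g. [[1. 0. 0.]]
```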
This is a general phenomenon in classification algorithms. The algorithm will look for the examples that are most similar, even if the distance from the training examples is huge. There’s no outlier detection happening when it computes a confidence score. The number is a proxy for confidence, but it is not 100% the same thing. My colleague Rachael also made a YouTube video about how you might interpret this confidence value.
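If you want an explicit “this looks out of scope” signal, you have to add it yourself. As a crude sketch (my own illustration, not something Rasa does for you), the distance to the nearest training example already works as a simple outlier score:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.5, random_state=42)

# Index the training data so we can look up the closest known example.
nn = NearestNeighbors(n_neighbors=1).fit(X)

def looks_out_of_scope(point, threshold=5.0):
    # `threshold` is an arbitrary cutoff you would tune on your own data.
    dist, _ = nn.kneighbors(np.array([point]))
    return dist[0, 0] > threshold

print(looks_out_of_scope([100.0, 100.0]))  # True: far from all training data
print(looks_out_of_scope(X[0].tolist()))   # False: it is a training point
```

In an NLU setting you would compute this on text embeddings rather than raw 2D points, but the principle is the same.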
I can’t be 100% sure that there isn’t something odd happening in your training data. It could be that some of your classes cover a lot of ground linguistically while others are very narrow in terms of meaning; that could also cause overlap. But the main thing to take away is that this confidence score is more like “artificial confidence” than actual confidence, much like how “artificial intelligence” is much more “artificial” than actual intelligence.