Hi community, I am confused about NLU intent classification. In my case I have three intents: small talk, task intent 1, and task intent 2.
Data for small talk is dominant in the training data, and I use de_core_news_lg as the spaCy model in my NLU pipeline.
I tested the NLU with unseen data like “switch on the light”, i.e. utterances that are totally new to the model and not related to any of the task intents. The classifier still outputs the small talk intent with a confidence of 99%.
My expectation was that the confidence would be low, because the utterance is completely unrelated to the training data.
Why is the confidence so high?
Thanks in advance!
This is a general downside of classification models: the output probability/confidence values always sum up to one.
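To make that concrete, here’s a minimal sketch of why a softmax output can never say “none of the above” (the logits below are made up for illustration):

```python
import numpy as np

# Hypothetical logits a classifier might produce for three intents,
# given an utterance unlike anything in the training data.
logits = np.array([2.1, -0.3, -1.7])

# Softmax turns logits into probabilities that sum to 1 by construction,
# so 100% of the "confidence" must be divided among the known classes.
probs = np.exp(logits) / np.exp(logits).sum()
print(probs.round(2))  # [0.9  0.08 0.02]
print(probs.sum())     # 1.0
```

However unrelated the input is, one of the known intents has to absorb most of that probability mass.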
There’s a long story about what I think is going on here (blogpost, pydata talk), but I’ll try to give the short version here.
Let’s say there are three classes that I’d like to classify. Take this artificial example:
I could train an algorithm on this dataset and it might produce a prediction like this:
The strong colors indicate that the model has high confidence (>0.8), and the weaker colors indicate less certain regions. The model is less certain in regions where the two classes overlap.
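If you want to reproduce the idea yourself, here’s a minimal sketch with scikit-learn; the blob dataset and logistic regression are stand-ins for the plots above, not the exact setup:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Three artificial classes in 2D, analogous to the example above.
X, y = make_blobs(n_samples=300, centers=3, cluster_std=1.5, random_state=42)

clf = LogisticRegression().fit(X, y)

# Near the training data: high confidence inside dense regions,
# lower confidence where two clusters overlap.
print(clf.predict_proba(X[:5]).round(2))
```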
Now here’s the issue: let’s zoom out a bit.
Notice how the algorithm still assigns a strong red color in regions where it has seen no data whatsoever.
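The same thing shows up in code. Continuing the sketch above, take a point far away from anything the model has seen; the predicted probabilities are still extreme:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

X, y = make_blobs(n_samples=300, centers=3, cluster_std=1.5, random_state=42)
clf = LogisticRegression().fit(X, y)

# A point nowhere near the training data -- the 2D equivalent of
# sending "switch on the light" to a small-talk-heavy model.
far_away = np.array([[100.0, 100.0]])

# The model has never seen anything remotely like this point,
# yet it still assigns almost all probability mass to one class.
print(clf.predict_proba(far_away).round(3))  # e.g. [[1. 0. 0.]]
```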
This is a general phenomenon in classification algorithms. The algorithm will look for the examples that are most similar, even if the distance from the training examples is huge. There’s no outlier detection happening when it computes a confidence score. The number is a proxy for confidence, but it is not 100% the same thing. My colleague Rachael also made a YouTube video about how you might interpret this confidence value.
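If you want an explicit “this looks out of scope” signal, you have to add it yourself. As a crude sketch (my own illustration, not something Rasa does for you), the distance to the nearest training example already works as a simple outlier score:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.5, random_state=42)

# Index the training data so we can look up the closest known example.
nn = NearestNeighbors(n_neighbors=1).fit(X)

def looks_out_of_scope(point, threshold=5.0):
    # `threshold` is an arbitrary cutoff you would tune on your own data.
    dist, _ = nn.kneighbors(np.array([point]))
    return dist[0, 0] > threshold

print(looks_out_of_scope([100.0, 100.0]))  # True: far from all training data
print(looks_out_of_scope(X[0].tolist()))   # False: it is a training point
```

In an NLU setting you would compute this on text embeddings rather than raw 2D points, but the principle is the same.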
I can’t be 100% sure that there isn’t something odd happening in your training data. It could be that some of your classes cover a lot of ground linguistically while others are very narrow in terms of meaning; that could also cause overlap. But the main thing to take away is that this confidence score is more like “artificial confidence” than actual confidence, much like how “artificial intelligence” is much more “artificial” than actual intelligence.