Is the NLU confidence score actually a probability distribution?

Hi everyone,

I want to use the NLU part of a Rasa model. I’ve played around a little by modifying the moodbot example. I noticed that, whatever my input, the ‘confidence scores’ always sum to 1 or close to 1. Could it be that these confidences are actually a probability distribution over all intents? If so, is it possible to get the underlying real confidences for this simple model? Or should I use a different classifier, and if so, which one (moodbot uses the DIETClassifier)?

Note: We want to apply a threshold to the intent ranking. However, this does not work if we get high confidence scores for nonsensical input, e.g. in the moodbot, “whim” gets a confidence of 0.8 for the intent “greet”.
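To make it concrete, this is roughly what we’d like to do (a minimal sketch, assuming the standard parse output format Rasa NLU returns, e.g. from the /model/parse endpoint; the threshold value is just illustrative):

```python
# Minimal sketch (not Rasa API code): apply a fallback threshold to the parse
# result returned by Rasa NLU. The dict layout ("intent", "intent_ranking",
# "confidence") follows Rasa's documented NLU output format.

FALLBACK_THRESHOLD = 0.7  # illustrative value, would have to be tuned on real data


def resolve_intent(parse_result: dict, threshold: float = FALLBACK_THRESHOLD) -> str:
    """Return the top intent name, or a fallback label if its confidence is too low."""
    top = parse_result.get("intent", {})
    if top.get("confidence", 0.0) >= threshold:
        return top["name"]
    return "nlu_fallback"


example = {
    "text": "whim",
    "intent": {"name": "greet", "confidence": 0.8},
    "intent_ranking": [
        {"name": "greet", "confidence": 0.8},
        {"name": "goodbye", "confidence": 0.15},
    ],
}
print(resolve_intent(example))  # "greet" -- exactly the over-confident case described above
```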

Thanks for your help!

Hey @s8ttschm, welcome to the forum and thanks for an interesting question!

You are right: the scores are meant to sum to 1, like a probability distribution; that is intended behaviour.
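To illustrate with a toy example (my own sketch, not DIET’s actual code): the classifier produces one raw similarity score per intent, and what you see as confidences is a softmax over those scores, which is why they always sum to 1 no matter how nonsensical the input was:

```python
import math

# Toy illustration: softmax turns arbitrary raw similarity scores into values
# that always sum to 1. The raw numbers here are made up.
raw_similarities = {"greet": 2.1, "goodbye": 0.3, "mood_great": -0.5}


def softmax(scores: dict) -> dict:
    exp = {name: math.exp(s) for name, s in scores.items()}
    total = sum(exp.values())
    return {name: e / total for name, e in exp.items()}


confidences = softmax(raw_similarities)
print(confidences)                 # "greet" gets by far the largest share
print(sum(confidences.values()))   # 1.0 (up to floating point)
```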

As for the moodbot example, it is a very simplistic one with very little data, hence I wouldn’t treat the scores (whether the scaled ones that sum to 1, or the raw ones) as reliable confidence measures. To get more reliable ones, I definitely recommend adding much more training data, or training on a different, larger dataset.

Raw scores: If you mean the similarity scores between message embeddings and intent embeddings, then I’ll say one thing: thresholding based on those scores is even trickier than using the scaled scores. However, we’ll soon be adding the possibility to get raw similarities during inference; you might be especially interested in cosine similarities, which could theoretically be used for thresholding too.
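To make the distinction concrete, here’s a toy sketch (my own illustration, with made-up embedding values): cosine similarity is just the dot product of the L2-normalised message and intent embeddings, so it is bounded to [-1, 1], which makes it somewhat easier to threshold than unbounded raw similarities:

```python
import numpy as np

# Toy illustration: cosine similarity between a message embedding and each
# intent embedding. In DIET these embeddings come from the trained model;
# the numbers below are invented for the example.
message = np.array([0.2, -1.3, 0.7])
intent_embeddings = {
    "greet": np.array([0.1, -1.0, 0.9]),
    "goodbye": np.array([-0.8, 0.4, 0.2]),
}


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


similarities = {name: cosine(message, emb) for name, emb in intent_embeddings.items()}
print(similarities)  # each value lies in [-1, 1]; a threshold can be applied directly
```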

A note on your example with 0.8: I’ve seen NLP models (BERT, for example) become very confident, typically producing softmax scores around 0.99 for correct predictions and around 0.9-0.95 for incorrect ones. Hence, I think it’s important to adjust your threshold based on the observed scores. It may be that a relatively high number like 0.8 actually points to incorrect predictions quite reliably…
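For example, a rough way to pick that threshold (a sketch, assuming you already have confidences for a labelled evaluation set) is to compare the scores the model assigns to correct vs. incorrect predictions and choose a cut-off that separates them, rather than picking a “nice” number like 0.5 up front:

```python
# Sketch of a calibration check: summarise confidences on correct vs. incorrect
# predictions from a labelled evaluation set and pick a threshold between them.
# The numbers below are made up for illustration.
correct_scores = [0.99, 0.98, 0.97, 0.95]
incorrect_scores = [0.93, 0.90, 0.88, 0.80]


def summary(scores):
    return min(scores), sum(scores) / len(scores), max(scores)


print("correct   (min/mean/max):", summary(correct_scores))
print("incorrect (min/mean/max):", summary(incorrect_scores))
# Here anything below ~0.95 is suspicious, even though 0.8 sounds "confident".
```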

Let me know if you’ve got any more questions :slight_smile:

Thanks for your explanation. For our project we need scores that are comparable over different runs. Normalized scores are absolutely meaningless in that context. Based on your assessment that similarity scores are difficult to deal with, we’ve decided to apply a sigmoid instead of softmax to the similarity scores.

I’ll share future insights with the community.

*Edit: the function where we replaced softmax with sigmoid is confidence_from_sim in rasa-master > rasa > utils > tensorflow > layer.py
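For anyone who wants to try the same idea, this is the gist of the change in isolation (a toy sketch, not the actual Rasa implementation): apply an element-wise sigmoid to the raw similarities, so each intent gets an independent score in (0, 1) that no longer has to sum to 1 across intents:

```python
import math

# Toy sketch of the idea: element-wise sigmoid instead of softmax over the raw
# similarity scores. Each intent gets an independent score in (0, 1); the
# scores no longer sum to 1, so they are comparable across runs and inputs.
raw_similarities = {"greet": 2.1, "goodbye": 0.3, "mood_great": -0.5}  # made-up numbers


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


confidences = {name: sigmoid(s) for name, s in raw_similarities.items()}
print(confidences)                 # e.g. {'greet': 0.89, 'goodbye': 0.57, ...}
print(sum(confidences.values()))   # generally not 1
```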

@s8ttschm that’s fair. @dakshvar22 is this similar to the change you’ve made? (Adding the sigmoid.)

@SamS It looks similar but slightly different.

@s8ttschm you can try out the sigmoid_loss branch. Set constrain_similarities=True and model_confidence=cosine in your configuration for DIETClassifier. constrain_similarities=True applies a sigmoid term to the loss function during training; based on experiments, we believe this helps the model learn similarities that generalize better. model_confidence=cosine computes cosine similarities from the inner-product similarities during inference and directly returns those as model confidences. If you compare the intent confidence histograms before and after this change, you’ll see that finding a fallback threshold is easier than before, and the model is usually not over-confident on wrong predictions. We plan to merge this change very soon, and it should be out with the next Rasa Open Source release this month.
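For reference, the configuration described above would look roughly like this in config.yml (a sketch based on this post; the exact option names, values and availability depend on the branch or release you are running):

```yaml
# Sketch of a config.yml pipeline entry based on the description above.
pipeline:
  - name: WhitespaceTokenizer
  - name: CountVectorsFeaturizer
  - name: DIETClassifier
    epochs: 100
    constrain_similarities: true   # adds the sigmoid term to the training loss
    model_confidence: cosine       # report cosine similarities as confidences
```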

Applying sigmoid only during inference will behave similarly to before, because the sigmoid function also saturates outside a certain range. Also, if you are not using sigmoid during training, there will be a mismatch between the activation functions used during training and inference, because of which the model may not work very well in the wild.


Your comment has been very helpful, indeed. Thank you so much!

> Also, if you are not using sigmoid during training, there will be a mismatch between the activation functions used during training and inference, because of which the model may not work very well in the wild.

Thanks, I hadn’t considered that!