Fallback doesn't work with two retrieval intents (FAQ and chitchat)

Hello,

I’m working with Rasa 2.3.4.

I have two retrieval intents (faq and chitchat). When I provide a random input, Rasa NLU classifies it as one of those retrieval intents, when logically it should fall back to nlu_fallback.

This problem also occurred in previous versions.

I found that, with the new model_confidence option in 2.3.4, "this should ease up tuning fallback thresholds as confidences for wrong predictions are better distributed across the range [0, 1]".

But in my case, that didn’t work.

config.yml (528 Bytes)

domain.yml (705 Bytes)

chitchat.yml (1.2 KB)

faq.yml (598 Bytes)

rules.yml (282 Bytes)

I also tried varying the model_confidence parameter (softmax, linear_norm, and even cosine in <=2.3.3).
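For reference, the relevant pipeline entries look roughly like this (a sketch with illustrative values, not an exact copy of the attached config.yml; the threshold value is just an example):

```yaml
language: en
pipeline:
  - name: WhitespaceTokenizer
  - name: CountVectorsFeaturizer
  - name: DIETClassifier
    epochs: 100
    model_confidence: linear_norm   # also tried softmax, and cosine on <=2.3.3
  - name: ResponseSelector
    epochs: 100
    retrieval_intent: faq
  - name: ResponseSelector
    epochs: 100
    retrieval_intent: chitchat
  - name: FallbackClassifier
    threshold: 0.7                  # illustrative value; no threshold works well here
```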

When I run the rasa shell nlu command, I find that the confidence for random inputs is far too high, as in the following example:

[screenshot: rasa shell nlu output for the input "ab", classified with confidence 1.0]

The problem is that this ‘ab’ token doesn’t exist in the training data, and on top of that the min/max char n-gram is 4.
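The char n-gram setting I’m referring to is the character-level CountVectorsFeaturizer in the pipeline, roughly (again a sketch, not the exact attached file):

```yaml
  - name: CountVectorsFeaturizer
    analyzer: char_wb   # character n-grams within word boundaries
    min_ngram: 4
    max_ngram: 4
```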

I tried testing this on other projects with more training data, but I always get the same results.

I think that this is a problem, especially when we create Q/A assistants.

I hope you could give me some insights on that. Thanks.

It seems weird to me that you have confidence 1.0. That shouldn’t be possible with ML components. Did you use rasa shell or rasa interactive for this?

Hey @Tobias_Wochinger ,

Thanks for your reply.

It’s weird for me too. As I mentioned before, I used the rasa shell nlu command.

Hi @Yasmine, I tried out your assistant locally, and there are two factors contributing to this:

  1. The assistant has only 2 intents.
  2. The confidence measure normalizes confidences across intents.

We plan to ship a new option for model_confidence as a solution to (2), which would output absolute similarities as confidences.

However, do you plan to have only 2 intents as part of your assistant?

Hello @dakshvar22,

Thank you for your reply.

Yes, some Q/A assistants need only 2 retrieval intents to work properly, like chitchat and faq in this case.
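For reference, the rules follow the standard retrieval-intent pattern, roughly like this (a sketch, not my exact rules.yml; utter_default is just a placeholder fallback response name):

```yaml
rules:
  - rule: Respond to FAQ
    steps:
      - intent: faq
      - action: utter_faq

  - rule: Respond to chitchat
    steps:
      - intent: chitchat
      - action: utter_chitchat

  - rule: Handle low NLU confidence
    steps:
      - intent: nlu_fallback
      - action: utter_default
```

The issue is that random inputs rarely reach the nlu_fallback rule, because they get classified as faq or chitchat with high confidence.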

Separately, I tried some assistants with classical intents instead of two retrieval intents. But I noticed that the linear_norm option gives very low confidence (the opposite of softmax). By low I mean a confidence of 0.016… for an input that already exists in the training data.

Could you give me an example in which model_confidence=linear_norm is helpful?

Thanks in advance!

That is actually better, because with softmax the model is overly confident about almost everything (right or wrong). If an input that already exists in the training data is being classified with such low confidence, it means it is being highly confused with some other intent, and the training data should be investigated for such a clash. model_confidence=softmax masks such problems in many cases.

Hey @dakshvar22 ,

Thank you for your reply.

In fact, I tried that with many examples from different intents, and the confidence is still low. The problem here is that I couldn’t set a good fallback threshold to reject wrong inputs. Also, I’ve noticed that this behavior happens when the project is large; with small projects I get the same problem as with softmax (high confidence for random inputs).

I converted the previous files to use normal intents:

chitchat.yml (1.1 KB)

faq.yml (582 Bytes)

rules.yml (889 Bytes)

config.yml (397 Bytes)

domain.yml (730 Bytes)

So here I was expecting better-behaved confidences, because I have more than 2 intents (9 intents). But with the same input I tested before, I got:

[screenshot: rasa shell nlu output for the same "ab" input with the 9-intent assistant]

The confidence decreases, but it is still high.

So even with 9 intents the confidence is high (even though the input doesn’t exist in the training data and the min/max char n-gram is 4).

Thanks!

Do you mean that for large projects it’s not a problem? By large project I mean a large number of intents, with each intent having a good amount of data.

Hey @dakshvar22 ,

In fact, I found 2 things when testing the new model_confidence value linear_norm:

With a large project, I got very low confidences even though the example already exists in the training data (like the 0.016… value I mentioned before).

With a small project, I got high confidences even though the example doesn’t exist in the training data (like the previous example with the “ab” input).

Thanks!

Hi Yasmine, thanks for clarifying that.

For a small project, this is somewhat expected: with a small amount of data, the model isn’t able to learn properly what’s legitimate input and what’s gibberish. It needs more data to figure that out. I would park that problem for now, because in production you wouldn’t have such a small amount of data anyway.

For a large project, as I mentioned, if an example is being classified with low confidence with linear_norm, it means multiple intents are competing for the correct class. That is most likely happening because of wrong annotations, overlapping intent classes, or similar examples across different intents.

I would like to go deeper into the latter problems with your assistant. As a first step, are you familiar with how to install Rasa from source and work with experimental branches of Rasa Open Source? This is the recommended way to install from source.

The objective is to try out small changes in the source code and see what works best in your case. I can’t guarantee that we’ll reach a solution, but I’m sure we’ll learn something about your assistant and what’s really happening in the model. Let me know if you are up for some experimentation :slight_smile:

Hey @dakshvar22 ,

Thank you for your quick reply.

I have similar examples across different intents in my training data. Maybe the problem is due to that.

I will try to clean my dataset and install Rasa from source as you suggested.

And yes, if you have any other suggestions, I’m willing to try them.

Thank you so much.

Awesome!

I have similar examples across different intents in my training data. Maybe the problem is due to that.

That definitely sounds like a valid problem. I’d suggest checking how you can reduce that overlap. It may require some restructuring of intents.

if you have any other suggestions, I’m willing to try them.

I have created a new branch named investigate_low_confidence. Once you have installed Rasa from source, you can pull this branch and switch to it. It would be great if you could re-train your large assistant using this branch with constrain_similarities: True and model_confidence: linear_norm. Then compare the test F1 scores when trained on this branch vs. when trained with version 2.4.0 of Rasa Open Source. Also compare the predicted confidences for examples that are present in the training data and for gibberish examples. Let me know your observations and then we can think of some next steps.
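Concretely, the two flags go on DIETClassifier (and on the ResponseSelectors, if you still use retrieval intents), roughly like this (epochs and the rest of the pipeline as in your existing config):

```yaml
  - name: DIETClassifier
    epochs: 100
    constrain_similarities: True
    model_confidence: linear_norm
  # the same two parameters can also be set on any ResponseSelector components
```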

Did you ever find a solution to this? I am facing the same problem trying to trigger a fallback.