Linear norm confidence score is unreliable

Hi, I am trying to use the linear_norm setting for confidence score calculation, but I found that it does not produce reliable confidence scores even though it is the recommended setting. To highlight the issue, I trained an NLU model with a few generic intents (e.g., yes, no, greeting) and evaluated it on the training paraphrases. To my surprise, the confidence scores vary drastically across intents. For example, the confidence scores for paraphrases in the “yes” intent are around 0.55, while for the “goodbye” intent they are 1.0.

In such a scenario, I am struggling with how to set a confidence threshold. Has anyone else faced a similar situation? Am I doing something wrong, or is there a way to fix this?

I believe I need some way to normalize the confidence scores across intents but I don’t know if there is a parameter in the pipeline that could help with that. Would really appreciate any help I can get.

Below are my nlu.yml and config.yml files for reference. I have also included an evaluation report with the predictions and confidence scores I get from the NLU model.

config.yml (376 Bytes)

nlu.yml (18.1 KB)

evaluation_report.json (98.1 KB)

Hi. Although not strictly the same issue, I am also seeing unexpected behavior with linear_norm: the confidence returned does not span the 0-1 range but only a subset of it, e.g. 0.07 to 0.13 for both the successes and the errors. The exact same pipeline run with model_confidence: softmax gives confidences within the expected 0-1 range. I am wondering whether these problems stem from the same origin?

Hi @utkmittal1 I trained a model locally with your dataset and config. What I noticed by looking at your dataset was that there is a high degree of overlap in your intent structure.

One subset of overlapping intent names (as an example) is: no, no thanks, thanks, greeting thanks. Example training data points for each of these intents:

- intent: no thanks
  - text: no that is it thank you for your assistance
  - text: not this time thanks though
  - text: no thanks that will be all
  - text: no that is it thanks so much
- intent: thanks
  - text: that is helpful thank you for your help today
  - text: thank you so much for your assistance
  - text: thank you very much
  - text: thank you poonam
  - text: thank you for your service
- intent: no
  - text: i say no
  - text: it is not
  - text: no never
  - text: do not
  - text: no leave it
  - text: no that has fine
- intent: greeting thanks
  - text: hi pranav thank you
  - text: hi shaina thank you
  - text: hello jasmeet thank you
  - text: hello thanks

The model accuracy when trained on the dataset is 1, but when you run one of the training examples through rasa shell you’ll see the confidences distributed across all these overlapping intents:

Next message:
no that is it thank you for your assistance
{
  "text": "no that is it thank you for your assistance",
  "intent": {
    "id": 2939475550064802194,
    "name": "no thanks",
    "confidence": 0.3340393006801605
  },
  "entities": [],
  "intent_ranking": [
    {
      "id": 2939475550064802194,
      "name": "no thanks",
      "confidence": 0.3340393006801605
    },
    {
      "id": 220804877218803758,
      "name": "thanks",
      "confidence": 0.2225290834903717
    },
    {
      "id": -1234731161240089573,
      "name": "no",
      "confidence": 0.1069733276963234
    },
    {
      "id": 6169020314623551898,
      "name": "greeting thanks",
      "confidence": 0.09575698524713516
    },
    {
      "id": 8050528673545081153,
      "name": "yes thanks",
      "confidence": 0.0944322943687439
    }
  ]
}

Using model_confidence: softmax would just hide this problem as the confidence for top intent will be boosted towards 1.
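To illustrate the point about softmax hiding the overlap, here is a minimal Python sketch. It assumes that linear_norm clips negative similarities to zero and divides each one by the sum, which is a simplification for illustration and not Rasa's actual implementation. The similarity values are made up.

```python
import math


def softmax(scores):
    """Exponentiate and normalize; gaps between scores get amplified."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear_norm(scores):
    """Assumed behaviour: clip negatives to zero, then divide by the sum."""
    clipped = [max(s, 0.0) for s in scores]
    total = sum(clipped)
    if total == 0:
        return [0.0 for _ in scores]
    return [c / total for c in clipped]


# Hypothetical raw similarity scores for five overlapping intents.
similarities = {
    "no thanks": 6.0,
    "thanks": 4.0,
    "no": 2.0,
    "greeting thanks": 1.7,
    "yes thanks": 1.7,
}

soft = softmax(list(similarities.values()))
lin = linear_norm(list(similarities.values()))
for name, s, l in zip(similarities, soft, lin):
    print(f"{name:16s} softmax={s:.3f} linear_norm={l:.3f}")
```

Under these assumptions the top intent dominates after softmax, while the linear normalization keeps the overlap between the intents visible in the scores.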

I’d suggest improving the intent structure, because without that the model will eventually get confused as more data points are added, even with model_confidence: softmax. For example, you could have just the individual intents thank you, no, and greeting. Is there any particular reason why you would like to create combinations of these?

@gdl1 Can you please share a small reproducible example? I again suspect there are a lot of overlapping intents, which might be causing this for you as well.

@dakshvar22: Thanks for looking into the issue. I agree that there are intents that are similar and that I could combine some of them. However, I have several other real intents in the training dataset that are very similar to each other, and that is where I am having trouble: I cannot combine them, and I cannot find a reliable way to have a single confidence score threshold that I can use for all the intents.

I tried using model_confidence: softmax, but the issue with that is that my model would frequently predict intents with high confidence even for paraphrases that are completely unrelated. For example, it predicted “yes” for paraphrases like “why”, “what to do”, etc. with a score of 0.99+.

This was the reason I tried model_confidence: linear_norm, but the confidence scores with linear_norm are all over the place whenever I have intents with somewhat similar paraphrases. The reality is that most training datasets will contain intents with similar paraphrases, and there is no humanly possible way to keep the paraphrases of different intents distant from each other.

I have several other real intents in the training dataset that are very similar to each other, and that is where I am having trouble: I cannot combine them, and I cannot find a reliable way to have a single confidence score threshold that I can use for all the intents.

Can you give an example of such intents?

@dakshvar22: Let me see if I can share a bigger list of intents and paraphrases here but in the meantime here are a few examples:

- intent: make payment
  - I want to make a payment today
  - Can you help me make a payment?

- intent: payment extension
  - I wanted to check if I can extent my payment 
  - can you check if I am eligible for a payment extension

- intent: payment arrangement
  - I want to make a payment arrangement
  - Can you help me schedule a payment

These are very similar intents based on the paraphrases, but at the same time I cannot combine them because they actually represent very different use cases.

payment extension can stay as a separate intent but you can merge make payment and payment arrangement into one intent and design your conversation flow through a form or a story which asks the user whether the payment has to be done immediately or does it have to be scheduled.

Nevertheless what are the predicted confidences when you train on a dataset with these intents? I couldn’t find these intents in the file you shared earlier.

Also, I would advise against using any kind of generated NLU data (e.g. paraphrases). You can read more about this here

@dakshvar22 I am using real user paraphrases to train my NLU, so I shared only a subset of intents that were not sensitive. I did some cleaning based on your suggestion but I am still facing the same issue. I am attaching my nlu.yml file and the evaluation report for you to take a look. The average confidence scores on the training paraphrases are below.

  "yes": 0.55056674612893,
  "greeting": 0.7431065042813619,
  "no": 0.8356322960721122,
  "thanks": 0.7859040173617277,
  "goodbye": 0.6338631226902917,
  "hurry up": 0.4691324098543687,
  "approved": 0.8311935861905416,
  "payment arrangement": 0.9316249246950503,
  "take_your_time": 0.7207559019327163,
  "help_requested": 0.4890884790155623,
  "confirm_working": 0.5636549906598197,
  "confirm_not_working": 0.4842517673969269,
  "payment extension": 0.7166380317587602

nlu.yml (9.7 KB) evaluation_report.json (98.1 KB)
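A minimal sketch of how per-intent averages like the ones above could be computed, and how they might be turned into per-intent thresholds instead of a single global one. The record fields (intent, confidence), the sample values and the 0.8 fraction are assumptions for illustration. They do not reflect the actual layout of evaluation_report.json, and per-intent thresholds are a possible workaround rather than a Rasa feature.

```python
from collections import defaultdict
from statistics import mean


def mean_confidence_by_intent(records):
    """Average the confidence of all records that share a true intent."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["intent"]].append(record["confidence"])
    return {intent: mean(values) for intent, values in grouped.items()}


def per_intent_thresholds(baselines, fraction=0.8):
    """Hypothetical rule: accept a prediction if it reaches a fixed
    fraction of the average confidence seen for that intent."""
    return {intent: fraction * value for intent, value in baselines.items()}


# Made-up evaluation records; the field names are assumptions.
records = [
    {"intent": "yes", "confidence": 0.5},
    {"intent": "yes", "confidence": 0.6},
    {"intent": "goodbye", "confidence": 1.0},
    {"intent": "goodbye", "confidence": 1.0},
]

baselines = mean_confidence_by_intent(records)
thresholds = per_intent_thresholds(baselines)
print(baselines)
print(thresholds)
```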

@utkmittal1 could you please provide an nlu test set that is classified wrongly with high confidence?

@gdl1 Are you setting constrain_similarities to True as well in your configuration? I’d recommend doing so, training and testing again and checking the confidence plots again.

@dakshvar22 my current pipeline is as follows:

language: en

pipeline:
- name: SpacyNLP
  model: "en_core_web_lg"
  case_sensitive: false
- name: SpacyTokenizer
- name: SpacyFeaturizer
- name: RegexFeaturizer
- name: LexicalSyntacticFeaturizer
- name: CountVectorsFeaturizer
  analyzer: "word"
- name: CountVectorsFeaturizer
  analyzer: "char_wb"
  min_ngram: 1
  max_ngram: 4
- name: DIETClassifier
  loss_type: cross_entropy
  model_confidence: linear_norm
  constrain_similarities: true
  epochs: 200
  intent_classification: true
  entity_recognition: false
  batch_strategy: balanced
- name: EntitySynonymMapper
# - name: FallbackClassifier
  # threshold: 0.90
  # ambiguity_threshold: 0.1

policies:
- name: MemoizationPolicy
- name: TEDPolicy
  max_history: 5
  epochs: 200
- name: RulePolicy

The only thing I change to obtain the two different histograms is model_confidence.

My data is somewhat sensitive, so I am still working on producing a generic dataset that reproduces the problem.
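For context on why a narrow confidence range matters for the commented-out FallbackClassifier settings in the pipeline above, here is a simplified Python sketch of a threshold-plus-ambiguity rule. The function name should_fall_back is hypothetical and this is not Rasa's code; the two parameters mirror the threshold and ambiguity_threshold values from the config, and the ranking reuses the first two entries of the rasa shell output earlier in the thread.

```python
def should_fall_back(intent_ranking, threshold=0.90, ambiguity_threshold=0.1):
    """Return True if the top prediction is too weak or too close to the runner-up."""
    ranked = sorted(intent_ranking, key=lambda i: i["confidence"], reverse=True)
    if not ranked:
        return True
    top = ranked[0]["confidence"]
    if top < threshold:
        return True  # top intent is not confident enough
    if len(ranked) > 1 and top - ranked[1]["confidence"] < ambiguity_threshold:
        return True  # top two intents are too close to call
    return False


# First two entries of the intent ranking shown earlier in the thread.
ranking = [
    {"name": "no thanks", "confidence": 0.3340393006801605},
    {"name": "thanks", "confidence": 0.2225290834903717},
]

print(should_fall_back(ranking))                 # threshold 0.90
print(should_fall_back(ranking, threshold=0.3))  # much lower threshold
```

With confidences confined to a band such as 0.07 to 0.13, no single threshold value separates correct from incorrect predictions under a rule like this.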

@dakshvar22 I am already using constrain_similarities: True. Here is my pipeline.

language: en

pipeline:
  - name: HFTransformersNLP
    model_name: distilbert
    model_weights: distilbert-base-uncased
    cache_dir: /tmp
  - name: LanguageModelTokenizer
  - name: LanguageModelFeaturizer
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: char_wb
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 50
    model_confidence: linear_norm
    constrain_similarities: True
    loss_type: cross_entropy
  - name: EntitySynonymMapper

policies:
  - name: MemoizationPolicy
  - name: TEDPolicy
    max_history: 5
    epochs: 100
  - name: RulePolicy
    core_fallback_threshold: 0.1
    core_fallback_action_name: action_default_fallback
    enable_fallback_prediction: true

@Ghostvv Here are a few examples that it predicts with high confidence. My data is sensitive so can’t share it here but I created a few examples for you to take a look. The confidence scores for all of these paraphrases are very high but only when I use softmax.

  - text: why
  - text: what to do?
  - text: how come? 
  - text: what are you?
  - text: quiet
  - text: donald trump is not the president
  - text: who won?
  - text: don't talk anymore
  - text: why are you talking so much
  - text: I am scared
  - text: what a goal that was
  - text: this sucks
  - text: this was disappointing

Here is the evaluation report and histogram for these paraphrases. evaluation report.json (2.1 KB)

Based on the training data you provided, I don’t see how the model could assign low confidences to these examples. You have “why not” as an example of the intent yes, so the model assigns “why” to yes with high confidence as well. The same applies to the other examples in the list.

@dakshvar22 I (finally) created a synthetic NLU dataset that demonstrates the exact same problem I am experiencing with my actual dataset (which I could not share). As in my previous post, I observe that a model trained using softmax behaves adequately, while just changing the confidence to linear_norm makes it unusable, as it is no longer possible to set a fallback threshold.

The problem is in the range of confidences, which seems to narrow progressively as the number of intents increases.

In my synthetic problem I created 30 intents revolving around ‘car’ related questions, but my actual problem has 150 intents and the issue is all the more severe (the confidence range is extremely narrow; see Linear norm confidence score is unreliable - #5 by gdl1). Please find attached the data to reproduce, as well as the results.

domain.yml (1.0 KB) config.yml (823 Bytes) nlu.yml (11.1 KB) intent_report_softmax.json (5.9 KB) intent_report_linear_norm.json (6.0 KB)
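The narrowing with the number of intents can be reproduced with a toy Python calculation. It assumes that linear_norm clips negative similarities to zero and divides by the sum (a simplification, not Rasa's actual code) and that every non-predicted intent keeps a moderate positive similarity. The scores 1.0 and 0.5 are made up.

```python
def linear_norm(scores):
    """Assumed behaviour: clip negatives to zero, then divide by the sum."""
    clipped = [max(s, 0.0) for s in scores]
    total = sum(clipped)
    if total == 0:
        return [0.0 for _ in scores]
    return [c / total for c in clipped]


def top_confidence(n_intents, top_score=1.0, other_score=0.5):
    """Confidence of the best intent when all others share one lower score."""
    scores = [top_score] + [other_score] * (n_intents - 1)
    return linear_norm(scores)[0]


# 30 and 150 match the intent counts mentioned in the post above.
for n in (5, 30, 150):
    print(n, round(top_confidence(n), 4))
```

Under these assumptions the top confidence shrinks roughly in proportion to 1/N, so a fixed fallback threshold becomes harder to choose as intents are added.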


@Ghostvv I’m having the same issue when using linear_norm. Any suggestions? Or should we just use softmax? Thanks!
