ResponseSelector low accuracy

Hi all,

I am facing problems with my response selector. Its accuracy is very low (0.02), and of course it gives a lot of false answers. My pipeline is:

language: el
pipeline:
  - name: HFTransformersNLP
    model_name: "bert"
    model_weights: "bert-base-multilingual-cased"
    cache_dir: packages/langdata
  - name: LanguageModelTokenizer
  - name: LanguageModelFeaturizer
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: "char_wb"
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 100
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 100

I am using it only for a single intent (chitchat), but I have many sentence pairs (> 4000). The documentation says it is similar to the DIETClassifier, but I couldn't figure out what it actually does, and whether the default values of the network (hidden_layers_sizes, embedding_dimension, etc.) are enough for my case…
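For example, this is what I mean about the defaults: overriding them explicitly would look roughly like the snippet below (the values are placeholders, not tuned recommendations, and I don't know whether larger sizes would help with > 4000 pairs):

  - name: ResponseSelector
    epochs: 100
    # placeholder values, not recommendations
    hidden_layers_sizes:
      text: [256, 128]
      label: [256, 128]
    embedding_dimension: 20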

@petasis What version of Rasa are you using?

I am using 1.10.1

How is the performance of the DIETClassifier? It might be that the language model you are using is not the best fit. Did you try any other language model, or even training without any pre-trained language model? Also, it is quite hard to say what is going wrong without looking at the data. Could you maybe share an excerpt of your data?
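For the "without any pre-trained language model" option, a minimal sketch would be something along these lines (just a starting point using sparse features only, not a tuned recommendation):

language: el
pipeline:
  - name: WhitespaceTokenizer
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: "char_wb"
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 100
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 100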

Hi again. An example:

My nlu.md file has:

## intent: chitchat/greetings0en02
- Good Morning
- Good Morning.

My responses.md has:

## greetings0en02
* chitchat/greetings0en02
    - Good day.
    - Good Morning.

I am using bert multilingual embeddings (“bert-base-multilingual-cased”). In rasa shell, I type “good morning”.

2020-09-08 12:26:57 DEBUG    rasa.core.policies.mapping_policy  - The predicted intent 'chitchat' is mapped to  action 'respond_chitchat' in the domain.
2020-09-08 12:26:57 DEBUG    rasa.core.policies.form_policy  - There is no active form
2020-09-08 12:26:57 DEBUG    rasa.core.policies.fallback  - NLU confidence threshold met, confidence of fallback action set to core threshold (0.3).
2020-09-08 12:26:57 DEBUG    rasa.core.policies.ensemble  - Predicted next action using policy_2_MappingPolicy
2020-09-08 12:26:57 DEBUG    rasa.core.processor  - Predicted next action 'respond_chitchat' with confidence 1.00.
2020-09-08 12:26:57 DEBUG    rasa.core.actions.action  - Picking response from selector of type default
2020-09-08 12:26:57 DEBUG    rasa.core.processor  - Action 'respond_chitchat' ended with events '[BotUttered('Thank you!', {"elements": null, "quick_replies": null, "buttons": null, "attachment": null, "image": null, "custom": null}, {}, 1599557217.3225248)]'

The intent is correct, “chitchat”. I am not sure whether the “sub-intent” is correct, but the response selector returns “Thank you!”.

Why isn’t one of “Good day.” or “Good Morning.” selected? Is whatever follows the ‘/’ in the intent name ignored?

@petasis Thanks for providing some examples. Can you try the following config -

language: el
pipeline:
  - name: HFTransformersNLP
    model_name: "bert"
    model_weights: "bert-base-multilingual-cased"
    cache_dir: packages/langdata
  - name: LanguageModelTokenizer
  - name: LanguageModelFeaturizer
    alias: "lmf"
  - name: RegexFeaturizer
    alias: "rf"
  - name: LexicalSyntacticFeaturizer
    alias: "lsf"
  - name: CountVectorsFeaturizer
    alias: "cvf_w"
  - name: CountVectorsFeaturizer
    analyzer: "char_wb"
    min_ngram: 1
    max_ngram: 4
    alias: "cvf_c"
  - name: DIETClassifier
    epochs: 100
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 100
    featurizers: ["cvf_w", "lmf"]

This basically excludes the features from the RegexFeaturizer, LexicalSyntacticFeaturizer, and char-level CountVectorsFeaturizer from being used by the response selector.

@dakshvar22 Thank you for answering this. To tell the truth, in the meantime I wrote my own response selector, which is much, much simpler. I also started a new post: “I am trying to create a new response selector: How to prepare features for the class?”

What I did is simple: I map the input to an intent (such as chitchat/action1), and then I select a response from all the available responses in the responses.md file.

This works better than Rasa’s response selector (far fewer errors), although it still has some misclassifications. And it solves the problem of some responses never being returned.

I think in my case, where the chitchat data is huge (~23,000 questions and almost as many answers), embedding both questions and answers in the same space and selecting by similarity is not a good strategy.
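In the config, my component is simply registered as a custom NLU component in place of the built-in ResponseSelector, roughly like this (the class path is my own package, not something shipped with Rasa):

  - name: packages.ResponseSelectorThroughIntent.ResponseSelectorThroughIntent
    epochs: 50
    random_seed: 20212020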

@petasis Thanks for that feedback. We have actually made that possible in the upcoming release of Rasa Open Source 2.0. Expect it to be out sometime this week or early next week.

Yes, I saw it. I will try it, probably tomorrow.

@dakshvar22 I have tested the new response selector, but unfortunately it fails:

 2020-10-05 16:12:42 DEBUG    rasa.nlu.selectors.response_selector  - Following metrics will be logged during training: 
2020-10-05 16:12:42 DEBUG    rasa.nlu.selectors.response_selector  -   t_loss (total loss)
2020-10-05 16:12:42 DEBUG    rasa.nlu.selectors.response_selector  -   r_acc (response acc)
2020-10-05 16:12:42 DEBUG    rasa.nlu.selectors.response_selector  -   r_loss (response loss)
2020-10-05 16:12:42 DEBUG    rasa.utils.tensorflow.models  - Building tensorflow train graph...
2020-10-05 16:13:09 DEBUG    rasa.utils.tensorflow.models  - Finished building tensorflow train graph.
Epochs: 100%|=========| 50/50 [23:17<00:00, 27.96s/it, t_loss=4.390, r_loss=4.390, r_acc=0.048]

Accuracy is extremely low. My pipeline is:

pipeline:
  - name: packages.LanguageDetection.LanguageDetection
  - name: HFTransformersNLP
    # Name of the language model to use
    model_name: "bert"
    # Pre-Trained weights to be loaded
    #model_weights: "nlpaueb/bert-base-greek-uncased-v1"
    model_weights: "bert-base-multilingual-uncased"
    cache_dir: packages/langdata
    alias: "embeddings"
  - name: LanguageModelTokenizer
    # Flag to check whether to split intents
    intent_tokenization_flag: False
    # Symbol on which intent should be split
    intent_split_symbol: "_"
  - name: LanguageModelFeaturizer
    alias: "lmf"
  - name: RegexFeaturizer
    # Text will be processed with case sensitive as default
    case_sensitive: True
    alias: "rf"
  - name: CountVectorsFeaturizer
    analyzer: "char_wb"
    min_ngram: 1
    max_ngram: 4
    use_lemma: False
    # Set the out-of-vocabulary token
    OOV_token: "_oov_"
    # Whether to use a shared vocab
    use_shared_vocab: False
  - name: RegexEntityExtractor
  - name: DIETClassifier
    epochs: 50
    random_seed: 20212020
  - name: EntitySynonymMapper
#  - name: packages.ResponseSelectorThroughIntent.ResponseSelectorThroughIntent
#    epochs: 50
#    random_seed: 20212020
  - name: ResponseSelector
    epochs: 50
    random_seed: 20212020
    featurizers: ["lmf"]
  - name: FallbackClassifier
    threshold: 0.4
    ambiguity_threshold: 0.1

What version of Rasa did you try? Also, how many examples for response selector do you have?

rasa --version
Rasa Version     : 2.0.0rc3
Rasa SDK Version : 2.0.0rc1
Rasa X Version   : None
Python Version   : 3.8.5 (default, Aug 12 2020, 00:00:00) 
Operating System : Linux-5.8.12-200.fc32.x86_64-x86_64-with-glibc2.2.5
Python Path      : /usr/bin/python3
intent: flight_departure_info, training examples: 5953   
intent: flight_arrival_info, training examples: 203   
intent: inform, training examples: 2429   
intent: affirm, training examples: 32   
intent: deny, training examples: 19   
intent: stop, training examples: 58   
intent: search_encyclopedia, training examples: 32   
intent: search_weather, training examples: 33   
intent: chitchat_el, training examples: 21488   
intent: insult, training examples: 236   
intent: thank_you, training examples: 102   
intent: chitchat_en, training examples: 1713

The response selector is concerned with chitchat* intents.

@dakshvar22 Is there a document describing how the response selector works in Rasa 2.0rc4? The results are very low, while at the same time a simple DIET classifier (that simply maps the input to a class like chitchat/ask) reaches ~0.98. In reality it is much lower (i.e. when asking all questions, the final accuracy is ~0.8), but still Rasa’s response selector does not seem able to handle the data. Perhaps a problem with the training data format?
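For reference, this is my understanding of the 2.0 YAML format for the earlier greeting example (migrated by hand from nlu.md/responses.md, so it may not be exactly what the migration tool produces):

nlu:
- intent: chitchat/greetings0en02
  examples: |
    - Good Morning
    - Good Morning.

responses:
  utter_chitchat/greetings0en02:
  - text: "Good day."
  - text: "Good Morning."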

Ok, for my case, my classifier seems to work better. So, I tried to simulate this in the Rasa 2.0 ResponseSelector, and the results are promising:

  - name: ResponseSelector
    number_of_transformer_layers: 2
    transformer_size: 256
    scale_loss: False
    use_sparse_input_dropout: True
    use_dense_input_dropout: True
    hidden_layers_sizes:
      text:  []
      label: []
    featurizers: ["cvf_w"]
    use_text_as_label: False
    epochs: 100
    random_seed: 20212020

Switching to a transformer achieves accuracy=0.9742 when evaluated on the training data, which is ok…

Thanks for sharing your results. Good to know it works better with transformer layers. Do you have a test set as well on which you can evaluate?