ResponseSelector's accuracy is very low

I switched my RASA from version 1.5.1 to 1.9.6.

I have a custom featurizer based on fastText in my pipeline. I modified it so that DIET can be trained using custom features. It shows high scores (i_acc, e_f1) during training.

But the ResponseSelector's r_acc is very low (it varies randomly between approximately 0.001 and 0.1) and does not improve during training. What could be the reason?

Hi @vitalyuf, can you please share the pipeline configuration that you are using?

Thanks

Hi, @dakshvar22!

Yes, the pipeline is:

pipeline:
- name: "WhitespaceTokenizer"
- name: "yvi_imports.features.FTDenseFeaturizer"
- name: DIETClassifier
  epochs: 100
- name: "ResponseSelector"
  epochs: 1000

Also the code of FTDenseFeaturizer.train is:

...
from numpy import array
...
    def train(self, training_data, cfg, **kwargs):

        for example in training_data.training_examples:
            if 'response_tokens' in example.as_dict_nlu().keys():
                ft_vector = array(self._ft_embedder([tok.text for tok in example.as_dict_nlu()['response_tokens'][:-1]]+[' '.join([tok.text for tok in example.as_dict_nlu()['response_tokens']])], mean=True))
            else:
                ft_vector = [None]
            feats = self._combine_with_existing_dense_features(example, ft_vector)
            example.set("response_dense_features", feats)

        ft_vectors = [array(self._ft_embedder([tok.text for tok in ex.as_dict_nlu()['tokens'][:-1]]+[ex.as_dict_nlu()['text']], mean=True)) for ex in training_data.training_examples]
        for example, vector in zip(training_data.training_examples, ft_vectors):
            feats = self._combine_with_existing_dense_features(example, vector)
            example.set("text_dense_features", feats)

This line [tok.text for tok in example.as_dict_nlu()['response_tokens'][:-1]]+[' '.join([tok.text for tok in example.as_dict_nlu()['response_tokens']])] would transform a response like I am feeling better into ['I', 'am', 'feeling', 'better', 'I am feeling better']. I assume mean=True would also include the embedding of the whole sentence (the last element of the list) when taking the mean. That can be problematic.

If you share the code for the featurizer that you used in version 1.5.1, I can help you adapt it for 1.9.6.

Also, try using scale_loss=False in the configuration of response selector and see if that helps.

@dakshvar22, thank you very much!

The error was in my code.

My embedder _ft_embedder accepts a list of token lists and returns an embedding vector for each list. For ['I', 'am', 'feeling', 'better', 'I am feeling better'] it should return a separate embedding vector for every element of this list, treating each element as a list of tokens. But all the elements are strings. I forgot to convert them to lists of tokens, so they were treated as lists of characters. Stupid mistake.
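The pitfall described here can be reproduced without any embedding model. The sketch below is only an illustration: `count_units` is a hypothetical stand-in for `_ft_embedder` (whose real signature is not shown in this thread) that reports how many units it would embed for each item in the batch.

```python
def count_units(batch):
    """Hypothetical stand-in for an embedder that expects a list of token lists.

    Returns, for each item in the batch, the number of units it iterates over.
    """
    return [len(list(item)) for item in batch]


tokens = ["I", "am", "feeling", "better"]
sentence = " ".join(tokens)

# Buggy input: every element is a plain string, so iterating over an
# element yields characters rather than tokens.
buggy_batch = tokens + [sentence]

# Fixed input: every element is a list of tokens.
fixed_batch = [[tok] for tok in tokens] + [tokens]

print(count_units(buggy_batch))  # [1, 2, 7, 6, 19] -> character counts
print(count_units(fixed_batch))  # [1, 1, 1, 1, 4]  -> token counts
```

With the buggy batch the stand-in sees 19 "tokens" for the full sentence, one per character, which matches the behaviour described in this post.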

Now ResponseSelector training shows r_acc=0.845, and it grows steadily with the epoch number.

Out of curiosity, are you using a sentence embedding model or a word embedding model? In case you are using a word embedding model, the list to pass to _ft_embedder should be [['I', 'am', 'feeling', 'better']]? Taking the mean over them and setting that as the feature vector seems appropriate then.

Yes, _ft_embedder is an implementation of a word embedding model.

I used it to speed up debugging because it is faster than the embedders I initially used, ELMo and BERT (on RASA 1.5.1). After I figure out how to prepare features for DIET and for ResponseSelector, I will embed ELMo or BERT into my featurizer.

@dakshvar22, am I right that a featurizer should provide DIET and ResponseSelector with lists of vectors (text_features and response_features), where every vector in a list matches a token produced by the tokenizer component, except the last one, which should be a vector for the whole text?

Namely, if I need to prepare features for the utterance 'I am feeling better', should I set the text features to [vector_for(['I']), vector_for(['am']), vector_for(['feeling']), vector_for(['better']), vector_for(['I', 'am', 'feeling', 'better'])]?
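The layout asked about in this question can be sketched in plain Python. This is a minimal illustration, not the actual featurizer: `toy_word_vector`, `embed_token_list` and `build_features` are hypothetical names, the toy lookup stands in for a real word-embedding model, and mean pooling is assumed for the whole-text vector. In a real featurizer the rows would be NumPy arrays rather than lists.

```python
def toy_word_vector(word, dim=4):
    """Hypothetical deterministic stand-in for a word-embedding lookup."""
    base = sum(ord(ch) for ch in word)
    return [float((base * (i + 1)) % 7) for i in range(dim)]


def embed_token_list(tokens, dim=4):
    """Embed a non-empty list of tokens as the mean of their word vectors."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    vectors = [toy_word_vector(tok, dim) for tok in tokens]
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def build_features(tokens, dim=4):
    """Return one vector per token plus a final vector for the whole text."""
    per_token = [embed_token_list([tok], dim) for tok in tokens]
    whole_text = embed_token_list(tokens, dim)
    return per_token + [whole_text]


features = build_features(["I", "am", "feeling", "better"])
print(len(features), len(features[0]))  # 5 4
```

Four tokens give five rows: one per token, followed by the mean-pooled vector for the whole utterance.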

@vitalyuf That is very accurate :slight_smile:

Do try scale_loss=False for response selector and see if it helps.

OK, the results are as follows.

config:

- name: "WhitespaceTokenizer"
- name: "yvi_imports.features.FTDenseFeaturizer"
- name: DIETClassifier
  epochs: 20
- name: "ResponseSelector"
  epochs: 2000

train output:

2020-04-30 14:55:06 INFO rasa.nlu.selectors.response_selector - Retrieval intent parameter was left to its default value. This response selector will be trained on training examples combining all retrieval intents.
Epochs: 100%|█████| 2000/2000 [02:09<00:00, 16.15it/s, t_loss=2.720, r_loss=1.265, r_acc=0.988]

config:

- name: "WhitespaceTokenizer"
- name: "yvi_imports.features.FTDenseFeaturizer"
- name: DIETClassifier
  epochs: 20
- name: "ResponseSelector"
  scale_loss: False
  epochs: 2000

train output:

2020-04-30 15:01:01 INFO rasa.nlu.selectors.response_selector - Retrieval intent parameter was left to its default value. This response selector will be trained on training examples combining all retrieval intents.
Epochs: 100%|█████| 2000/2000 [02:09<00:00, 15.92it/s, t_loss=2.429, r_loss=1.730, r_acc=0.575]

But if I combine several featurizers in one pipeline (RegexFeaturizer, LexicalSyntacticFeaturizer, CountVectorsFeaturizer, "yvi_imports.features.FTDenseFeaturizer"), it looks like adding scale_loss: False gives higher accuracy and lower loss compared to the same multi-featurizer pipeline without scale_loss: False.

I am also facing the same issue. The DIET classifier's training accuracy is 0.98, but the response selector's accuracy is 0.048 after 100 epochs. The dataset has around 20k FAQs. Please guide me as well. How can I increase r_acc?

Me also…

@dakshvar22 I can't find the scale_loss argument in the docs for version 2.6. My problem is the opposite: the accuracy shows as 1, but the loss stays in the range of 5.9 to 6.5. Can you tell me why this is happening?

@evilc3 Do you have a lot of examples which are very similar across different sub-intents of a retrieval intent? Training loss being high is an indication of that. Also, what’s the configuration pipeline that you are using?
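One rough way to check for such overlap is to look for texts that appear under more than one sub-intent. The sketch below is an assumption-laden illustration, not a Rasa utility: the function names and the example data are hypothetical, and it only catches duplicates that become identical after lowercasing and whitespace normalization, so merely similar examples would need a fuzzier comparison.

```python
from collections import defaultdict


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def find_cross_intent_duplicates(examples):
    """Map each normalized text to the sub-intents that use it,
    keeping only texts that occur under more than one sub-intent."""
    seen = defaultdict(set)
    for intent, text in examples:
        seen[normalize(text)].add(intent)
    return {text: intents for text, intents in seen.items() if len(intents) > 1}


# Hypothetical training examples as (retrieval_intent/sub_intent, text) pairs.
examples = [
    ("faq/opening_hours", "When are you open?"),
    ("faq/holiday_hours", "when are you  open?"),
    ("faq/pricing", "How much does it cost?"),
]

duplicates = find_cross_intent_duplicates(examples)
print(duplicates)
```

Any text reported here is labelled with conflicting responses, which the selector cannot separate no matter how long it trains.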