TED classifier has lower accuracy after migration to Rasa 2.0

kosniaz · January 18, 2021, 5:28pm

Hello community!

We are migrating to Rasa 2.0 and have observed a rather strange issue. The training stories are the same, and the batch sizes were restored to [8,32], as they are in Rasa 1.10 by default. However, training accuracy has fallen significantly (by around 10%, or worse after the network starts overfitting). We are talking about a change from ~98% (Rasa 1.10) to ~89% (Rasa 2.0). What’s more, we’ve seen some failures when testing our bot that didn’t happen before.

Has anyone else noticed this? Could it be that the accuracy displayed when training Rasa 1.x is not the same value as the one displayed in Rasa 2.0? Have there been other changes to the training of the TED model?

Any comment would be helpful.

Thanks for reading!

Ghostvv · January 19, 2021, 3:30pm

how many epochs do you use?

kosniaz · January 20, 2021, 11:18am

70 epochs in both cases. What really bugs me, though, is that with Rasa 1.10 we got to 90% accuracy after 10 epochs, while now accuracy is lower than 75% in the first 20 epochs. We are still looking into it and will post an update if we have any news.

edit: Our policies in Rasa 1.10.8

policies:
- name: AugmentedMemoizationPolicy
- name: TEDPolicy
  max_history: 5
  epochs: 70
  #evaluate_on_number_of_examples: 0
  #evaluate_every_number_of_epochs: 2
  #tensorboard_log_directory: "./tensorboard"
  #tensorboard_log_level: "epoch"
- name: FallbackPolicy
  nlu_threshold: 0.5
  core_threshold: 0.4
- name: MappingPolicy
- name: FormPolicy

Our policies in Rasa 2.0

policies:
- name: AugmentedMemoizationPolicy
- name: TEDPolicy
  max_history: 5
  epochs: 70
  batch_size: [8,32]
- name: RulePolicy
  core_fallback_threshold: 0.4
  core_fallback_action_name: action_default_fallback

Ghostvv · January 21, 2021, 9:45am

Could you please try 200 epochs? We made some internal changes to TED, so its loss requires more epochs to converge.

kosniaz · January 21, 2021, 12:54pm

Thank you for your input Dr. Vlasov, here are the tensorboard plots from the 200 epochs training.

the parameters used:

- name: TEDPolicy
  max_history: 5
  epochs: 200
  batch_size: [8,32]
  evaluate_on_number_of_examples: 0
  evaluate_every_number_of_epochs: 2
  tensorboard_log_directory: "./tensorboard"
  tensorboard_log_level: "epoch"

We don’t see a big difference. Most probably we’re overfitting after some point. We have set evaluation stories to 0, because we have very few stories in general, just 87. However, we will re-train the model with 10 evaluation examples, and post the results here in half an hour. What changes have been introduced to the TED policy? I’ve only seen the default batch size in the changelog.

Ghostvv · January 21, 2021, 1:01pm

In order to accommodate big datasets, we changed the TED input to be sparse and therefore introduced a couple of dense layers, similar to those in DIET, to process it.

How big is your training data? Does it have contradictions?

Ghostvv · January 21, 2021, 1:03pm

Could you please try adding the dense_dimension: 64 (or even 128) option?

kosniaz · January 21, 2021, 1:38pm

We have rather few stories, 87. We have no contradictions. Setting dense_dimension: 64 or 128 didn’t help; here are the results for 64. With 128 it was slightly better.

Ghostvv · January 21, 2021, 1:53pm

very strange, do you mind sharing the data?

kosniaz · January 21, 2021, 2:17pm

I’m not sure I’m allowed to… Here are some of our stories: sample stories.yml

By the way, I’m not sure if it’s because of the difference in the TED models, but some behaviours don’t generalize very well in Rasa 2.0. Also, core fallbacks happen more frequently, especially for unhappy paths.

Ghostvv · January 21, 2021, 2:35pm

fallback is different, it’s because the confidence distribution shifts.
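A minimal sketch of why a shifted confidence distribution changes how often the core fallback fires. It assumes the fallback triggers whenever the top prediction confidence is below core_fallback_threshold (0.4 in the config above). The confidence values and the fallback_rate helper are hypothetical, invented only for illustration, and the direction of the shift is an assumption.

```python
# Threshold taken from the RulePolicy config posted in this thread.
CORE_FALLBACK_THRESHOLD = 0.4


def fallback_rate(confidences, threshold=CORE_FALLBACK_THRESHOLD):
    """Fraction of predictions whose top confidence is below the threshold."""
    if not confidences:
        return 0.0
    below = sum(1 for c in confidences if c < threshold)
    return below / len(confidences)


# Hypothetical top-action confidences (not measured from any real model).
before_shift = [0.95, 0.80, 0.62, 0.45, 0.38]
# Same ranking of predictions, but every confidence is 0.10 lower (assumed shift).
after_shift = [c - 0.10 for c in before_shift]

print("fallback rate before shift:", fallback_rate(before_shift))
print("fallback rate after shift: ", fallback_rate(after_shift))
```

With a fixed threshold, the same predictions can fall back more often purely because the confidences moved, so the threshold may need re-tuning after the migration.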

Could you run your trained TED policy on your training stories and compare the mistakes?

kosniaz · January 21, 2021, 2:42pm

I can’t think of a fast way to do this. Do you mean removing all policies but TED and then testing each story manually? Or rewriting them as end-to-end tests and running them with rasa test?

Ghostvv · January 21, 2021, 3:41pm

you can remove all policies and run rasa test core with the address to your training stories

kosniaz · January 21, 2021, 4:35pm

Wow, I didn’t know there was such a command. results:

2021-01-21 18:06:41 INFO     rasa.core.test  - Finished collecting predictions.              
2021-01-21 18:06:41 INFO     rasa.core.test  - Evaluation Results on CONVERSATION level:                                                                               
2021-01-21 18:06:41 INFO     rasa.core.test  -  Correct:          66 / 80                                                                                              
2021-01-21 18:06:41 INFO     rasa.core.test  -  F1-Score:         0.904                                                                                                
2021-01-21 18:06:41 INFO     rasa.core.test  -  Precision:        1.000                                                             
2021-01-21 18:06:41 INFO     rasa.core.test  -  Accuracy:         0.825                                                                                                
2021-01-21 18:06:41 INFO     rasa.core.test  -  In-data fraction: 0.832                                                                                                
2021-01-21 18:06:41 INFO     rasa.core.test  - Stories report saved to results/story_report.json.                                                                      
2021-01-21 18:06:42 INFO     rasa.core.test  - Evaluation Results on ACTION level:
2021-01-21 18:06:42 INFO     rasa.core.test  -  Correct:          1240 / 1273
2021-01-21 18:06:42 INFO     rasa.core.test  -  F1-Score:         0.974                                                                                                
2021-01-21 18:06:42 INFO     rasa.core.test  -  Precision:        0.976                                                                                                
2021-01-21 18:06:42 INFO     rasa.core.test  -  Accuracy:         0.974        
2021-01-21 18:06:42 INFO     rasa.core.test  -  In-data fraction: 0.832                                                                                                
2021-01-21 18:06:42 INFO     rasa.utils.plotting  - Confusion matrix, without normalization:                                                                           
[[ 1  0  0 ...  0  0  0]                                                                                                                                               
 [ 0 26  0 ...  0  0  0]                                                                                                                                               
 [ 0  0 50 ...  0  0  0]                                                                                                                                               
 ...                                                                                                                                                                   
 [ 0  0  0 ... 60  0  0]                                                                                                 
 [ 0  0  0 ...  0  7  0]                                                                
 [ 0  0  0 ...  0  0  4]]

For some reason, I see that some stories generated by rasa-interactive were not tested.

As for Rasa 1.10:

2021-01-21 18:19:48 INFO     rasa.core.test  - Finished collecting predictions.
2021-01-21 18:19:48 INFO     rasa.core.test  - Evaluation Results on CONVERSATION level:
2021-01-21 18:19:48 INFO     rasa.core.test  -  Correct:          68 / 76
2021-01-21 18:19:48 INFO     rasa.core.test  -  F1-Score:         0.944
2021-01-21 18:19:48 INFO     rasa.core.test  -  Precision:        1.000
2021-01-21 18:19:48 INFO     rasa.core.test  -  Accuracy:         0.895
2021-01-21 18:19:48 INFO     rasa.core.test  -  In-data fraction: 0.922
2021-01-21 18:19:48 INFO     rasa.core.test  - Evaluation Results on ACTION level:
2021-01-21 18:19:48 INFO     rasa.core.test  -  Correct:          1227 / 1238
2021-01-21 18:19:48 INFO     rasa.core.test  -  F1-Score:         0.991
2021-01-21 18:19:48 INFO     rasa.core.test  -  Precision:        0.991
2021-01-21 18:19:48 INFO     rasa.core.test  -  Accuracy:         0.991
2021-01-21 18:19:48 INFO     rasa.core.test  -  In-data fraction: 0.922
2021-01-21 18:19:48 INFO     rasa.core.test  -  Classification report:

Checking results/failed_test_stories, I found some interesting story contradictions we had missed (undetected by rasa data validate). The accuracy on CONVERSATION level is close to the accuracy shown in the training process.

What is curious, though, is that we have similar contradictions in Rasa 1.10, yet the reported accuracy in training is 99%, which is the accuracy on ACTION level. Could it be that in Rasa 2, the accuracy displayed in the loading bar is the Conversation Accuracy while in Rasa 1, the accuracy is the Action Accuracy?
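As a quick check, the two accuracy levels can be recomputed from the "Correct" counts in the logs above. This is only a sketch: the accuracy helper is a hypothetical name, and the counts are copied from the rasa test core output in this post. A conversation counts as correct only if every action in it is predicted correctly, which is why the conversation-level figure is lower than the action-level one.

```python
def accuracy(correct, total):
    """Plain ratio of correct predictions to all predictions."""
    return correct / total


# (correct, total) pairs copied from the log output in this thread.
results = {
    "Rasa 2.0": {"conversation": (66, 80), "action": (1240, 1273)},
    "Rasa 1.10": {"conversation": (68, 76), "action": (1227, 1238)},
}

for version, levels in results.items():
    for level, (correct, total) in levels.items():
        print(f"{version} {level}: {accuracy(correct, total):.3f}")
```

The recomputed values match the Accuracy lines in both logs (0.825 and 0.974 for Rasa 2.0, 0.895 and 0.991 for Rasa 1.10).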

Ghostvv · January 21, 2021, 4:37pm

Both should be action accuracy. It looks like your data has changed: in 2.0 In-data fraction: 0.832, while in 1.10 In-data fraction: 0.922

kosniaz · January 21, 2021, 4:54pm

Interesting: from test.py I found the definition of In-data fraction:

    """Given a list of action items, returns the fraction of actions
    that were predicted using one of the Memoization policies."""

Ghostvv · January 22, 2021, 9:51am

Yes, the memoization policy is just a tool to measure how much of the data comes directly from the training data. If not all of your training data is predicted by it, it means there are contradictions.
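The definition quoted from test.py can be illustrated with a small sketch. This is not Rasa's implementation: the in_data_fraction helper, the prediction records, and the policy name strings are hypothetical, and the only assumption is that each predicted action records which policy produced it.

```python
def in_data_fraction(predictions):
    """Fraction of predicted actions that came from a Memoization policy."""
    if not predictions:
        return 0.0
    memoized = [
        p for p in predictions
        if p["policy"] and "Memoization" in p["policy"]
    ]
    return len(memoized) / len(predictions)


# Hypothetical prediction records, invented for illustration only.
predictions = [
    {"action": "utter_greet", "policy": "policy_0_AugmentedMemoizationPolicy"},
    {"action": "action_listen", "policy": "policy_0_AugmentedMemoizationPolicy"},
    {"action": "utter_help", "policy": "policy_1_TEDPolicy"},
    {"action": "action_listen", "policy": "policy_0_AugmentedMemoizationPolicy"},
]

print(in_data_fraction(predictions))
```

When testing on the training stories themselves, a value below 1.0 means some actions could not be recalled by memoization.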
