TED fails to overfit

Hello there!

My team recently tried to migrate to Rasa 2.0. We’ve had several issues with TED, chief among them that it doesn’t reach the same metrics (t-loss, loss, accuracy) as the ones we got when training in Rasa 1.10.8.

We’ve tried training TED with various hyperparameter settings (changing dense dimensions, epochs (30-200), and batch sizes) but still got mediocre results (Rasa 1.10: 99% accuracy, Rasa 2.0: 84% accuracy).

Also, I would like to point out that, in theory, the network should eventually overfit. We’ve tried training for 40, 100, and 300 epochs, but the accuracy stays the same.

The policy used (screenshot of the policy configuration attached):
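For reference, the relevant part of our config.yml looked roughly like the sketch below (illustrative values, not the exact ones from the screenshot; we also varied the dense dimensions via the corresponding TEDPolicy parameter):

policies:
  - name: TEDPolicy
    max_history: 5        # started from the default
    epochs: 100           # we tried values between 30 and 200
    batch_size: [8, 32]   # one of several ranges we tried
  - name: MemoizationPolicy
  - name: RulePolicy       # Rasa 2.0 config; the 1.10 config used the equivalent policies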

I will attach some snippets from the training procedure.

I’ve also tried everything in the following thread: TED classifier has lower accuracy after migration to Rasa 2.0.

Thus, I don’t think it’s a hyperparameter tuning problem. Maybe the embedding space isn’t large enough, but I’ve tried changing that as well.

Is something wrong with the implementation of TED in Rasa 2.0?

PS. We’ve reached a higher accuracy (93.4%) by setting max_history to 10 or 13, but this increases training time dramatically. Also, as far as I understand, for shallow conversations (with a depth of 5-7 dialogue turns) this shouldn’t produce better results. Our tests (either automated via rasa test or manual shell + action server testing) are usually 5-7 turns.
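For clarity, that run only changed the history length, roughly:

policies:
  - name: TEDPolicy
    max_history: 10   # also tried 13; accuracy rose to 93.4% but training got much slower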

After some debugging (and after I finally saw that _write_model_summary is only invoked if a tensorboard log directory is set in config.yml) I’ve noticed some differences.
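For anyone who wants to reproduce the summaries below: they come from enabling tensorboard logging on TEDPolicy, something like this (the directory path is just an example):

policies:
  - name: TEDPolicy
    tensorboard_log_directory: ./tensorboard   # example path; the model summary is written here during training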

Rasa 1.10.x

Variables: name (type) [shape]

dialogue_encoder/layer_normalization_2/beta:0 (float32) [128]
dialogue_encoder/layer_normalization_2/gamma:0 (float32) [128]
dialogue_encoder/transformer_encoder_layer/dense_with_sparse_weights_6/bias:0 (float32) [128]
dialogue_encoder/transformer_encoder_layer/dense_with_sparse_weights_6/kernel:0 (float32) [512x128]
dialogue_encoder/transformer_encoder_layer/dense_with_sparse_weights_5/bias:0 (float32) [512]
dialogue_encoder/transformer_encoder_layer/dense_with_sparse_weights_5/kernel:0 (float32) [128x512]
dialogue_encoder/transformer_encoder_layer/layer_normalization_1/beta:0 (float32) [128]
dialogue_encoder/transformer_encoder_layer/layer_normalization_1/gamma:0 (float32) [128]
dialogue_encoder/transformer_encoder_layer/multi_head_attention/dense_with_sparse_weights_4/bias:0 (float32) [128]
dialogue_encoder/transformer_encoder_layer/multi_head_attention/dense_with_sparse_weights_4/kernel:0 (float32) [128x128]
dialogue_encoder/transformer_encoder_layer/multi_head_attention/dense_with_sparse_weights_3/kernel:0 (float32) [128x128]
dialogue_encoder/transformer_encoder_layer/multi_head_attention/dense_with_sparse_weights_2/kernel:0 (float32) [128x128]
dialogue_encoder/transformer_encoder_layer/multi_head_attention/dense_with_sparse_weights_1/kernel:0 (float32) [128x128]
dialogue_encoder/transformer_encoder_layer/layer_normalization/beta:0 (float32) [128]
dialogue_encoder/transformer_encoder_layer/layer_normalization/gamma:0 (float32) [128]
dialogue_encoder/dense_with_sparse_weights/bias:0 (float32) [128]
dialogue_encoder/dense_with_sparse_weights/kernel:0 (float32) [35x128]
embed_label/embed_layer_label/bias:0 (float32) [20]
embed_label/embed_layer_label/kernel:0 (float32) [25x20]
embed_dialogue/embed_layer_dialogue/bias:0 (float32) [20]
embed_dialogue/embed_layer_dialogue/kernel:0 (float32) [128x20]

Total size of variables: 205852

Rasa 2.0.x

Variables: name (type) [shape]

dialogue_encoder/layer_normalization_2/beta:0 (float32) [128]
dialogue_encoder/layer_normalization_2/gamma:0 (float32) [128]
dialogue_encoder/transformer_encoder_layer/dense_with_sparse_weights_6/bias:0 (float32) [128]
dialogue_encoder/transformer_encoder_layer/dense_with_sparse_weights_6/kernel:0 (float32) [512x128]
dialogue_encoder/transformer_encoder_layer/dense_with_sparse_weights_5/bias:0 (float32) [512]
dialogue_encoder/transformer_encoder_layer/dense_with_sparse_weights_5/kernel:0 (float32) [128x512]
dialogue_encoder/transformer_encoder_layer/layer_normalization_1/beta:0 (float32) [128]
dialogue_encoder/transformer_encoder_layer/layer_normalization_1/gamma:0 (float32) [128]
dialogue_encoder/transformer_encoder_layer/multi_head_attention/dense_with_sparse_weights_4/bias:0 (float32) [128]
dialogue_encoder/transformer_encoder_layer/multi_head_attention/dense_with_sparse_weights_4/kernel:0 (float32) [128x128]
dialogue_encoder/transformer_encoder_layer/multi_head_attention/dense_with_sparse_weights_3/kernel:0 (float32) [128x128]
dialogue_encoder/transformer_encoder_layer/multi_head_attention/dense_with_sparse_weights_2/kernel:0 (float32) [128x128]
dialogue_encoder/transformer_encoder_layer/multi_head_attention/dense_with_sparse_weights_1/kernel:0 (float32) [128x128]
dialogue_encoder/transformer_encoder_layer/layer_normalization/beta:0 (float32) [128]
dialogue_encoder/transformer_encoder_layer/layer_normalization/gamma:0 (float32) [128]
dialogue_encoder/dense_with_sparse_weights/bias:0 (float32) [128]
dialogue_encoder/dense_with_sparse_weights/kernel:0 (float32) [100x128]
intent_sentence/bias:0 (float32) [20]
intent_sentence/kernel:0 (float32) [12x20]
action_name_sentence/bias:0 (float32) [20]
action_name_sentence/kernel:0 (float32) [17x20]
ffnn_label_action_name_sentence/hidden_layer_label_action_name_sentence_0/bias:0 (float32) [50]
ffnn_label_action_name_sentence/hidden_layer_label_action_name_sentence_0/kernel:0 (float32) [17x50]
ffnn_intent_sentence/hidden_layer_intent_sentence_0/bias:0 (float32) [50]
ffnn_intent_sentence/hidden_layer_intent_sentence_0/kernel:0 (float32) [20x50]
ffnn_action_name_sentence/hidden_layer_action_name_sentence_0/bias:0 (float32) [50]
ffnn_action_name_sentence/hidden_layer_action_name_sentence_0/kernel:0 (float32) [20x50]
embed_label/embed_layer_label/bias:0 (float32) [20]
embed_label/embed_layer_label/kernel:0 (float32) [50x20]
embed_dialogue/embed_layer_dialogue/bias:0 (float32) [20]
embed_dialogue/embed_layer_dialogue/kernel:0 (float32) [128x20]

Total size of variables: 218292

Thus, there are roughly 12.5k more parameters, all coming from the last section of the listing, which is the part that changed.

Are you sure you use the same stories? In the thread you linked, we found out that there were contradictions and that the number of stories recalled by the memoization policy changed.

Thanks for your reply @Ghostvv!

Yes, I’m almost sure. rasa data validate doesn’t report any contradictions, and the migration from Markdown to YAML was automated, so I think it’s highly unlikely that the stories changed.

The "in data fraction" is the fraction of next actions that Memoization (or Augmented Memoization) predicted. In the forum, it is stated that RulePolicy and MemoizationPolicy have different priorities. Thus, when both of them predict something with confidence 1.0, RulePolicy is the one credited with the prediction and the in data fraction drops.

Also, we’ve appended several stories created via interactive learning to address the problem, and it seems that TED’s accuracy increased marginally.

I’ve also noticed that for the rasa init bot, TED in Rasa 1.10.8 reaches 98% after the 3rd epoch, whereas in Rasa 2.0 it behaves very differently (85.8% after 100 epochs, with large loss spikes up and down).

I just ran rasa init and TED gets to 0.967.

Epochs: 100%|████████████████████████████████████| 100/100 [00:06<00:00, 17.48it/s, t_loss=14.891, loss=14.722, acc=0.967]

I think the fact that it doesn’t reach 1.0 during training is due to dropout. When I run rasa test core, it gives an accuracy of 1.

Thanks again for your reply @Ghostvv.

Sorry for my late response.

Here are the results for training the base core model created by rasa init, so the data are the same.

  • Dark blue is TED with default parameters in Rasa 1.10.8
  • Lighter blue is TED with default parameters in Rasa 2.0.0
  • Green is TED with batch size [8,32], because the default value in Rasa 2.0 differs from the one used in Rasa 1.10.8 (see the config sketch below)
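The green run only overrode the batch size to match the old default, roughly:

policies:
  - name: TEDPolicy
    batch_size: [8, 32]   # the Rasa 1.10.x default; Rasa 2.0 ships with a different default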

TED from Rasa 2.0 is different and produces more erratic losses.

Is there a way to bypass those 13k parameters through config?

Also, from further training, TED seems to produce vastly different results for each new training session.

Thus the results are highly sensitive to weight initialization.

Could you please install the latest Rasa 2.2?

Rasa 2.2 produces roughly the same results as Rasa 2.0.

Light blue is for batch size [8,32]; dark blue is for TED with default parameters.

(screenshot of the training curves attached)

Still, the problem persists. Thanks again for your time and for your quick replies.

Thanks for the analysis. These two issues are probably related: Rasa 2.2.2 core model predictions off · Issue #7658 · RasaHQ/rasa · GitHub and Rasa 2.0.x fails to use stories where the previous events in the current conversation are different to the previous events in a story when making predictions · Issue #7221 · RasaHQ/rasa · GitHub

Hi, I have a theory about why performance might drop. Could you please try the mcfly-ted branch of Rasa with your data?


Hello @Ghostvv.

I will try it ASAP and post the results. I will also try it on the dummy mood bot as a reference for story validity.

Thanks again for your time.

@Ghostvv it seems like it is a lot smoother! Also, rasa test core gave ~99% accuracy.

(training curves screenshot: mcfly_ted_rasa2_2_master)

Dummy bot TED training:

Also, at around epoch 75, the loss suddenly jumps by roughly 10x. That’s a little weird for a mean value.


Amazing! I’ll prepare the PR. Don’t worry about the loss; we have a scale_loss parameter that upscales the loss so that examples that are hard to learn give a stronger signal.


What is a PR? :smiley:

I’ve tagged the mcfly-ted branch as the solution for the time being.

PR is a pull request. That’s how we introduce changes like the ones in mcfly-ted into the main branch and then into a later release.
