TED fails to overfit

Hello there!

My team recently tried to migrate to Rasa 2.0. We’ve had several issues with TED, chief among them that it doesn’t reach the same metrics (t-loss, loss, accuracy) as the ones we got when training in Rasa 1.10.8.

We’ve tried training TED with various hyperparameter settings (changing dense dimensions, epochs (30-200), and batch sizes) but still got mediocre results (Rasa 1.10: 99% accuracy, Rasa 2.0: 84% accuracy).

Also, I would like to point out that, in theory, the network should eventually overfit. We’ve tried training for 40, 100, and 300 epochs, but the accuracy stays the same.

The policy used (screenshot of the policy configuration attached):
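For reference, the relevant part of our config.yml looked roughly like the sketch below (illustrative values, not the exact ones from the screenshot; we also varied the dense dimensions via the corresponding TEDPolicy parameter):

policies:
  - name: TEDPolicy
    max_history: 5        # started from the default
    epochs: 100           # we tried values between 30 and 200
    batch_size: [8, 32]   # one of several ranges we tried
  - name: MemoizationPolicy
  - name: RulePolicy       # Rasa 2.0 config; the 1.10 config used the equivalent policies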

I will attach some snippets from the training procedure.

I’ve also tried everything in the following thread: TED classifier has lower accuracy after migration to Rasa 2.0.

Thus, I don’t think it’s a hyperparameter tuning problem. Maybe the embedding space isn’t large enough, but I’ve tried changing that as well.

Is something wrong with the implementation of TED in Rasa 2.0?

PS. We’ve reached a higher accuracy (93.4%) by setting max_history to 10 or 13, but this increases training time dramatically. Also, as far as I understand, for shallow conversations (with a depth of 5-7 dialogue turns) this shouldn’t produce better results. Our tests (either automated via rasa test or manual shell + action server testing) are usually 5-7 turns.
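For clarity, that run only changed the history length, roughly:

policies:
  - name: TEDPolicy
    max_history: 10   # also tried 13; accuracy rose to 93.4% but training got much slower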

After some debugging (and after I finally saw that _write_model_summary is only invoked if a tensorboard log directory is set in config.yml) I’ve noticed some differences.
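For anyone who wants to reproduce the summaries below: they come from enabling tensorboard logging on TEDPolicy, something like this (the directory path is just an example):

policies:
  - name: TEDPolicy
    tensorboard_log_directory: ./tensorboard   # example path; the model summary is written here during training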

Rasa 1.10.x

Variables: name (type) [shape]

dialogue_encoder/layer_normalization_2/beta:0 (float32) [128]
dialogue_encoder/layer_normalization_2/gamma:0 (float32) [128]
dialogue_encoder/transformer_encoder_layer/dense_with_sparse_weights_6/bias:0 (float32) [128]
dialogue_encoder/transformer_encoder_layer/dense_with_sparse_weights_6/kernel:0 (float32) [512x128]
dialogue_encoder/transformer_encoder_layer/dense_with_sparse_weights_5/bias:0 (float32) [512]
dialogue_encoder/transformer_encoder_layer/dense_with_sparse_weights_5/kernel:0 (float32) [128x512]
dialogue_encoder/transformer_encoder_layer/layer_normalization_1/beta:0 (float32) [128]
dialogue_encoder/transformer_encoder_layer/layer_normalization_1/gamma:0 (float32) [128]
dialogue_encoder/transformer_encoder_layer/multi_head_attention/dense_with_sparse_weights_4/bias:0 (float32) [128]
dialogue_encoder/transformer_encoder_layer/multi_head_attention/dense_with_sparse_weights_4/kernel:0 (float32) [128x128]
dialogue_encoder/transformer_encoder_layer/multi_head_attention/dense_with_sparse_weights_3/kernel:0 (float32) [128x128]
dialogue_encoder/transformer_encoder_layer/multi_head_attention/dense_with_sparse_weights_2/kernel:0 (float32) [128x128]
dialogue_encoder/transformer_encoder_layer/multi_head_attention/dense_with_sparse_weights_1/kernel:0 (float32) [128x128]
dialogue_encoder/transformer_encoder_layer/layer_normalization/beta:0 (float32) [128]
dialogue_encoder/transformer_encoder_layer/layer_normalization/gamma:0 (float32) [128]
dialogue_encoder/dense_with_sparse_weights/bias:0 (float32) [128]
dialogue_encoder/dense_with_sparse_weights/kernel:0 (float32) [35x128]
embed_label/embed_layer_label/bias:0 (float32) [20]
embed_label/embed_layer_label/kernel:0 (float32) [25x20]
embed_dialogue/embed_layer_dialogue/bias:0 (float32) [20]
embed_dialogue/embed_layer_dialogue/kernel:0 (float32) [128x20]

Total size of variables: 205852

Rasa 2.0.x

Variables: name (type) [shape]

dialogue_encoder/layer_normalization_2/beta:0 (float32) [128]
dialogue_encoder/layer_normalization_2/gamma:0 (float32) [128]
dialogue_encoder/transformer_encoder_layer/dense_with_sparse_weights_6/bias:0 (float32) [128]
dialogue_encoder/transformer_encoder_layer/dense_with_sparse_weights_6/kernel:0 (float32) [512x128]
dialogue_encoder/transformer_encoder_layer/dense_with_sparse_weights_5/bias:0 (float32) [512]
dialogue_encoder/transformer_encoder_layer/dense_with_sparse_weights_5/kernel:0 (float32) [128x512]
dialogue_encoder/transformer_encoder_layer/layer_normalization_1/beta:0 (float32) [128]
dialogue_encoder/transformer_encoder_layer/layer_normalization_1/gamma:0 (float32) [128]
dialogue_encoder/transformer_encoder_layer/multi_head_attention/dense_with_sparse_weights_4/bias:0 (float32) [128]
dialogue_encoder/transformer_encoder_layer/multi_head_attention/dense_with_sparse_weights_4/kernel:0 (float32) [128x128]
dialogue_encoder/transformer_encoder_layer/multi_head_attention/dense_with_sparse_weights_3/kernel:0 (float32) [128x128]
dialogue_encoder/transformer_encoder_layer/multi_head_attention/dense_with_sparse_weights_2/kernel:0 (float32) [128x128]
dialogue_encoder/transformer_encoder_layer/multi_head_attention/dense_with_sparse_weights_1/kernel:0 (float32) [128x128]
dialogue_encoder/transformer_encoder_layer/layer_normalization/beta:0 (float32) [128]
dialogue_encoder/transformer_encoder_layer/layer_normalization/gamma:0 (float32) [128]
dialogue_encoder/dense_with_sparse_weights/bias:0 (float32) [128]
dialogue_encoder/dense_with_sparse_weights/kernel:0 (float32) [100x128]
intent_sentence/bias:0 (float32) [20]
intent_sentence/kernel:0 (float32) [12x20]
action_name_sentence/bias:0 (float32) [20]
action_name_sentence/kernel:0 (float32) [17x20]
ffnn_label_action_name_sentence/hidden_layer_label_action_name_sentence_0/bias:0 (float32) [50]
ffnn_label_action_name_sentence/hidden_layer_label_action_name_sentence_0/kernel:0 (float32) [17x50]
ffnn_intent_sentence/hidden_layer_intent_sentence_0/bias:0 (float32) [50]
ffnn_intent_sentence/hidden_layer_intent_sentence_0/kernel:0 (float32) [20x50]
ffnn_action_name_sentence/hidden_layer_action_name_sentence_0/bias:0 (float32) [50]
ffnn_action_name_sentence/hidden_layer_action_name_sentence_0/kernel:0 (float32) [20x50]
embed_label/embed_layer_label/bias:0 (float32) [20]
embed_label/embed_layer_label/kernel:0 (float32) [50x20]
embed_dialogue/embed_layer_dialogue/bias:0 (float32) [20]
embed_dialogue/embed_layer_dialogue/kernel:0 (float32) [128x20]

Total size of variables: 218292

Thus, there are roughly 12.5k more parameters, all coming from the last section of the listing, which is the part that changed.

Are you sure you use the same stories? In the thread you linked, we found out that there were contradictions and that the number of stories recalled by the memoization policy changed.

Thanks for your reply @Ghostvv!

Yes, I’m almost sure. rasa data validate doesn’t report any contradictions, and the migration from Markdown to YAML was automated, so I think it’s highly unlikely that the stories changed.

The "in data fraction" is the fraction of next actions that Memoization (or Augmented Memoization) predicted. In the forum, it is stated that RulePolicy and MemoizationPolicy have different priorities. Thus, when both of them predict something with confidence 1.0, RulePolicy is the one credited with the prediction and the in data fraction drops.

Also, we’ve appended several stories created via interactive learning to address the problem, and it seems that TED’s accuracy increased marginally.

I’ve also noticed that for the rasa init bot, TED in Rasa 1.10.8 reaches 98% after the 3rd epoch, whereas in Rasa 2.0 it behaves very differently (85.8% after 100 epochs, with large loss spikes up and down).

I just ran rasa init and TED gets to 0.967.

Epochs: 100%|████████████████████████████████████| 100/100 [00:06<00:00, 17.48it/s, t_loss=14.891, loss=14.722, acc=0.967]

I think the fact that it doesn’t reach 1.0 during training is due to dropout. When I run rasa test core, it gives an accuracy of 1.

Thanks again for your reply @Ghostvv.

Sorry for my late response.

Here are the results for training the base core model created by rasa init, so the data are the same.

  • Dark blue is TED with default parameters in Rasa 1.10.8
  • Lighter blue is TED with default parameters in Rasa 2.0.0
  • Green is TED with batch size [8,32], because the default value in Rasa 2.0 differs from the one used in Rasa 1.10.8 (see the config sketch below)
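The green run only overrode the batch size to match the old default, roughly:

policies:
  - name: TEDPolicy
    batch_size: [8, 32]   # the Rasa 1.10.x default; Rasa 2.0 ships with a different default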

TED from Rasa 2.0 is different and produces more erratic losses.

Is there a way to bypass those 13k parameters through config?

Also, from further training, TED seems to produce vastly different results for each new training session.

Thus the results are highly sensitive to weight initialization.

Could you please install the latest Rasa 2.2?

Rasa 2.2 produces roughly the same results as Rasa 2.0.

Light blue is for batch size [8,32]; dark blue is for TED with default parameters.

(screenshot of the training curves attached)

Still, the problem persists. Thanks again for your time and for your quick replies.

Thanks for the analysis. These two issues are probably related: Rasa 2.2.2 core model predictions off · Issue #7658 · RasaHQ/rasa · GitHub and Rasa 2.0.x fails to use stories where the previous events in the current conversation are different to the previous events in a story when making predictions · Issue #7221 · RasaHQ/rasa · GitHub

Hi, I have a theory about why performance might drop. Could you please try the mcfly-ted branch of Rasa with your data?


Hello @Ghostvv.

I will try it ASAP and post the results. I will also try it on the dummy mood bot as a reference for story validity.

Thanks again for your time.

@Ghostvv it seems like it is a lot smoother! Also, rasa test core gave ~99% accuracy.

(training curves screenshot: mcfly_ted_rasa2_2_master)

Dummy bot TED training:

Also, at around epoch 75, the loss suddenly jumps by roughly 10x. That’s a little weird for a mean value.


Amazing! I’ll prepare the PR. Don’t worry about the loss; we have a scale_loss parameter that upscales the loss so that examples that are hard to learn give a stronger signal.


What is a PR? :smiley:

I’ve tagged the mcfly-ted branch as the solution for the time being.

PR is a pull request. That’s how we introduce changes like the ones in mcfly-ted into the main branch and then into a later release.
