Visualize training metrics in TensorBoard

With Rasa Open Source 1.9, we added support for TensorBoard. TensorBoard provides visualizations and tooling for machine learning experiments. In Rasa Open Source 1.9 we use TensorBoard to visualize the training metrics of our in-house machine learning models, i.e. EmbeddingIntentClassifier, DIETClassifier, ResponseSelector, EmbeddingPolicy, and TEDPolicy. Visualizing training metrics helps you understand whether your model has trained properly. You can, for example, see if you have trained your model long enough, i.e. whether you have specified the correct number of epochs. If you enable the option to evaluate your model every x epochs on a hold-out validation dataset, via the options evaluate_on_number_of_examples and evaluate_every_number_of_epochs, you can also see whether your model generalizes well and does not overfit.

How to enable TensorBoard?

To enable TensorBoard, set the model option tensorboard_log_directory to a valid directory in your config.yml file. You can set this option for EmbeddingIntentClassifier, DIETClassifier, ResponseSelector, EmbeddingPolicy, or TEDPolicy. If a valid directory is provided, the training metrics will be written to that directory during training. By default we write the training metrics after every epoch. If you want to write the training metrics for every training step, i.e. after every minibatch, set the option tensorboard_log_level to "minibatch" instead of "epoch" in your config.yml file.
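
For example, a minimal config.yml entry that enables epoch-level logging for the DIETClassifier could look like the following (the epoch count and directory name here are just placeholders):

pipeline:
  - name: DIETClassifier
    epochs: 100
    tensorboard_log_directory: ".tensorboard"
    tensorboard_log_level: "epoch"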

After you have trained your model, for example via rasa train, all metrics are written to the provided directory. The directory will contain a subdirectory with the model name and another subdirectory with a timestamp. This allows you to reuse the same directory for multiple models and training runs. To start TensorBoard, execute the following command:

tensorboard --logdir <path-to-directory>

Once you open a browser at http://localhost:6006/ you can see the training metrics.
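
For illustration, if you set tensorboard_log_directory to ".tensorboard" (as in the example config below), the layout described above might look roughly like this, with the model name and timestamp coming from your training run:

.tensorboard/
  <model-name>/
    <timestamp>/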

Let’s take a look at an example

The following config was used to train Sara.

pipeline:
  - name: WhitespaceTokenizer
  - name: CRFEntityExtractor
  - name: CountVectorsFeaturizer
    OOV_token: "oov"
    token_pattern: (?u)\b\w+\b
  - name: CountVectorsFeaturizer
    analyzer: "char_wb"
    min_ngram: 1
    max_ngram: 4
  - name: EmbeddingIntentClassifier
    epochs: 50
    ranking_length: 5
    evaluate_on_number_of_examples: 500
    evaluate_every_number_of_epochs: 5
    tensorboard_log_directory: ".tensorboard"
    tensorboard_log_level: "epoch"

As you can see, we specified a TensorBoard log directory. We also specified that we want to evaluate our model every 5 epochs on a hold-out validation dataset. After we trained the model, we can see the training metrics in TensorBoard.
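
To inspect the metrics for this particular run, you would point TensorBoard at the directory from the config and then open http://localhost:6006/ in your browser:

tensorboard --logdir .tensorboard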

The orange curve corresponds to the hold-out validation dataset and the blue curve shows the metrics for the training data (see the legend on the left). If you use, for example, the DIETClassifier, you will see plots for the following metrics: i_loss, i_acc, e_loss, e_f1, and t_loss. "i" is short for intent, "e" for entity, and t_loss shows the total loss.

We might add further support for TensorBoard in the future. Until then, we would love to hear your feedback and ideas about what else we could add to TensorBoard.

Hi @Tanja,

If anyone stumbles upon a blank white page after following your tutorial: this is a known issue, which can be fixed by downgrading tensorboard to 2.0.0.

Thanks for your explanation!

Kind regards
Julian

Hi @Tanja. I followed this blogpost.

In the example shown there, I’m referring to this particular setting:
evaluate_on_number_of_examples: 0
This means that the validation (hold-out) dataset has zero examples (an empty dataset), right? Then how is the orange curve in the following graph plotted?
On what data did it get evaluated, since our validation set is empty?
(Please correct me if my understanding is wrong.)

@Akhil Good catch! You are completely right: there should not be any orange line if we set evaluate_on_number_of_examples: 0. I think the plot in the blog post actually comes from the config shown at the beginning of this thread; it is the same picture. Sorry for the confusion!

Thank you, @Tanja for your quick response and clarification.

Hi @Tanja.

Let’s say I have 3011 intent examples (30 distinct intents) and 224 entity examples (5 distinct entities). In addition, I also have 2 Response Selectors with 1746 examples in total.

  1. What would be a good value for the parameter evaluate_on_number_of_examples?
  2. If I give a value of 300 (~10%), how many intent examples and how many entity examples will it choose for the validation set?
  3. Will it also choose from the responses.md file (response selector examples)?

@Akhil Taking ~10% of the total examples as validation data is a good approach, so ~300 for evaluate_on_number_of_examples sounds like a good number in your case. We randomly pick those examples from the training examples. We guarantee that at least one example per intent will end up in the validation data. We don’t have any guarantee for entities, but normally you will also find a couple of entities in the validation data, especially if your training examples contain quite a lot of entities. If you want to also validate the ResponseSelector during training, you need to set the option evaluate_on_number_of_examples for the ResponseSelector in the config.yml file; it is separate from the DIETClassifier (see the sketch below). Hope that clarifies your questions.
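
For illustration, a sketch of how the two components could be configured independently in config.yml (the epoch counts and example counts below are just placeholders, roughly 10% of the numbers above):

pipeline:
  - name: DIETClassifier
    epochs: 300
    evaluate_on_number_of_examples: 300
    evaluate_every_number_of_epochs: 5
    tensorboard_log_directory: ".tensorboard"
  - name: ResponseSelector
    epochs: 100
    evaluate_on_number_of_examples: 170
    evaluate_every_number_of_epochs: 5
    tensorboard_log_directory: ".tensorboard"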

Thank you, @Tanja, for such a clear explanation 🙂

@Tanja, I have run it for the DIET component with 300 epochs and 300 examples (~10%) as the validation set. These are my graphs. I can’t decide on the number of epochs I should choose. Could you please guide me on this?

Plot settings: smoothing 0; blue = training set, red = validation set.

Metrics shown: entity F1 (entity_f1), entity loss (entity_loss), intent accuracy (i_acc), intent loss (i_loss), total loss (t_loss).

These are my observations:

  1. entity_f1: highest at step 300.
  2. entity_loss: low at steps 54, 130, and 284.
  3. i_acc: high at steps 75 and 260.
  4. i_loss: low at steps 115 and 185.
  5. t_loss: lowest at steps 115 and 300.

How can I decide on the number of epochs if each of the metrics is best at different, far-apart steps?

It seems like the loss is slightly increasing after ~125 epochs, so I would go with that number.

Hi @Tanja. By loss do you mean entity loss?

Loss in general. You see it especially in the entity loss, but the intent loss also tends to go up again (speaking of the validation set). If the training loss keeps decreasing while the validation loss increases, that could indicate overfitting.

Yeah got it. Thank you @Tanja for such a clear and detailed explanation.

Hi Tanja, is there a way to specify the validation dataset instead of letting Rasa decide?

@MartinBrisiak Unfortunately no. However, we have an issue for that (Add possibility to specify validation dataset · Issue #5747 · RasaHQ/rasa · GitHub), so we plan to add it at some point.

Hello,

Is there an alternative to evaluate_on_number_of_examples where we pass a percentage instead of an absolute number?

It would be easier, especially while the chatbot is still in development and the number of intents and examples keeps changing. It would also save us from counting the number of examples we have.
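
In the meantime, one rough way to avoid counting by hand, assuming the NLU training data is in Markdown format in data/nlu.md and every training example is a list item, is to count the list items and take ~10% of the result:

grep -c "^ *- " data/nlu.md

Note that this also counts any synonym or lookup entries written as list items, so treat it only as an estimate.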

Hey @Tanja, is it possible to use TensorBoard when performing cross-validation? There are a couple of issues I’m dealing with.

  1. After performing cross-validation with a config file that has TensorBoard enabled for the DIETClassifier and evaluate_on_number_of_examples: 0, TensorBoard visualizations are generated only for “train”; the “test” scalar graphs are empty. I know it’s the expected behavior, but since cross-validation creates a random train-test split in each fold/run, shouldn’t “test” scalar graphs be generated as well, without having to specify a test set explicitly using the evaluate_on_number_of_examples parameter? (Even if evaluate_on_number_of_examples: 0 is given, shouldn’t scalar graphs be generated for the test set evaluated by cross-validation?)

  2. I noticed that when doing cross-validation, TensorBoard visualizations are created separately for the different runs and folds. Is there a way to get a summarized visualization across all runs/folds? For example, if I run rasa test nlu --cross-validation --runs 2 --folds 2 with TensorBoard enabled for the DIET classifier, 4 different output directories are created rather than one summarized output across all runs/folds.

Let me know if the question is not clear enough. (I didn’t attach the config file because I thought it was irrelevant.)

Up, same question

Hello, when I look at the plots, my test plot (the blue graph shown below) does not cover all of the epochs. Please let me know what the issue might be. I am also attaching a screenshot of my config file. I tried different values for evaluate_on_number_of_examples (100, 200, 300, 400), but the issue is the same: the blue test plot does not run for the whole number of epochs.

@rayvafa I also have the same problem! My validation line stops recording after a few steps. Have you solved this problem?