Very high t_loss, but also high m_acc and i_acc

During training, as described in the topic title, I got the following figures:

Epochs:  41% 122/300 [1:40:08<1:51:25, 37.56s/it, t_loss=102, m_acc=0.992, i_acc=0.999]

How should I interpret a result with such a high t_loss but also high m_acc and i_acc?

Thanks


@mayuanyang1 Can you please share your config.yml file with us?

version: "2.0"
language: ch

pipeline:
  - name: HFTransformersNLP
    model_name: bert
    model_weights: bert-base-chinese
  - name: LanguageModelTokenizer
  - name: LanguageModelFeaturizer
  - name: RegexFeaturizer
    use_word_boundaries: False
    case_sensitive: False
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: char_wb
    min_ngram: 1
    max_ngram: 6
  - name: DIETClassifier
    epochs: 300
    embedding_dimension: 150
    number_of_transformer_layers: 3
    transformer_size: 320
    constrain_similarities: true
    entity_recognition: false
    use_masked_language_model: true
policies:
  - name: MemoizationPolicy
  - name: TEDPolicy
    max_history: 5
    epochs: 100
  - name: RulePolicy

@mayuanyang1 Alright, so you're training the DIETClassifier for 300 epochs. Give me some time and I'll get back to you.

@mayuanyang1 I hope your chat is otherwise working fine for your use case?

Thanks for your help. I have a model with about 900 intents, each with about 30-50 examples. The intent classification doesn't seem to be doing very well; any suggestions?

@mayuanyang1 Always mention me for a fast response :slight_smile: OK?

OK. You have a lot of intents, so please try to reduce their number by merging similar ones; I assume your model is getting overfit or underfit (confused). Also, try deleting all previously trained models, training again, and checking whether that gives you better results.

I'd also recommend around 10-20 examples per intent, which is more than enough for training the model. Providing more is also fine, but it will increase training time a lot.

@mayuanyang1 What is your total model training time?

@nik202 I use Colab; here are the figures:

2022-01-31 05:37:26 INFO     rasa.nlu.featurizers.sparse_featurizer.count_vectors_featurizer  - 30301 vocabulary items were created for text attribute.
2022-01-31 05:37:38 INFO     rasa.nlu.model  - Finished training component.
2022-01-31 05:37:38 INFO     rasa.nlu.model  - Starting to train component CountVectorsFeaturizer
2022-01-31 05:37:41 INFO     rasa.nlu.featurizers.sparse_featurizer.count_vectors_featurizer  - 465804 vocabulary items were created for text attribute.
2022-01-31 05:37:55 INFO     rasa.nlu.model  - Finished training component.
2022-01-31 05:37:55 INFO     rasa.nlu.model  - Starting to train component DIETClassifier
/usr/local/lib/python3.7/dist-packages/rasa/utils/tensorflow/model_data_utils.py:395: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
  np.array([v[0] for v in values]), number_of_dimensions=3
Epochs:  50% 150/300 [1:56:49<1:27:02, 34.82s/it, t_loss=102, m_acc=0.98, i_acc=0.998]  /usr/local/lib/python3.7/dist-packages/rasa/utils/tensorflow/model_data.py:750: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
  np.concatenate(np.array(f)),
Epochs: 100% 300/300 [2:55:35<00:00, 35.12s/it, t_loss=44.4, m_acc=0.995, i_acc=0.999]

I pretty much delete the old model and train from scratch every time. I have tried splitting the intents across 3 models (about 300 intents per model), and the resulting confidence is significantly higher. However, this seems like a workaround to me; it would be good to have just one model. Any suggestion would be appreciated.

Welcome to the forum :slight_smile:

This post on Stack Exchange explains the situation rather well.
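In short: accuracy only checks whether the correct intent is ranked first, while the loss also penalizes low confidence, so both can be high at once. As a toy illustration, assuming a cross-entropy-style intent loss (DIET's exact loss depends on its settings, and with use_masked_language_model: true your t_loss also includes the mask loss, which inflates its absolute value):

$$\text{acc} = \mathbb{1}\left[\arg\max_c p_c = y\right], \qquad \text{loss} = -\log p_y$$

With ~900 intents, the correct intent can be ranked first with, say, $p_y = 0.05$: the prediction counts as correct, yet the per-example loss is $-\ln 0.05 \approx 3$.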


I strongly suggest using TensorBoard with Rasa to visualize training and validation. You can see an example of the config here.

Once trained, open TensorBoard. You will see two curves for each component you enabled TensorBoard for (if you're having problems with intents, you should at least enable it for DIET). One curve is for training, the other is for validation:

(Screenshot: TensorBoard training and validation curves)

Usually, the training curve mainly goes up, while the validation curve starts going down at some point. DIET will have two pairs of curves, one for intents and one for entities.

Note down the first epoch where the validation accuracy for both intents and entities reaches a high enough level, and set the number of epochs for DIET to that number. This avoids overfitting, i.e. having high accuracy on your training set but low accuracy on new data, similar to your case.

In your config, you should set evaluate_on_number_of_examples to about 20% of your training data (e.g., for DIET, if you have 1,000 examples, set it to 200). I wrote this small script to calculate it for you.
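For reference, here is a minimal sketch of the relevant DIETClassifier settings (illustrative values, assuming roughly 1,000 training examples):

  - name: DIETClassifier
    epochs: 300
    evaluate_on_number_of_examples: 200       # ~20% of training data, held out for validation
    evaluate_every_number_of_epochs: 5        # log validation metrics every 5 epochs
    tensorboard_log_directory: "./tensorboard"
    tensorboard_log_level: "epoch"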


Will definitely try them out. Regarding evaluate_on_number_of_examples: I have a not very well balanced set of examples (e.g., some intents might have 30 while others have just 10), so would DIET work out the split automatically?

I don’t have the answer to that, to be honest, but I think it should be evenly split. But then, why does this field have to be a number and not a percentage?

Anyway, I would suggest adding more examples to the intents that do not have many of them.

I also suggest setting random_seed to any value so that you get comparable results between trainings.
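For example, a minimal sketch (the value 42 is arbitrary; any fixed integer works):

  - name: DIETClassifier
    random_seed: 42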

I have tried your suggestions, with a much better result :+1:

Epochs: 100% 300/300 [1:45:35<00:00, 21.12s/it, t_loss=9.07, i_acc=1, val_t_loss=9.03, val_i_acc=0.999]

Is the loss still relatively high?

Glad you got it better :slight_smile:

Mind sharing your config.yml and TensorBoard results to see if I can suggest anything?

I find the loss a bit high, but as long as your bot works as expected, it doesn't matter. In the end, when your bot starts talking to real users, you will have to review those conversations and use them as training data (misclassified intents and unsuccessful stories).

Here is my config. I just upgraded to 3.x with JiebaTokenizer and also with BERT, and it actually produces errors. Any thoughts?

version: "3.0"
language: zh
pipeline:
  - name: JiebaTokenizer
    dictionary_path: ./data/jieba_dict/
  - name: LanguageModelFeaturizer
    # Name of the language model to use
    model_name: "bert"
    # Pre-Trained weights to be loaded
    model_weights: "bert-base-chinese"
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: char_wb
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 300
    constrain_similarities: true
    entity_recognition: false
    evaluate_on_number_of_examples: 6000
    evaluate_every_number_of_epochs: 5
    tensorboard_log_directory: "./tensorboard"
    tensorboard_log_level: "epoch"
    ranking_length: 5
    number_of_negative_examples: 20
policies:
  - name: MemoizationPolicy
  - name: TEDPolicy
    max_history: 5
    epochs: 100
  - name: RulePolicy

Here are the log outputs

2022-02-03 00:34:07 INFO     transformers.modeling_tf_utils  - loading weights file https://cdn.huggingface.co/bert-base-chinese-tf_model.h5 from cache at /root/.cache/torch/transformers/86a460b592673bcac3fe5d858ecf519e4890b4f6eddd1a46a077bd672dee6fe5.e6b974f59b54219496a89fd32be7afb020374df0976a796e5ccd3a1733d31537.h5
2022-02-03 00:34:12 INFO     transformers.modeling_tf_utils  - Layers from pretrained model not used in TFBertModel: ['nsp___cls', 'mlm___cls']
2022-02-03 00:36:21 INFO     rasa.engine.training.hooks  - Restored component 'CountVectorsFeaturizer' from cache.
2022-02-03 00:38:11 INFO     rasa.engine.training.hooks  - Restored component 'CountVectorsFeaturizer' from cache.
2022-02-03 00:40:17 INFO     rasa.engine.training.hooks  - Starting to train component 'DIETClassifier'.
Epochs:   0% 0/300 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/rasa/engine/graph.py", line 458, in __call__
    output = self._fn(self._component, **run_kwargs)
  File "/usr/local/lib/python3.7/dist-packages/rasa/nlu/classifiers/diet_classifier.py", line 919, in train
    shuffle=False,  # we use custom shuffle inside data generator
  File "/usr/local/lib/python3.7/dist-packages/rasa/utils/tensorflow/temp_keras_modules.py", line 181, in fit
    tmp_logs = train_function(iterator)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/def_function.py", line 885, in __call__
    result = self._call(*args, **kwds)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/def_function.py", line 917, in _call
    return self._stateless_fn(*args, **kwds)  # pylint: disable=not-callable
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/function.py", line 3040, in __call__
    filtered_flat_args, captured_inputs=graph_function.captured_inputs)  # pylint: disable=protected-access
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/function.py", line 1964, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/function.py", line 596, in call
    ctx=ctx)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.InvalidArgumentError:  ConcatOp : Dimensions of inputs should match: shape[0] = [64,33,128] vs. shape[1] = [64,32,768]
	 [[node rasa_sequence_layer_text/rasa_feature_combining_layer_text/concatenate_sparse_dense_features_text_sequence/concat (defined at /lib/python3.7/dist-packages/rasa/utils/tensorflow/rasa_layers.py:339) ]] [Op:__inference_train_function_719741]

Errors may have originated from an input operation.
Input Source operations connected to node rasa_sequence_layer_text/rasa_feature_combining_layer_text/concatenate_sparse_dense_features_text_sequence/concat:
 rasa_sequence_layer_text/rasa_feature_combining_layer_text/concatenate_sparse_dense_features_text_sequence/dropout/dropout/Mul_1 (defined at /lib/python3.7/dist-packages/rasa/utils/tensorflow/rasa_layers.py:309)	
 IteratorGetNext (defined at /lib/python3.7/dist-packages/rasa/utils/tensorflow/temp_keras_modules.py:181)

Function call stack:
train_function


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/bin/rasa", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/dist-packages/rasa/__main__.py", line 121, in main
    cmdline_arguments.func(cmdline_arguments)
  File "/usr/local/lib/python3.7/dist-packages/rasa/cli/train.py", line 59, in <lambda>
    train_parser.set_defaults(func=lambda args: run_training(args, can_exit=True))
  File "/usr/local/lib/python3.7/dist-packages/rasa/cli/train.py", line 103, in run_training
    finetuning_epoch_fraction=args.epoch_fraction,
  File "/usr/local/lib/python3.7/dist-packages/rasa/api.py", line 117, in train
    finetuning_epoch_fraction=finetuning_epoch_fraction,
  File "/usr/local/lib/python3.7/dist-packages/rasa/model_training.py", line 171, in train
    **(nlu_additional_arguments or {}),
  File "/usr/local/lib/python3.7/dist-packages/rasa/model_training.py", line 232, in _train_graph
    is_finetuning=is_finetuning,
  File "/usr/local/lib/python3.7/dist-packages/rasa/engine/training/graph_trainer.py", line 105, in train
    graph_runner.run(inputs={PLACEHOLDER_IMPORTER: importer})
  File "/usr/local/lib/python3.7/dist-packages/rasa/engine/runner/dask.py", line 101, in run
    dask_result = dask.get(run_graph, run_targets)
  File "/usr/local/lib/python3.7/dist-packages/dask/local.py", line 558, in get_sync
    **kwargs,
  File "/usr/local/lib/python3.7/dist-packages/dask/local.py", line 496, in get_async
    for key, res_info, failed in queue_get(queue).result():
  File "/usr/lib/python3.7/concurrent/futures/_base.py", line 428, in result
    return self.__get_result()
  File "/usr/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.7/dist-packages/dask/local.py", line 538, in submit
    fut.set_result(fn(*args, **kwargs))
  File "/usr/local/lib/python3.7/dist-packages/dask/local.py", line 234, in batch_execute_tasks
    return [execute_task(*a) for a in it]
  File "/usr/local/lib/python3.7/dist-packages/dask/local.py", line 234, in <listcomp>
    return [execute_task(*a) for a in it]
  File "/usr/local/lib/python3.7/dist-packages/dask/local.py", line 225, in execute_task
    result = pack_exception(e, dumps)
  File "/usr/local/lib/python3.7/dist-packages/dask/local.py", line 220, in execute_task
    result = _execute_task(task, data)
  File "/usr/local/lib/python3.7/dist-packages/dask/core.py", line 119, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
  File "/usr/local/lib/python3.7/dist-packages/rasa/engine/graph.py", line 467, in __call__
    ) from e
rasa.engine.exceptions.GraphComponentException: Error running graph component for node train_DIETClassifier4.

I don't know how to help with this; please create a new topic, as this is now different from the initial question.

@mayuanyang1 Can you please close this thread if you got the solution here? :slight_smile:
