Issue training NLU on GPU when adding a few more training samples (3k samples total)

Hey everyone!

I am training my NLU model with the DIETClassifier locally on a GPU, and I started receiving a ResourceExhaustedError (see the full traceback below) after adding a few more training samples (3319 samples in total, 29 intents).

Any idea what I can tweak to avoid this?

It worked fine until today, when I added something like 50 samples.

Thanks for your help! Nicolas

2020-12-08 13:27:47 INFO     rasa.shared.nlu.training_data.training_data  - Number of intent examples: 3319 (29 distinct intents)
...
2020-12-08 13:27:53 INFO     rasa.nlu.model  - Starting to train component DIETClassifier
...
Epochs:  26%|██████████████████████████████████▌                                                                                                | 13/50 [01:05<02:55,  4.76s/it, t_loss=24.248, i_acc=0.895, e_f1=0.770, r_f1=0.000]Traceback (most recent call last):
  File "/home/nicolas/anaconda3/envs/chatbot2.0/bin/rasa", line 8, in <module>
    sys.exit(main())
  File "/home/nicolas/anaconda3/envs/chatbot2.0/lib/python3.7/site-packages/rasa/__main__.py", line 116, in main
    cmdline_arguments.func(cmdline_arguments)
  File "/home/nicolas/anaconda3/envs/chatbot2.0/lib/python3.7/site-packages/rasa/cli/train.py", line 159, in train_nlu
    domain=args.domain,
  File "/home/nicolas/anaconda3/envs/chatbot2.0/lib/python3.7/site-packages/rasa/train.py", line 470, in train_nlu
    domain=domain,
  File "/home/nicolas/anaconda3/envs/chatbot2.0/lib/python3.7/site-packages/rasa/utils/common.py", line 308, in run_in_loop
    result = loop.run_until_complete(f)
  File "uvloop/loop.pyx", line 1456, in uvloop.loop.Loop.run_until_complete
  File "/home/nicolas/anaconda3/envs/chatbot2.0/lib/python3.7/site-packages/rasa/train.py", line 512, in _train_nlu_async
    additional_arguments=additional_arguments,
  File "/home/nicolas/anaconda3/envs/chatbot2.0/lib/python3.7/site-packages/rasa/train.py", line 547, in _train_nlu_with_validated_data
    **additional_arguments,
  File "/home/nicolas/anaconda3/envs/chatbot2.0/lib/python3.7/site-packages/rasa/nlu/train.py", line 114, in train
    interpreter = trainer.train(training_data, **kwargs)
  File "/home/nicolas/anaconda3/envs/chatbot2.0/lib/python3.7/site-packages/rasa/nlu/model.py", line 204, in train
    updates = component.train(working_data, self.config, **context)
  File "/home/nicolas/anaconda3/envs/chatbot2.0/lib/python3.7/site-packages/rasa/nlu/classifiers/diet_classifier.py", line 777, in train
    self.component_config[BATCH_STRATEGY],
  File "/home/nicolas/anaconda3/envs/chatbot2.0/lib/python3.7/site-packages/rasa/utils/tensorflow/models.py", line 206, in fit
    self.train_summary_writer,
  File "/home/nicolas/anaconda3/envs/chatbot2.0/lib/python3.7/site-packages/rasa/utils/tensorflow/models.py", line 381, in _batch_loop
    call_model_function(batch_in)
  File "/home/nicolas/anaconda3/envs/chatbot2.0/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 780, in __call__
    result = self._call(*args, **kwds)
  File "/home/nicolas/anaconda3/envs/chatbot2.0/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 807, in _call
    return self._stateless_fn(*args, **kwds)  # pylint: disable=not-callable
  File "/home/nicolas/anaconda3/envs/chatbot2.0/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 2829, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "/home/nicolas/anaconda3/envs/chatbot2.0/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1848, in _filtered_call
    cancellation_manager=cancellation_manager)
  File "/home/nicolas/anaconda3/envs/chatbot2.0/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1924, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/home/nicolas/anaconda3/envs/chatbot2.0/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 550, in call
    ctx=ctx)
  File "/home/nicolas/anaconda3/envs/chatbot2.0/lib/python3.7/site-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.ResourceExhaustedError:  OOM when allocating tensor with shape[114,137] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[{{node cond_1/else/_115/cond_1/scan/while/body/_1133/cond_1/scan/while/ReduceLogSumExp/Max}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
 [Op:__inference_train_on_batch_15705]

Function call stack:
train_on_batch

Cheers, Nicolas

So, adding batch_size: [32, 96] to the DIETClassifier config fixed the issue (but slowed down training a bit).
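For reference, this is roughly what that looks like in config.yml (only the relevant part of the pipeline; epochs: 50 is what I was already training with, and I believe the default batch_size is [64, 256]):

pipeline:
  # ... tokenizer and featurizers ...
  - name: DIETClassifier
    epochs: 50
    # smaller batches mean smaller padded tensors on the GPU, i.e. less memory
    batch_size: [32, 96]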

Investigating a bit further, I realized that the issue came from one very, very long training sample (1268 characters). Splitting this message into smaller messages or removing it altogether also fixed the issue.

@Tanja Maybe you know this: is there a way to force DIET to limit its input size? That way I would not need to alter the training data (which came from our live bot).

Thanks! Nicolas

Makes sense that the very long training sample is causing issues. Depending on the pipeline you have, the featurized version of this training sample can get quite large.

We don't have any check in place that splits such long training samples. I guess the easiest way to fix this on your side is to add a custom component that checks if a user message exceeds a certain threshold of characters and, if it does, either ignores it or splits it up.
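Something along these lines could do the ignoring part (a rough, untested sketch; the component name and the 1000-character threshold are made up, and the component would have to be listed in the pipeline before the featurizers):

from typing import Any, Optional

from rasa.nlu.components import Component
from rasa.nlu.config import RasaNLUModelConfig
from rasa.shared.nlu.training_data.training_data import TrainingData


class LongExampleFilter(Component):
    """Drops training examples whose text exceeds a character threshold."""

    defaults = {"max_characters": 1000}  # made-up threshold, tune as needed

    def train(
        self,
        training_data: TrainingData,
        config: Optional[RasaNLUModelConfig] = None,
        **kwargs: Any,
    ) -> None:
        max_chars = self.component_config["max_characters"]
        # keep only the examples below the threshold, so the featurizers
        # and DIET never see the very long messages
        training_data.training_examples = [
            example
            for example in training_data.training_examples
            if len(example.get("text") or "") <= max_chars
        ]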


Thanks so much for your quick reply @Tanja! Indeed, I have the RegexFeaturizer, SpacyFeaturizer, and LexicalSyntacticFeaturizer, so I suppose the featurized version gets very large for long messages.

We don't have any check in place that splits such long training samples. I guess the easiest way to fix this on your side is to add a custom component that checks if a user message exceeds a certain threshold of characters and, if it does, either ignores it or splits it up.

Good idea, we could maybe just remove the end if it's too long. What do you mean by "splits it up"? Can we process the message as two (or more) messages (and hence intents) within the pipeline? If that's possible, would they be sent sequentially? That could be interesting for other types of queries too, like multiple questions in one message (our users tend to do that in their very first message).

Good idea, we could maybe just remove the end if it's too long.

Yeah, I think that would be the easiest.

What do you mean by "splits it up"?

The interface for train looks like this:

def train(
    self,
    training_data: TrainingData,
    config: Optional[RasaNLUModelConfig] = None,
    **kwargs: Any,
) -> None:
    ...

So, you have access to the complete training data, which you can also modify; e.g. you could create a new Message and add it to the training_examples. I never tried it before and I'm not sure if someone else did, but theoretically it should be possible to add new examples to the training data object. No guarantee that this really works, though; I guess you simply need to try it out. However, your component needs to come first in the pipeline, otherwise the other components might fail.
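Splitting could then look roughly like this inside such a component (again untested, just to illustrate the idea; split_long_example is a made-up helper and the split on "." is deliberately naive):

from rasa.shared.nlu.training_data.message import Message
from rasa.shared.nlu.training_data.training_data import TrainingData


def split_long_example(training_data: TrainingData, long_example: Message) -> None:
    """Replace one over-long example with one new example per (naive) sentence."""
    text = long_example.get("text") or ""
    intent = long_example.get("intent")
    # naive split on "." just for illustration; each chunk keeps the same intent
    chunks = [part.strip() for part in text.split(".") if part.strip()]
    for chunk in chunks:
        training_data.training_examples.append(Message.build(text=chunk, intent=intent))
    # finally drop the original long example
    training_data.training_examples.remove(long_example)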

Understood, thanks for the hints! I'll probably go for the easy way for now 🙂