Running OOM in TEDPolicy with 900 stories (1.10.0)

Hey folks, I was upgrading my Rasa version from 1.4.6 to 1.10.0 and switching to the new TEDPolicy that I now have access to, but I’m running into a memory issue when trying to train a new model. I have 16 GB of memory and haven’t had issues training with the KerasPolicy previously. I’ve noticed that once the Core training kicks in (after tracker processing), memory fills up and then the training session crashes.

This is my current configuration:

config.yml
language: "en"

pipeline:
  - name: ConveRTTokenizer
  - name: ConveRTFeaturizer
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: "char_wb"
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 100
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 100
  - name: DucklingHTTPExtractor
    url: http://duckling:8000
    dimensions:
      - time
      - number
      - phone-number
    locale: en_US
    timezone: America/New_York

policies:
  - name: TwoStageFallbackPolicy
    nlu_threshold: 0.5
    ambiguity_threshold: 0.01
    core_threshold: 0.01
    fallback_core_action_name: action_default_fallback
    fallback_nlu_action_name: flag_conversation_for_review
    deny_suggestion_intent_name: incorrect_intent
  - name: AugmentedMemoizationPolicy
    max_history: 10
  - name: MappingPolicy
  - name: TEDPolicy
    epochs: 300
    max_history: 5
    batch_size: 8
    featurizer:
      - name: MaxHistoryTrackerFeaturizer
        state_featurizer:
          - name: LabelTokenizerSingleStateFeaturizer

I’ve messed around with changing batch_size from a linear range to a single value. When batch_size was set to [64, 128] I got the message below, which I’m guessing is the same issue I’m running into even with a batch_size of 8, although in that case the process is simply killed. I don’t know exactly the internals of how this stuff works, but I’m guessing it’s because it’s trying to load the entire training data set into memory as a single array?
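For reference, the two batch_size variants I tried looked roughly like this; as far as I understand, the list form tells Rasa to grow the batch size linearly over the epochs, while a single value keeps it fixed:

policies:
  - name: TEDPolicy
    batch_size: [64, 128]   # list form: batch size increases linearly across epochs
    # batch_size: 8         # single value: fixed batch size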

The command I was using to train was:

rasa train --data data/interactive data/nlu data/stories --augmentation 0
64-128.log
2020-04-29 18:08:19 INFO     rasa.model  - Data (core-config) for Core model section changed.
2020-04-29 18:08:19 INFO     rasa.model  - Data (nlu-config) for NLU model section changed.
2020-04-29 18:08:19 INFO     rasa.model  - Data (nlg) for NLG templates section changed.
Training Core model...
Processed Story Blocks: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 625/625 [00:00<00:00, 1148.89it/s, # trackers=1]
Processed trackers: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 915/915 [00:00<00:00, 945.76it/s, # actions=3530]
Processed actions: 3530it [00:00, 12246.45it/s, # examples=3448]
Processed trackers: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 915/915 [00:00<00:00, 1132.06it/s, # actions=3530]
2020-04-29 18:08:44.184014: E tensorflow/stream_executor/cuda/cuda_driver.cc:351] failed call to cuInit: UNKNOWN ERROR (303)
Traceback (most recent call last):
  File "/home/kevin/.cache/pypoetry/virtualenvs/venus-jw1ULBKI-py3.7/lib/python3.7/site-packages/tensorflow_core/python/eager/context.py", line 1897, in execution_mode
    yield
  File "/home/kevin/.cache/pypoetry/virtualenvs/venus-jw1ULBKI-py3.7/lib/python3.7/site-packages/tensorflow_core/python/data/ops/iterator_ops.py", line 659, in _next_internal
    output_shapes=self._flat_output_shapes)
  File "/home/kevin/.cache/pypoetry/virtualenvs/venus-jw1ULBKI-py3.7/lib/python3.7/site-packages/tensorflow_core/python/ops/gen_dataset_ops.py", line 2479, in iterator_get_next_sync
    _ops.raise_from_not_ok_status(e, name)
  File "/home/kevin/.cache/pypoetry/virtualenvs/venus-jw1ULBKI-py3.7/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 6606, in raise_from_not_ok_status
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.ResourceExhaustedError: MemoryError: Unable to allocate 17.6 GiB for an array with shape (531559, 1, 10, 889) and data type int32
Traceback (most recent call last):

  File "/home/kevin/.cache/pypoetry/virtualenvs/venus-jw1ULBKI-py3.7/lib/python3.7/site-packages/tensorflow_core/python/ops/script_ops.py", line 236, in __call__
    ret = func(*args)

  File "/home/kevin/.cache/pypoetry/virtualenvs/venus-jw1ULBKI-py3.7/lib/python3.7/site-packages/tensorflow_core/python/data/ops/dataset_ops.py", line 789, in generator_py_func
    values = next(generator_state.get_iterator(iterator_id))

  File "/home/kevin/.cache/pypoetry/virtualenvs/venus-jw1ULBKI-py3.7/lib/python3.7/site-packages/rasa/utils/tensorflow/model_data.py", line 402, in _gen_batch
    data = self._balanced_data(data, batch_size, shuffle)

  File "/home/kevin/.cache/pypoetry/virtualenvs/venus-jw1ULBKI-py3.7/lib/python3.7/site-packages/rasa/utils/tensorflow/model_data.py", line 386, in _balanced_data
    final_data[k].append(np.concatenate(np.array(v)))

MemoryError: Unable to allocate 17.6 GiB for an array with shape (531559, 1, 10, 889) and data type int32


         [[{{node PyFunc}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
 [Op:IteratorGetNextSync]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/kevin/.cache/pypoetry/virtualenvs/venus-jw1ULBKI-py3.7/bin/rasa", line 8, in <module>
    sys.exit(main())
  File "/home/kevin/.cache/pypoetry/virtualenvs/venus-jw1ULBKI-py3.7/lib/python3.7/site-packages/rasa/__main__.py", line 91, in main
    cmdline_arguments.func(cmdline_arguments)
  File "/home/kevin/.cache/pypoetry/virtualenvs/venus-jw1ULBKI-py3.7/lib/python3.7/site-packages/rasa/cli/train.py", line 76, in train
    additional_arguments=extract_additional_arguments(args),
  File "/home/kevin/.cache/pypoetry/virtualenvs/venus-jw1ULBKI-py3.7/lib/python3.7/site-packages/rasa/train.py", line 50, in train
    additional_arguments=additional_arguments,
  File "uvloop/loop.pyx", line 1456, in uvloop.loop.Loop.run_until_complete
  File "/home/kevin/.cache/pypoetry/virtualenvs/venus-jw1ULBKI-py3.7/lib/python3.7/site-packages/rasa/train.py", line 101, in train_async
    additional_arguments,
  File "/home/kevin/.cache/pypoetry/virtualenvs/venus-jw1ULBKI-py3.7/lib/python3.7/site-packages/rasa/train.py", line 188, in _train_async_internal
    additional_arguments=additional_arguments,
  File "/home/kevin/.cache/pypoetry/virtualenvs/venus-jw1ULBKI-py3.7/lib/python3.7/site-packages/rasa/train.py", line 223, in _do_training
    additional_arguments=additional_arguments,
  File "/home/kevin/.cache/pypoetry/virtualenvs/venus-jw1ULBKI-py3.7/lib/python3.7/site-packages/rasa/train.py", line 361, in _train_core_with_validated_data
    additional_arguments=additional_arguments,
  File "/home/kevin/.cache/pypoetry/virtualenvs/venus-jw1ULBKI-py3.7/lib/python3.7/site-packages/rasa/core/train.py", line 66, in train
    agent.train(training_data, **additional_arguments)
  File "/home/kevin/.cache/pypoetry/virtualenvs/venus-jw1ULBKI-py3.7/lib/python3.7/site-packages/rasa/core/agent.py", line 707, in train
    self.policy_ensemble.train(training_trackers, self.domain, **kwargs)
  File "/home/kevin/.cache/pypoetry/virtualenvs/venus-jw1ULBKI-py3.7/lib/python3.7/site-packages/rasa/core/policies/ensemble.py", line 124, in train
    policy.train(training_trackers, domain, **kwargs)
  File "/home/kevin/.cache/pypoetry/virtualenvs/venus-jw1ULBKI-py3.7/lib/python3.7/site-packages/rasa/core/policies/ted_policy.py", line 325, in train
    batch_strategy=self.config[BATCH_STRATEGY],
  File "/home/kevin/.cache/pypoetry/virtualenvs/venus-jw1ULBKI-py3.7/lib/python3.7/site-packages/rasa/utils/tensorflow/models.py", line 126, in fit
    ) = self._get_tf_train_functions(eager, model_data, batch_strategy)
  File "/home/kevin/.cache/pypoetry/virtualenvs/venus-jw1ULBKI-py3.7/lib/python3.7/site-packages/rasa/utils/tensorflow/models.py", line 342, in _get_tf_train_functions
    train_dataset_function, self.train_on_batch, eager, "train"
  File "/home/kevin/.cache/pypoetry/virtualenvs/venus-jw1ULBKI-py3.7/lib/python3.7/site-packages/rasa/utils/tensorflow/models.py", line 324, in _get_tf_call_model_function
    tf_call_model_function(next(iter(init_dataset)))
  File "/home/kevin/.cache/pypoetry/virtualenvs/venus-jw1ULBKI-py3.7/lib/python3.7/site-packages/tensorflow_core/python/data/ops/iterator_ops.py", line 630, in __next__
    return self.next()
  File "/home/kevin/.cache/pypoetry/virtualenvs/venus-jw1ULBKI-py3.7/lib/python3.7/site-packages/tensorflow_core/python/data/ops/iterator_ops.py", line 674, in next
    return self._next_internal()
  File "/home/kevin/.cache/pypoetry/virtualenvs/venus-jw1ULBKI-py3.7/lib/python3.7/site-packages/tensorflow_core/python/data/ops/iterator_ops.py", line 665, in _next_internal
    return structure.from_compatible_tensor_list(self._element_spec, ret)
  File "/usr/lib/python3.7/contextlib.py", line 130, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/kevin/.cache/pypoetry/virtualenvs/venus-jw1ULBKI-py3.7/lib/python3.7/site-packages/tensorflow_core/python/eager/context.py", line 1900, in execution_mode
    executor_new.wait()
  File "/home/kevin/.cache/pypoetry/virtualenvs/venus-jw1ULBKI-py3.7/lib/python3.7/site-packages/tensorflow_core/python/eager/executor.py", line 67, in wait
    pywrap_tensorflow.TFE_ExecutorWaitForAllPendingNodes(self._handle)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: MemoryError: Unable to allocate 17.6 GiB for an array with shape (531559, 1, 10, 889) and data type int32
Traceback (most recent call last):

  File "/home/kevin/.cache/pypoetry/virtualenvs/venus-jw1ULBKI-py3.7/lib/python3.7/site-packages/tensorflow_core/python/ops/script_ops.py", line 236, in __call__
    ret = func(*args)

  File "/home/kevin/.cache/pypoetry/virtualenvs/venus-jw1ULBKI-py3.7/lib/python3.7/site-packages/tensorflow_core/python/data/ops/dataset_ops.py", line 789, in generator_py_func
    values = next(generator_state.get_iterator(iterator_id))

  File "/home/kevin/.cache/pypoetry/virtualenvs/venus-jw1ULBKI-py3.7/lib/python3.7/site-packages/rasa/utils/tensorflow/model_data.py", line 402, in _gen_batch
    data = self._balanced_data(data, batch_size, shuffle)

  File "/home/kevin/.cache/pypoetry/virtualenvs/venus-jw1ULBKI-py3.7/lib/python3.7/site-packages/rasa/utils/tensorflow/model_data.py", line 386, in _balanced_data
    final_data[k].append(np.concatenate(np.array(v)))

MemoryError: Unable to allocate 17.6 GiB for an array with shape (531559, 1, 10, 889) and data type int32


         [[{{node PyFunc}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
8.log
2020-04-29 18:43:05 INFO     rasa.model  - Data (core-config) for Core model section changed.
2020-04-29 18:43:05 INFO     rasa.model  - Data (nlu-config) for NLU model section changed.
2020-04-29 18:43:05 INFO     rasa.model  - Data (nlg) for NLG templates section changed.
Training Core model...
Processed Story Blocks: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 625/625 [00:00<00:00, 1079.35it/s, # trackers=1]
Processed trackers: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 915/915 [00:00<00:00, 941.63it/s, # actions=3530]
Processed actions: 3530it [00:00, 12003.19it/s, # examples=3448]
Processed trackers: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 915/915 [00:01<00:00, 848.67it/s, # actions=2800]
2020-04-29 18:43:31.413820: E tensorflow/stream_executor/cuda/cuda_driver.cc:351] failed call to cuInit: UNKNOWN ERROR (303)
Killed

Not sure if it’s relevant, but I did some digging around and found this issue closed in January where @akelad posted this comment (maybe that issue is back?):

we’ve been experiencing some memory errors ourselves, it might just be that the array it’s about to create would be too big to fit into memory. The point where it breaks is when it’s converting a scipy sparse array into a numpy array – the numpy array is much bigger than the scipy sparse array which is probably what’s causing that.
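To put rough numbers on that, here’s a quick back-of-the-envelope check (plain numpy, not Rasa’s actual code) using the shape reported in the error message. A dense int32 array of that shape needs about 17.6 GiB, which matches the MemoryError, whereas a sparse representation only has to store the non-zero entries:

import numpy as np

# shape and dtype taken from the MemoryError above
shape = (531559, 1, 10, 889)
bytes_per_element = np.dtype(np.int32).itemsize        # 4 bytes per int32

n_elements = int(np.prod(shape, dtype=np.int64))       # 4,725,559,510 elements
dense_gib = n_elements * bytes_per_element / 2 ** 30
print(f"dense array would need ~{dense_gib:.1f} GiB")  # ~17.6 GiB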

I just tested by reverting to the KerasPolicy and training proceeds as it did before. Is there something I’m missing in the setup, or is the TEDPolicy not suited to this number of stories?

@niveK my comment on that issue was about OOM errors with the NLU pipelines. Those specific issues have been addressed now.

As for OOM with TEDPolicy - have you tried setting the max_history parameter to something? By default it will train on the full story history, which means the trackers will be super long. I would suggest trying e.g. 5 as a first step. This is described in the warning here.

As mentioned in my original post, my config.yml already includes a max_history of 5.

What was interesting was that when I removed the featurizer specification from my original configuration (with Keras this was necessary, I believe) and just specified the max_history value on the TEDPolicy itself (below), matching the examples found in the rasa repo, I was able to start the training session without much issue (memory usage hovering around 6 GiB). Could there be some kind of duplication happening when someone specifies the state featurizer explicitly? My gut tells me that specifying it (even redundantly) shouldn’t cause memory usage to blow up.
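For clarity, the simplified policy entry was along these lines, with max_history set directly on the TEDPolicy and no explicit featurizer block (the other values carried over from my original config):

policies:
  - name: TEDPolicy
    epochs: 300
    max_history: 5
    batch_size: 8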

I think the wording of the warning in the docs was a little unclear to me, since it mentions specifying the MaxHistoryTrackerFeaturizer and LabelTokenizerSingleStateFeaturizer (which made me feel I should specify them, as I did with Keras). After checking the source code, it seems those are the defaults when max_history is set. This might have just been a misreading on my part, but I hope it doesn’t trip up others.


Sorry for the late reply. And oops, I missed the max_history setting in your original config because I was expecting it in the MaxHistoryTrackerFeaturizer… I’ll admit our docs page isn’t super clear on how these parameters work, and the way to configure max_history in TED is also not super intuitive. I will open issues for both of these things. To summarise it for now:

  • If you set only the max_history parameter on the TEDPolicy level (without specifying featurizers), that will automatically use the MaxHistoryTrackerFeaturizer with whatever you’ve specified as your max_history

  • If you explicitly specify which featurizer the policy should use, then the max_history needs to be set on the featurizer level like this:

    featurizer:
      - name: MaxHistoryTrackerFeaturizer
        max_history: 5
        state_featurizer:
          - name: LabelTokenizerSingleStateFeaturizer

The two issues for reference: