I am using Rasa version 1.3.6. The training data size is around 5.6 MB, with 1412 intents. I am using tensorflow-gpu 1.14.0, and I also tried plain tensorflow 1.14.0 as an experiment.
My config.yml is as follows:
pipeline:
- name: WhitespaceTokenizer
- name: CRFEntityExtractor
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 4
  max_ngram: 6
- name: EmbeddingIntentClassifier
  epochs: 150
policies:
- name: MemoizationPolicy
  max_history: 1
- name: FallbackPolicy
  nlu_threshold: 0.8
  core_threshold: 0.3
- name: FormPolicy
- name: MappingPolicy
When I run training with this configuration, I get the following exception:
MemoryError: Unable to allocate array with shape (168287, 38324) and data type int64
[[{{node PyFunc}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[IteratorGetNext]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[Shape/_7]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
(1) Resource exhausted: MemoryError: Unable to allocate array with shape (168287, 38324) and data type int64
Traceback (most recent call last):
  File "/home/ubuntu/rasa_env/lib/python3.6/site-packages/tensorflow/python/ops/script_ops.py", line 209, in __call__
    ret = func(*args)
  File "/home/ubuntu/rasa_env/lib/python3.6/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 514, in generator_py_func
    values = next(generator_state.get_iterator(iterator_id))
  File "/home/ubuntu/rasa_env/lib/python3.6/site-packages/rasa/utils/train_utils.py", line 201, in gen_batch
    session_data = balance_session_data(session_data, batch_size, shuffle)
  File "/home/ubuntu/rasa_env/lib/python3.6/site-packages/rasa/utils/train_utils.py", line 184, in balance_session_data
    Y=np.concatenate(new_Y),
  File "<__array_function__ internals>", line 6, in concatenate
MemoryError: Unable to allocate array with shape (168287, 38324) and data type int64
[[{{node PyFunc}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[IteratorGetNext]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
0 successful operations.
0 derived errors ignored.
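As a rough sanity check (my own arithmetic, not from the Rasa docs), the array in the error message is simply too large to ever materialize densely:

# Dense int64 array of shape (168287, 38324), as reported in the MemoryError
rows, cols, itemsize = 168287, 38324, 8
print(rows * cols * itemsize / 1024 ** 3)  # ~48 GiB

Note that the failing np.concatenate happens in balance_session_data, i.e. this is ordinary host RAM being exhausted while batching, before anything reaches the GPU, which would also explain why adding GPUs does not help here.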
I tried a different analyzer:
- name: CountVectorsFeaturizer
  analyzer: word
In this case training completed successfully. Training time was around 2.5 hours on tensorflow and 1.25 hours on tensorflow-gpu, with a single GPU.
I also tried with more GPUs (4 GPUs, 12 GB of memory each), but the issue still occurs with the char_wb analyzer: TensorFlow uses only device:0, and no other GPU gets utilized.
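For reference, which devices TensorFlow can see at all can be checked with the standard TF 1.x device listing (nothing Rasa-specific):

import tensorflow as tf
from tensorflow.python.client import device_lib

# Prints every CPU/GPU device visible to this TensorFlow build (TF 1.14)
print(device_lib.list_local_devices())
print("GPU available:", tf.test.is_gpu_available())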
I even tried the following settings (batch_strategy on the EmbeddingIntentClassifier, max_features on the CountVectorsFeaturizer):

batch_strategy: sequence
max_features: 10000
The data size is not that big, but I am still facing this issue. Has anybody else faced the same problem? How can I force TensorFlow to use all available GPU devices?
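The only device control I am aware of in TF 1.x is masking GPUs via CUDA_VISIBLE_DEVICES before TensorFlow initializes; as far as I understand, this only selects which devices are visible and does not spread a single training run across them:

import os

# Must be set before TensorFlow is first imported/initialized.
# Makes GPUs 0-3 visible to TF; ops still land on device:0 unless the model
# code itself distributes the graph (e.g. via tf.distribute in TF 1.14).
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"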