How to train with multiple GPUs and fix low Volatile GPU-Util

Hello everyone, I want to train my model with multiple GPUs. My server has two RTX 2080 Ti GPUs, but training only runs on one card and the Volatile GPU-Util reported by nvidia-smi is very low. How can I fix this?

Hi @anhquan075. You should be able to specify the number of GPUs you have and make sure that your server allocates enough memory for them. You can find more info on that here.
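For reference, here is a minimal sketch using the TF_FORCE_GPU_ALLOW_GROWTH and TF_GPU_MEMORY_ALLOC environment variables that come up later in this thread. The GPU indices and megabyte values are example placeholders only, not recommendations for specific hardware:

```bash
# Let TensorFlow grow GPU memory on demand instead of claiming a whole card up front
export TF_FORCE_GPU_ALLOW_GROWTH=true

# Cap how much memory (in MB) TensorFlow may allocate per visible GPU,
# written as "gpu_index:megabytes" pairs (example values)
export TF_GPU_MEMORY_ALLOC="0:2048, 1:2048"

# Then start training as usual
rasa train -c config.yml --data train.yml
```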

Hi, I tried with the setup below:

export TF_FORCE_GPU_ALLOW_GROWTH=True
export TF_GPU_MEMORY_ALLOC="0:5120"

training cmd: CUDA_VISIBLE_DEVICES=2,3 rasa train -c config.yml --data train.yml

It didn't use device 3 at all. I then reduced the memory size to 3 GB for device 2, but an exception was raised. Why is the training not running on device 3?

export TF_GPU_MEMORY_ALLOC="0:3072"

Exception:

tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
(0) Resource exhausted: OOM when allocating tensor with shape[128,266,1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node rasa_sequence_layer_text/text_encoder/transformer_encoder_layer_1/randomly_connected_dense_11/mul_4 (defined at /lib/python3.6/site-packages/tensorflow_addons/activations/gelu.py:93) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

[[crf/cond/else/_1/crf/cond/concat/_264]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

(1) Resource exhausted: OOM when allocating tensor with shape[128,266,1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node rasa_sequence_layer_text/text_encoder/transformer_encoder_layer_1/randomly_connected_dense_11/mul_4 (defined at /lib/python3.6/site-packages/tensorflow_addons/activations/gelu.py:93) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations. 0 derived errors ignored. [Op:__inference_train_function_35596]

Function call stack: train_function → train_function
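One thing that might be worth checking (an assumption about the setup above, not something confirmed in this thread): with CUDA_VISIBLE_DEVICES=2,3 the selected cards are re-numbered inside the process, so TensorFlow sees them as GPU 0 and GPU 1. A setting like TF_GPU_MEMORY_ALLOC="0:3072" therefore only applies to the first visible card (physical device 2), and the second card never gets an explicit allocation. A sketch that covers both visible devices (sizes are example values only):

```bash
# Physical GPUs 2 and 3 are exposed to TensorFlow as GPU 0 and GPU 1
export TF_FORCE_GPU_ALLOW_GROWTH=true
export TF_GPU_MEMORY_ALLOC="0:5120, 1:5120"   # example sizes in MB, one entry per visible GPU

CUDA_VISIBLE_DEVICES=2,3 rasa train -c config.yml --data train.yml
```

Even with memory configured on both cards, whether the training graph actually gets placed on more than one GPU is a separate question (see the topic linked in the next reply). If the OOM persists on a single card, reducing batch_size for DIETClassifier in config.yml is a common workaround.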

Any progress here? It's the same issue as Run NLU training on multiple GPUs.
