How to train with multiple GPUs and fix low Volatile GPU-Util

Hello everyone, I want to train my model with multiple GPUs. My server has two RTX 2080 Ti GPUs, but training only runs on one card and the Volatile GPU-Util reported by nvidia-smi is very low. How can I fix this?

Hi @anhquan075. You should be able to specify the number of GPUs you have and make sure that your server allocates enough memory for them. You can find more info on that here.
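For reference, here is a minimal sketch using the TF_FORCE_GPU_ALLOW_GROWTH and TF_GPU_MEMORY_ALLOC environment variables that come up later in this thread. The GPU indices and megabyte values are example placeholders only, not recommendations for specific hardware:

```bash
# Let TensorFlow grow GPU memory on demand instead of claiming a whole card up front
export TF_FORCE_GPU_ALLOW_GROWTH=true

# Cap how much memory (in MB) TensorFlow may allocate per visible GPU,
# written as "gpu_index:megabytes" pairs (example values)
export TF_GPU_MEMORY_ALLOC="0:2048, 1:2048"

# Then start training as usual
rasa train -c config.yml --data train.yml
```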

Hi, I tried with the setup below:

export TF_FORCE_GPU_ALLOW_GROWTH=True
export TF_GPU_MEMORY_ALLOC="0:5120"

training cmd: CUDA_VISIBLE_DEVICES=2,3 rasa train -c config.yml --data train.yml

It didn't use device 3 at all. I then reduced the memory size to 3 GB for device 2, but an exception was raised. Why is the training not running on device 3?

export TF_GPU_MEMORY_ALLOC="0:3072"

Exception:

tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
(0) Resource exhausted: OOM when allocating tensor with shape[128,266,1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node rasa_sequence_layer_text/text_encoder/transformer_encoder_layer_1/randomly_connected_dense_11/mul_4 (defined at /lib/python3.6/site-packages/tensorflow_addons/activations/gelu.py:93) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

[[crf/cond/else/_1/crf/cond/concat/_264]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

(1) Resource exhausted: OOM when allocating tensor with shape[128,266,1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node rasa_sequence_layer_text/text_encoder/transformer_encoder_layer_1/randomly_connected_dense_11/mul_4 (defined at /lib/python3.6/site-packages/tensorflow_addons/activations/gelu.py:93) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations. 0 derived errors ignored. [Op:__inference_train_function_35596]

Function call stack: train_function → train_function
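One thing that might be worth checking (an assumption about the setup above, not something confirmed in this thread): with CUDA_VISIBLE_DEVICES=2,3 the selected cards are re-numbered inside the process, so TensorFlow sees them as GPU 0 and GPU 1. A setting like TF_GPU_MEMORY_ALLOC="0:3072" therefore only applies to the first visible card (physical device 2), and the second card never gets an explicit allocation. A sketch that covers both visible devices (sizes are example values only):

```bash
# Physical GPUs 2 and 3 are exposed to TensorFlow as GPU 0 and GPU 1
export TF_FORCE_GPU_ALLOW_GROWTH=true
export TF_GPU_MEMORY_ALLOC="0:5120, 1:5120"   # example sizes in MB, one entry per visible GPU

CUDA_VISIBLE_DEVICES=2,3 rasa train -c config.yml --data train.yml
```

Even with memory configured on both cards, whether the training graph actually gets placed on more than one GPU is a separate question (see the topic linked in the next reply). If the OOM persists on a single card, reducing batch_size for DIETClassifier in config.yml is a common workaround.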

Any progress here? It's the same issue as Run NLU training on multiple GPUs.
