Run NLU training on multiple GPUs

Hello Rasa team,

I’m wondering whether Rasa 3.1 officially supports NLU training on multiple GPUs.

I have a VM with 4 x Tesla K80 GPUs. I tried to run NLU training on that VM in a Docker container (tensorflow:2.7.3-gpu) with Rasa 3.1 installed.

According to the log, the 4 GPUs are identified correctly, but only one of them is actually used by the training task.
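(As a sanity check, this is the kind of snippet I used to confirm TensorFlow sees all four cards inside the container — just an illustration, not Rasa code:)

```python
import tensorflow as tf

# List the physical GPU devices TensorFlow can see in this environment.
gpus = tf.config.list_physical_devices("GPU")
print(f"{len(gpus)} GPU(s) visible:")
for gpu in gpus:
    print(" ", gpu.name)
```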

2022-06-01 08:05:06.964337: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA

To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.

2022-06-01 08:05:09.037665: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.

2022-06-01 08:05:09.037751: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 10792 MB memory: -> device: 0, name: Tesla K80, pci bus id: 782d:00:00.0, compute capability: 3.7

2022-06-01 08:05:09.040616: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.

2022-06-01 08:05:09.040665: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 10792 MB memory: -> device: 1, name: Tesla K80, pci bus id: 9072:00:00.0, compute capability: 3.7

2022-06-01 08:05:09.041878: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.

2022-06-01 08:05:09.041911: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 10792 MB memory: -> device: 2, name: Tesla K80, pci bus id: a530:00:00.0, compute capability: 3.7

2022-06-01 08:05:09.043083: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.

2022-06-01 08:05:09.043117: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:3 with 10792 MB memory: -> device: 3, name: Tesla K80, pci bus id: b5f3:00:00.0, compute capability: 3.7

All model checkpoint layers were used when initializing TFBertModel.

Only one GPU is actually in use.

I tried enabling/disabling TF_GPU_MEMORY_ALLOC and TF_FORCE_GPU_ALLOW_GROWTH as suggested in “Tuning Your NLU Model”.
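(Concretely, what I tried looks roughly like this — the per-GPU memory values are just examples:)

```shell
# Pre-allocate a fixed amount of memory (in MiB) per GPU index...
export TF_GPU_MEMORY_ALLOC="0:2048, 1:2048, 2:2048, 3:2048"
# ...or let TensorFlow grow its GPU memory usage on demand instead
export TF_FORCE_GPU_ALLOW_GROWTH=true
rasa train nlu
```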

The result is always the same: only one GPU is used.

Also, I read all the similar topics in this forum and found I’m not the only one having this issue, but I couldn’t find a solution in any of them.

Does Rasa 3.1 support training on multiple GPUs?


Thanks for this post :wink: I actually have the exact same issue. Looking forward to some feedback from the Rasa team. Did you manage to figure it out?

For me it’s always the same across different commands, for example:

CUDA_VISIBLE_DEVICES=0,1,2,3 rasa train
TF_GPU_MEMORY_ALLOC="0:2048, 1:2048, 2:2048, 3:2048" rasa train

As you can see in nvidia-smi, only one GPU is utilized and no speed-up can be observed :confused:


+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.129.06   Driver Version: 470.129.06   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000001:00:00.0 Off |                    0 |
| N/A   41C    P8    29W / 149W |  10456MiB / 11441MiB |     57%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           On   | 00000002:00:00.0 Off |                    0 |
| N/A   32C    P8    38W / 149W |    148MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           On   | 00000003:00:00.0 Off |                    0 |
| N/A   39C    P8    27W / 149W |    148MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           On   | 00000004:00:00.0 Off |                    0 |
| N/A   32C    P8    32W / 149W |    148MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I’m currently trying to modify Rasa’s internal code to force MirroredStrategy() during training, but no luck yet.
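The idea is roughly the following (a standalone sketch of the tf.distribute.MirroredStrategy pattern, not Rasa’s actual training code; it falls back to the CPU when no GPU is present):

```python
import numpy as np
import tensorflow as tf

# MirroredStrategy replicates the model on every visible GPU and splits
# each batch across the replicas; without GPUs it runs on the CPU.
strategy = tf.distribute.MirroredStrategy()
print("replicas in sync:", strategy.num_replicas_in_sync)

# Anything that creates variables (model, optimizer) must be built
# inside the strategy scope.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# Dummy data, just to show that fit() runs under the strategy.
x = np.random.rand(64, 8).astype("float32")
y = np.random.rand(64, 1).astype("float32")
history = model.fit(x, y, epochs=1, batch_size=16, verbose=0)
```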

It’s very disappointing that such a basic feature for ML use cases (training on multiple GPUs) causes so much trouble.


No luck yet.

As far as I know, the DIETClassifier does not support multiple GPUs.

Since DIETClassifier is a Keras model, we may need to override its train() function to make it compatible with multi-GPU training. You can find some examples here:
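For illustration, here is what a subclassed Keras model with a custom training step looks like when run under MirroredStrategy (a toy stand-in, not DIETClassifier itself — Keras wraps train_step in strategy.run() and aggregates the gradients across replicas):

```python
import numpy as np
import tensorflow as tf

class TinyModel(tf.keras.Model):
    """Minimal stand-in for a subclassed model such as DIETClassifier
    (hypothetical -- not Rasa's actual class)."""

    def __init__(self):
        super().__init__()
        self.dense = tf.keras.layers.Dense(1)

    def call(self, inputs):
        return self.dense(inputs)

    def train_step(self, data):
        # A custom train_step still works under MirroredStrategy: Keras
        # runs it on each replica and combines the reported metrics.
        x, y = data
        with tf.GradientTape() as tape:
            pred = self(x, training=True)
            loss = tf.reduce_mean(tf.square(y - pred))  # plain MSE
        grads = tape.gradient(loss, self.trainable_variables)
        self.optimizer.apply_gradients(zip(grads, self.trainable_variables))
        return {"loss": loss}

strategy = tf.distribute.MirroredStrategy()  # falls back to CPU if no GPU
with strategy.scope():
    model = TinyModel()
    model.compile(optimizer="adam")

x = np.random.rand(32, 4).astype("float32")
y = np.random.rand(32, 1).astype("float32")
history = model.fit(x, y, epochs=1, batch_size=8, verbose=0)
```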

I haven’t had much time to dig into this recently.