Run NLU training on multiple GPUs

Hello Rasa team,

I’m wondering whether Rasa 3.1 officially supports NLU training on multiple GPUs.

I have a VM with 4 x Tesla K80. I tried to run the NLU training on that VM in a Docker container (tensorflow:2.7.3-gpu) with Rasa 3.1 installed.

According to the log, the 4 GPUs are identified correctly, but only one of them is actually used by the training task.

2022-06-01 08:05:06.964337: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA

To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.

2022-06-01 08:05:09.037665: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.

2022-06-01 08:05:09.037751: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 10792 MB memory: -> device: 0, name: Tesla K80, pci bus id: 782d:00:00.0, compute capability: 3.7

2022-06-01 08:05:09.040616: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.

2022-06-01 08:05:09.040665: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 10792 MB memory: -> device: 1, name: Tesla K80, pci bus id: 9072:00:00.0, compute capability: 3.7

2022-06-01 08:05:09.041878: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.

2022-06-01 08:05:09.041911: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 10792 MB memory: -> device: 2, name: Tesla K80, pci bus id: a530:00:00.0, compute capability: 3.7

2022-06-01 08:05:09.043083: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.

2022-06-01 08:05:09.043117: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:3 with 10792 MB memory: -> device: 3, name: Tesla K80, pci bus id: b5f3:00:00.0, compute capability: 3.7

All model checkpoint layers were used when initializing TFBertModel.

Only GPU 1 is actually being used.

I tried enabling/disabling TF_GPU_MEMORY_ALLOC and TF_FORCE_GPU_ALLOW_GROWTH as suggested in Tuning Your NLU Model.

The result is always the same: only one GPU is used.
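As far as I can tell, those two variables only control how much memory TensorFlow is allowed to grab on each visible GPU, not which devices the training runs on. Just to illustrate what they roughly correspond to in plain TensorFlow (this is only a sketch of my understanding, not Rasa’s code):

import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")

# TF_FORCE_GPU_ALLOW_GROWTH=true roughly corresponds to:
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

# TF_GPU_MEMORY_ALLOC="0:2048, 1:2048, ..." roughly corresponds to capping
# each GPU at a fixed amount of memory instead (don't combine with the above):
# for gpu in gpus:
#     tf.config.set_logical_device_configuration(
#         gpu, [tf.config.LogicalDeviceConfiguration(memory_limit=2048)]
#     )

So however they are set, as far as I understand they should not change which device the training actually runs on.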

Also, I have read all the similar topics in this forum and found that I’m not the only one having this issue, but I cannot find any solution in those topics.

Does Rasa 3.1 support multiple GPUs?

Thanks for this post :wink: I actually have the exact same issue. Looking forward to some feedback from the Rasa team. Did you manage to figure it out?

For me it’s always the same with different commands, for example:

CUDA_VISIBLE_DEVICES=0,1,2,3 rasa train
TF_GPU_MEMORY_ALLOC="0:2048, 1:2048, 2:2048, 3:2048" rasa train

As you can see in nvidia-smi, only one GPU is utilized and no speed-up can be observed :confused:


+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.129.06   Driver Version: 470.129.06   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000001:00:00.0 Off |                    0 |
| N/A   41C    P8    29W / 149W |  10456MiB / 11441MiB |     57%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           On   | 00000002:00:00.0 Off |                    0 |
| N/A   32C    P8    38W / 149W |    148MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           On   | 00000003:00:00.0 Off |                    0 |
| N/A   39C    P8    27W / 149W |    148MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           On   | 00000004:00:00.0 Off |                    0 |
| N/A   32C    P8    32W / 149W |    148MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I’m currently trying to modify Rasa’s internal code to force MirroredStrategy() during training, but no luck yet.

Very disappointing that such a basic feature for an ML use case (training on multiple GPUs) causes so much trouble.

No luck yet.

As far as I know, the DIETClassifier does not support multiple GPUs.

Since DIETClassifier is a Keras model, we may need to override its train() function to make it compatible with multi-GPU training. You can find some examples here:

Did not have much time to dig into this recently.
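For reference, a minimal sketch of what the multi-GPU wrapping looks like for a plain Keras model (this is not Rasa’s actual code, just an illustration of tf.distribute.MirroredStrategy, which replicates the model on all visible GPUs and lets fit() split each batch across them):

import numpy as np
import tensorflow as tf

# Replicate the model on all visible GPUs.
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Model and optimizer must be created inside the strategy scope.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(2, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Dummy data, just for the sketch.
x = np.random.random((1024, 32)).astype("float32")
y = np.random.randint(0, 2, size=(1024,))

# fit() distributes each batch of 64 across the replicas automatically.
model.fit(x, y, batch_size=64, epochs=1)

The tricky part is that Rasa builds and fits the DIET model internally, so the model creation would have to happen inside such a scope, which is what the earlier post was attempting.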

Unfortunately, I didn’t manage to rewrite this to make it work yet. Not as simple as I thought… Curious why the Rasa team is so silent about it :sweat_smile: It’s kind of a major/critical bug not to be able to run Keras training on multiple GPUs.

Could you help me? I have a GPU installed on my machine and I want to do the training on it. What should I do?

I have spent nearly one week searching and trying things with no result :pensive: :broken_heart:

If you just want to run your training on a GPU, you can use Rasa’s official GPU Docker image, or use a TensorFlow GPU image and install Rasa yourself. You need to run the GPU image interactively, e.g. docker run -it <docker_gpu_image_name> bash, and use a volume to mount the training data from your host machine.
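Once inside the container, a quick way to confirm that TensorFlow actually sees the GPU before starting rasa train is something like:

import tensorflow as tf

# Should print one entry per visible GPU; an empty list means the container
# has no GPU access (check that it was started with --gpus all and that the
# NVIDIA drivers / container toolkit are installed on the host).
print(tf.config.list_physical_devices("GPU"))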

I tried to follow the steps in this link, but when executing this command

docker run -it --gpus all -v $PWD:/tmp gcr.io/rasa-platform/rasa:3.0.8-full-gpu run 

it takes me to authentication errors with the Google platform.

I tried to search for this image on the internet but found nothing. I don’t know if there is a problem with the Google Cloud console or not.

So can you help me with a tested image version that you used and installed successfully, so I can test it on my side following your steps?