Run NLU training on multiple GPUs

Hello Rasa team,

I’m wondering whether Rasa 3.1 officially supports NLU training on multiple GPUs.

I have a VM with 4 x Tesla K80 GPUs. I tried to run NLU training on that VM in a Docker container (tensorflow:2.7.3-gpu) with Rasa 3.1 installed.

According to the log, the 4 GPUs are identified correctly, but only one of them is actually used by the training task.
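(As a sanity check, this is the kind of snippet I used to confirm TensorFlow sees all four cards inside the container — just an illustration, not Rasa code:)

```python
import tensorflow as tf

# List the physical GPU devices TensorFlow can see in this environment.
gpus = tf.config.list_physical_devices("GPU")
print(f"{len(gpus)} GPU(s) visible:")
for gpu in gpus:
    print(" ", gpu.name)
```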

2022-06-01 08:05:06.964337: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA

To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.

2022-06-01 08:05:09.037665: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.

2022-06-01 08:05:09.037751: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 10792 MB memory: -> device: 0, name: Tesla K80, pci bus id: 782d:00:00.0, compute capability: 3.7

2022-06-01 08:05:09.040616: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.

2022-06-01 08:05:09.040665: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 10792 MB memory: -> device: 1, name: Tesla K80, pci bus id: 9072:00:00.0, compute capability: 3.7

2022-06-01 08:05:09.041878: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.

2022-06-01 08:05:09.041911: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 10792 MB memory: -> device: 2, name: Tesla K80, pci bus id: a530:00:00.0, compute capability: 3.7

2022-06-01 08:05:09.043083: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.

2022-06-01 08:05:09.043117: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:3 with 10792 MB memory: -> device: 3, name: Tesla K80, pci bus id: b5f3:00:00.0, compute capability: 3.7

All model checkpoint layers were used when initializing TFBertModel.

Only one GPU is actually in use.

I tried enabling/disabling TF_GPU_MEMORY_ALLOC and TF_FORCE_GPU_ALLOW_GROWTH as suggested in “Tuning Your NLU Model”.
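(Concretely, what I tried looks roughly like this — the per-GPU memory values are just examples:)

```shell
# Pre-allocate a fixed amount of memory (in MiB) per GPU index...
export TF_GPU_MEMORY_ALLOC="0:2048, 1:2048, 2:2048, 3:2048"
# ...or let TensorFlow grow its GPU memory usage on demand instead
export TF_FORCE_GPU_ALLOW_GROWTH=true
rasa train nlu
```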

The result is always the same: only one GPU is used.

Also, I read all the similar topics in this forum and found I’m not the only one having this issue, but I couldn’t find a solution in any of them.

Does Rasa 3.1 support training on multiple GPUs?


Thanks for this post :wink: I actually have the exact same issue. Looking forward to some feedback from the Rasa team. Did you manage to figure it out?

For me it’s always the same across different commands, for example:

CUDA_VISIBLE_DEVICES=0,1,2,3 rasa train
TF_GPU_MEMORY_ALLOC="0:2048, 1:2048, 2:2048, 3:2048" rasa train

As you can see in nvidia-smi, only one GPU is utilized and no speed-up can be observed :confused:


+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.129.06   Driver Version: 470.129.06   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000001:00:00.0 Off |                    0 |
| N/A   41C    P8    29W / 149W |  10456MiB / 11441MiB |     57%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           On   | 00000002:00:00.0 Off |                    0 |
| N/A   32C    P8    38W / 149W |    148MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           On   | 00000003:00:00.0 Off |                    0 |
| N/A   39C    P8    27W / 149W |    148MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           On   | 00000004:00:00.0 Off |                    0 |
| N/A   32C    P8    32W / 149W |    148MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I’m currently trying to modify Rasa’s internal code to force MirroredStrategy() during training, but no luck yet.
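The idea is roughly the following (a standalone sketch of the tf.distribute.MirroredStrategy pattern, not Rasa’s actual training code; it falls back to the CPU when no GPU is present):

```python
import numpy as np
import tensorflow as tf

# MirroredStrategy replicates the model on every visible GPU and splits
# each batch across the replicas; without GPUs it runs on the CPU.
strategy = tf.distribute.MirroredStrategy()
print("replicas in sync:", strategy.num_replicas_in_sync)

# Anything that creates variables (model, optimizer) must be built
# inside the strategy scope.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# Dummy data, just to show that fit() runs under the strategy.
x = np.random.rand(64, 8).astype("float32")
y = np.random.rand(64, 1).astype("float32")
history = model.fit(x, y, epochs=1, batch_size=16, verbose=0)
```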

It’s very disappointing that such a basic feature for ML use cases (training on multiple GPUs) causes so much trouble.


No luck yet.

As far as I know, the DIETClassifier does not support multiple GPUs.

Since DIETClassifier is a Keras model, we may need to override its train() function to make it compatible with multi-GPU training. You can find some examples here:
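For illustration, here is what a subclassed Keras model with a custom training step looks like when run under MirroredStrategy (a toy stand-in, not DIETClassifier itself — Keras wraps train_step in strategy.run() and aggregates the gradients across replicas):

```python
import numpy as np
import tensorflow as tf

class TinyModel(tf.keras.Model):
    """Minimal stand-in for a subclassed model such as DIETClassifier
    (hypothetical -- not Rasa's actual class)."""

    def __init__(self):
        super().__init__()
        self.dense = tf.keras.layers.Dense(1)

    def call(self, inputs):
        return self.dense(inputs)

    def train_step(self, data):
        # A custom train_step still works under MirroredStrategy: Keras
        # runs it on each replica and combines the reported metrics.
        x, y = data
        with tf.GradientTape() as tape:
            pred = self(x, training=True)
            loss = tf.reduce_mean(tf.square(y - pred))  # plain MSE
        grads = tape.gradient(loss, self.trainable_variables)
        self.optimizer.apply_gradients(zip(grads, self.trainable_variables))
        return {"loss": loss}

strategy = tf.distribute.MirroredStrategy()  # falls back to CPU if no GPU
with strategy.scope():
    model = TinyModel()
    model.compile(optimizer="adam")

x = np.random.rand(32, 4).astype("float32")
y = np.random.rand(32, 1).astype("float32")
history = model.fit(x, y, epochs=1, batch_size=8, verbose=0)
```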

I haven’t had much time to dig into this recently.