I’m wondering: does Rasa 3.1 officially support NLU training on multiple GPUs?
I have a VM with 4 x Tesla K80. I tried to run the NLU training on that VM in a docker container (tensorflow:2.7.3-gpu) with Rasa 3.1 installed.
According to the log, all 4 GPUs are identified correctly, but only one of them is actually used by the training task:
2022-06-01 08:05:06.964337: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-06-01 08:05:09.037665: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2022-06-01 08:05:09.037751: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 10792 MB memory: -> device: 0, name: Tesla K80, pci bus id: 782d:00:00.0, compute capability: 3.7
2022-06-01 08:05:09.040616: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2022-06-01 08:05:09.040665: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 10792 MB memory: -> device: 1, name: Tesla K80, pci bus id: 9072:00:00.0, compute capability: 3.7
2022-06-01 08:05:09.041878: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2022-06-01 08:05:09.041911: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 10792 MB memory: -> device: 2, name: Tesla K80, pci bus id: a530:00:00.0, compute capability: 3.7
2022-06-01 08:05:09.043083: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2022-06-01 08:05:09.043117: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:3 with 10792 MB memory: -> device: 3, name: Tesla K80, pci bus id: b5f3:00:00.0, compute capability: 3.7
All model checkpoint layers were used when initializing TFBertModel.
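By default, TensorFlow places all ops on the first visible GPU unless the code opts into a distribution strategy, so seeing four devices created does not mean all four get used. A quick way to confirm where ops actually land (plain TensorFlow, independent of Rasa; the matmul is just a dummy op):

```python
import tensorflow as tf

# Print a placement line for every op executed (call before any TF work).
tf.debugging.set_log_device_placement(True)

# All four K80s show up here...
print(tf.config.list_physical_devices("GPU"))

# ...but the placement log shows compute landing on /device:GPU:0 only.
a = tf.random.normal((1000, 1000))
b = tf.matmul(a, a)
```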
Unfortunately, I didn’t manage to rewrite this to make it work so far. It’s not as simple as I thought… I’m curious why the Rasa team is so silent about this. Not being able to run Keras training on multiple GPUs is kind of a major/critical bug.
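For anyone attempting the same rewrite: standard Keras only mirrors training across GPUs when the model is built and compiled inside a tf.distribute.MirroredStrategy scope. A minimal sketch of that pattern, with a placeholder model and dummy data (this is not Rasa’s DIET/TED code; where to hook the scope into Rasa’s training loop is exactly the unsolved part):

```python
import tensorflow as tf

# MirroredStrategy picks up all visible GPUs (all 4 K80s here).
strategy = tf.distribute.MirroredStrategy()
print(f"Replicas in sync: {strategy.num_replicas_in_sync}")

# Model creation and compilation must happen inside the scope,
# otherwise the variables are not mirrored and only GPU:0 is used.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(2, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Dummy data just to make the sketch runnable; the global batch is
# split evenly across the replicas during fit().
x = tf.random.normal((1024, 32))
y = tf.random.uniform((1024,), maxval=2, dtype=tf.int32)
model.fit(x, y, batch_size=256, epochs=1)
```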
If you just want to run your training with a GPU (a single one), you can use Rasa’s official GPU docker image, or use a TensorFlow GPU image and install Rasa yourself. You need to run the GPU image interactively, e.g. docker run -it <docker_gpu_image_name> bash, and use a volume to mount the training data from your host machine.
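Something like this, for illustration (the TensorFlow image tag is the one from the original post, the project path is a placeholder, and the --gpus flag requires the NVIDIA Container Toolkit on the host):

```bash
# Start the GPU container with the Rasa project mounted at /app
docker run --gpus all -it \
    -v /path/to/my-rasa-project:/app \
    tensorflow/tensorflow:2.7.3-gpu bash

# Inside the container:
pip install rasa==3.1.0
cd /app
rasa train nlu
```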