GPU over 2X slower than CPU

I just tried using a GPU for training and it is much slower than the CPU. I’m using an i7 8700 CPU and an NVIDIA RTX 2070 GPU. The GPU is over 2X slower, and I’m trying to figure out why. This happens on both NLU and Core training.

I have the appropriate tensorflow-gpu version, CUDA, and cuDNN installed.

While Anaconda shows keras-gpu as available, pip doesn’t seem to be able to find it, so I’ve only installed the GPU version of TensorFlow (tensorflow-gpu).
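For reference, this is how I sanity-checked that the tensorflow-gpu install actually sees the card (a generic TF 1.x check, nothing Rasa-specific):

```python
import tensorflow as tf
from tensorflow.python.client import device_lib

# List the devices TensorFlow can see; a working GPU setup should
# include a /device:GPU:0 entry alongside the CPU.
print(device_lib.list_local_devices())

# TF 1.x helper: True if a CUDA-enabled GPU is usable by TensorFlow.
print(tf.test.is_gpu_available())
```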

Core training shows:

CPU:
Epoch 1/100 8734/8734 [==============================] - 1s 110us/step - loss: 2.4849 - acc: 0.3525
Epoch 2/100 8734/8734 [==============================] - 1s 75us/step - loss: 1.7924 - acc: 0.4560
Epoch 3/100 8734/8734 [==============================] - 1s 75us/step - loss: 1.2614 - acc: 0.6503
Epoch 4/100 8734/8734 [==============================] - 1s 75us/step - loss: 0.8581 - acc: 0.7908
Epoch 5/100 8734/8734 [==============================] - 1s 75us/step - loss: 0.6043 - acc: 0.8576

GPU:
Epoch 1/100 9114/9114 [==============================] - 2s 224us/step - loss: 2.4962 - acc: 0.3366
Epoch 2/100 9114/9114 [==============================] - 2s 185us/step - loss: 1.7185 - acc: 0.5086
Epoch 3/100 9114/9114 [==============================] - 2s 186us/step - loss: 1.2080 - acc: 0.6572
Epoch 4/100 9114/9114 [==============================] - 2s 185us/step - loss: 0.8227 - acc: 0.8024

I have verified that the GPU is being used by watching nvtop - training uses about 39% of the GPU.
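Utilisation alone doesn’t show which ops actually land on the GPU, so I also checked device placement. A minimal TF 1.x sketch of that check:

```python
import tensorflow as tf

# Log the device each op is assigned to; placement lines ending in
# "/device:GPU:0" confirm the op really executes on the GPU.
config = tf.ConfigProto(log_device_placement=True)
with tf.Session(config=config) as sess:
    a = tf.random_normal([1000, 1000])
    b = tf.random_normal([1000, 1000])
    print(sess.run(tf.reduce_sum(tf.matmul(a, b))))
```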

Any recommendations on what I can look into and/or try?

Is this for NLU or Core?

The timing shown is for Core, but NLU is slower as well - perhaps not by as much as Core, but still slower.

Alright, so this is because LSTMs aren’t really optimised for GPUs. There is a GPU implementation of the LSTM in TensorFlow, but we don’t have that in our repo at the moment. I don’t think it would speed things up much on the amount of data you’re using anyway.
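For context, the GPU implementation referred to is the cuDNN-fused LSTM that ships with TF 1.x Keras. A minimal sketch of the difference (illustrative only, not what Rasa does internally):

```python
import tensorflow as tf

# Plain LSTM: generic kernel that steps through time as many small ops,
# so a GPU gains little, especially on small datasets.
lstm = tf.keras.layers.LSTM(64)

# cuDNN-fused LSTM: a single optimised GPU kernel, usually much faster,
# but GPU-only and with fewer supported options (e.g. no custom activations).
cudnn_lstm = tf.keras.layers.CuDNNLSTM(64)
```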

@pax I see similar results (GPU vs pure CPU):

CUDA: Training finished. NLU training took 2133.248753786087 s.

no gpu: Training finished. NLU training took 2188.424062728882 s.

Have you achieved any better speedup?

Best.

Here is my test on the same dataset with the same settings for NLU training:

pipeline:

  • name: "SpacyNLP"
  • name: "SpacyTokenizer"
  • name: "CountVectorsFeaturizer"
  • name: "EmbeddingIntentClassifier"

CPU time: 170 seconds
GPU time: 110 seconds

So the GPU is about 55% faster (170 s / 110 s ≈ 1.55× speedup).

GPU info:

2019-06-04 09:29:57.918971: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:01:00.0
totalMemory: 10.73GiB freeMemory: 10.23GiB
2019-06-04 09:29:57.919520: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 1 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:02:00.0
totalMemory: 10.73GiB freeMemory: 10.53GiB
2019-06-04 09:29:57.919630: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0, 1
2019-06-04 09:29:57.923027: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-06-04 09:29:57.923035: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 1
2019-06-04 09:29:57.923038: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N Y
2019-06-04 09:29:57.923040: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 1:   Y N
2019-06-04 09:29:57.923238: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9953 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5)
2019-06-04 09:29:57.923435: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10247 MB memory) -> physical GPU (device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:02:00.0, compute capability: 7.5)

In short, a CPU has a very small number of cores, each of which can do different things and can handle very complex logic. A GPU has thousands of cores that operate in lockstep but can only handle simple logic. Therefore the overall processing throughput of a GPU can be massively higher. But moving logic from the CPU to the GPU isn’t easy.
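A toy illustration of that throughput gap: pin the same large matrix multiply (dense, uniform, parallel work - exactly what GPUs excel at) to each device and time it. Rough TF 1.x sketch, actual timings will vary by hardware:

```python
import time
import tensorflow as tf

def time_matmul(device):
    # Force the op onto one device and time a large matrix multiply.
    with tf.device(device):
        a = tf.random_normal([4000, 4000])
        product = tf.matmul(a, a)
    with tf.Session() as sess:
        sess.run(product)            # warm-up: kernel launch, memory transfer
        start = time.time()
        sess.run(product)
        return time.time() - start

print("CPU: %.3fs" % time_matmul("/cpu:0"))
print("GPU: %.3fs" % time_matmul("/device:GPU:0"))
```

On a matmul this size the GPU typically wins by a wide margin; the LSTM training above doesn’t see that win because its work arrives as many small sequential steps rather than one big parallel op.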