GPU over 2X slower than CPU

I just tried using a GPU for training and it is much slower than the CPU. I’m using an i7 8700 CPU and an NVIDIA RTX 2070 GPU. The GPU is over 2X slower, and I’m trying to figure out why. This happens on both NLU and Core training.

I have the appropriate tensorflow-gpu version, CUDA, and cuDNN installed.

While Anaconda shows keras-gpu as available, pip doesn’t seem to be able to find it, so I’ve only installed the GPU version of TensorFlow (tensorflow-gpu).
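For reference, this is how I sanity-checked that the tensorflow-gpu install actually sees the card (a generic TF 1.x check, nothing Rasa-specific):

```python
import tensorflow as tf
from tensorflow.python.client import device_lib

# List the devices TensorFlow can see; a working GPU setup should
# include a /device:GPU:0 entry alongside the CPU.
print(device_lib.list_local_devices())

# TF 1.x helper: True if a CUDA-enabled GPU is usable by TensorFlow.
print(tf.test.is_gpu_available())
```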

Core training shows:

CPU:
Epoch 1/100 8734/8734 [==============================] - 1s 110us/step - loss: 2.4849 - acc: 0.3525
Epoch 2/100 8734/8734 [==============================] - 1s 75us/step - loss: 1.7924 - acc: 0.4560
Epoch 3/100 8734/8734 [==============================] - 1s 75us/step - loss: 1.2614 - acc: 0.6503
Epoch 4/100 8734/8734 [==============================] - 1s 75us/step - loss: 0.8581 - acc: 0.7908
Epoch 5/100 8734/8734 [==============================] - 1s 75us/step - loss: 0.6043 - acc: 0.8576

GPU:
Epoch 1/100 9114/9114 [==============================] - 2s 224us/step - loss: 2.4962 - acc: 0.3366
Epoch 2/100 9114/9114 [==============================] - 2s 185us/step - loss: 1.7185 - acc: 0.5086
Epoch 3/100 9114/9114 [==============================] - 2s 186us/step - loss: 1.2080 - acc: 0.6572
Epoch 4/100 9114/9114 [==============================] - 2s 185us/step - loss: 0.8227 - acc: 0.8024

I have verified that the GPU is being used by watching nvtop - training uses about 39% of the GPU.
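Utilisation alone doesn’t show which ops actually land on the GPU, so I also checked device placement. A minimal TF 1.x sketch of that check:

```python
import tensorflow as tf

# Log the device each op is assigned to; placement lines ending in
# "/device:GPU:0" confirm the op really executes on the GPU.
config = tf.ConfigProto(log_device_placement=True)
with tf.Session(config=config) as sess:
    a = tf.random_normal([1000, 1000])
    b = tf.random_normal([1000, 1000])
    print(sess.run(tf.reduce_sum(tf.matmul(a, b))))
```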

Any recommendations on what I can look into and/or try?

Is this for NLU or Core?

The timing shown is for Core, but NLU is slower as well - perhaps not by as much as Core, but still slower.

Alright, so this is because LSTMs aren’t really optimised for GPUs. There is a GPU implementation of the LSTM in TensorFlow, but we don’t have that in our repo at the moment. I don’t think it would speed things up much on the amount of data you’re using anyway.
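For context, the GPU implementation referred to is the cuDNN-fused LSTM that ships with TF 1.x Keras. A minimal sketch of the difference (illustrative only, not what Rasa does internally):

```python
import tensorflow as tf

# Plain LSTM: generic kernel that steps through time as many small ops,
# so a GPU gains little, especially on small datasets.
lstm = tf.keras.layers.LSTM(64)

# cuDNN-fused LSTM: a single optimised GPU kernel, usually much faster,
# but GPU-only and with fewer supported options (e.g. no custom activations).
cudnn_lstm = tf.keras.layers.CuDNNLSTM(64)
```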

@pax I see similar results (GPU vs pure CPU):

CUDA: Training finished. NLU training took 2133.248753786087 s.

no gpu: Training finished. NLU training took 2188.424062728882 s.

Have you achieved any better speedup?

Best.

Here is my test on the same dataset with the same settings for NLU training:

pipeline:

  • name: "SpacyNLP"
  • name: "SpacyTokenizer"
  • name: "CountVectorsFeaturizer"
  • name: "EmbeddingIntentClassifier"

CPU time: 170 seconds
GPU time: 110 seconds

So the GPU is about 55% faster (170 s / 110 s ≈ 1.55× speedup).

GPU info:

2019-06-04 09:29:57.918971: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:01:00.0
totalMemory: 10.73GiB freeMemory: 10.23GiB
2019-06-04 09:29:57.919520: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 1 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:02:00.0
totalMemory: 10.73GiB freeMemory: 10.53GiB
2019-06-04 09:29:57.919630: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0, 1
2019-06-04 09:29:57.923027: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-06-04 09:29:57.923035: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 1
2019-06-04 09:29:57.923038: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N Y
2019-06-04 09:29:57.923040: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 1:   Y N
2019-06-04 09:29:57.923238: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9953 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5)
2019-06-04 09:29:57.923435: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10247 MB memory) -> physical GPU (device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:02:00.0, compute capability: 7.5)

In short, a CPU has a very small number of cores, each of which can do different things and can handle very complex logic. A GPU has thousands of cores that operate in lockstep but can only handle simple logic. Therefore the overall processing throughput of a GPU can be massively higher. But moving logic from the CPU to the GPU isn’t easy.
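A toy illustration of that throughput gap: pin the same large matrix multiply (dense, uniform, parallel work - exactly what GPUs excel at) to each device and time it. Rough TF 1.x sketch, actual timings will vary by hardware:

```python
import time
import tensorflow as tf

def time_matmul(device):
    # Force the op onto one device and time a large matrix multiply.
    with tf.device(device):
        a = tf.random_normal([4000, 4000])
        product = tf.matmul(a, a)
    with tf.Session() as sess:
        sess.run(product)            # warm-up: kernel launch, memory transfer
        start = time.time()
        sess.run(product)
        return time.time() - start

print("CPU: %.3fs" % time_matmul("/cpu:0"))
print("GPU: %.3fs" % time_matmul("/device:GPU:0"))
```

On a matmul this size the GPU typically wins by a wide margin; the LSTM training above doesn’t see that win because its work arrives as many small sequential steps rather than one big parallel op.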