pax
(Garry Paxinos)
March 4, 2019, 12:44am
1
I just tried using a GPU for my training and it is a lot slower than the CPU. I'm using an i7 8700 and an Nvidia 2070 GPU. The GPU is over 2x slower, and I'm trying to figure out why. This happens with both NLU and Core training.
I have the appropriate tensorflow-gpu version, CUDA, and cuDNN installed.
While Anaconda shows keras-gpu as available, pip doesn't seem to be able to find it… so I've only installed the GPU version of TensorFlow.
Core training shows:
CPU:
Epoch 1/100
8734/8734 [==============================] - 1s 110us/step - loss: 2.4849 - acc: 0.3525
Epoch 2/100
8734/8734 [==============================] - 1s 75us/step - loss: 1.7924 - acc: 0.4560
Epoch 3/100
8734/8734 [==============================] - 1s 75us/step - loss: 1.2614 - acc: 0.6503
Epoch 4/100
8734/8734 [==============================] - 1s 75us/step - loss: 0.8581 - acc: 0.7908
Epoch 5/100
8734/8734 [==============================] - 1s 75us/step - loss: 0.6043 - acc: 0.8576
GPU:
Epoch 1/100
9114/9114 [==============================] - 2s 224us/step - loss: 2.4962 - acc: 0.3366
Epoch 2/100
9114/9114 [==============================] - 2s 185us/step - loss: 1.7185 - acc: 0.5086
Epoch 3/100
9114/9114 [==============================] - 2s 186us/step - loss: 1.2080 - acc: 0.6572
Epoch 4/100
9114/9114 [==============================] - 2s 185us/step - loss: 0.8227 - acc: 0.8024
I have verified that the GPU is being used by watching ‘nvtop’ - training uses about 39% of the GPU according to nvtop.
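For completeness, here is a minimal sketch of how GPU placement can be double-checked beyond nvtop (assuming TensorFlow 1.x, which Rasa depended on at the time; these are just the standard TF APIs, nothing Rasa-specific):

import tensorflow as tf
from tensorflow.python.client import device_lib

# Confirm this build of TensorFlow was compiled with CUDA support
print(tf.__version__, tf.test.is_built_with_cuda())

# List the devices TensorFlow can see; the 2070 should appear as /device:GPU:0
print(device_lib.list_local_devices())

# Log where each op actually runs; "GPU:0" in the output means GPU execution
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    a = tf.random_normal([1000, 1000])
    b = tf.random_normal([1000, 1000])
    print(sess.run(tf.reduce_sum(tf.matmul(a, b))))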
Any recommendations on what I can look into and/or try?
pax
(Garry Paxinos)
March 14, 2019, 1:56pm
3
The timing shown is for Core, but I see NLU being slower as well - perhaps not as much as Core, but still slower.
akelad
(Akela Drissner)
March 14, 2019, 2:56pm
4
Alright, so this is because LSTMs aren't really optimised for GPUs. There is a GPU implementation of the LSTM in TensorFlow, but we don't have that in our repo at the moment. I don't think it would speed things up much on the amount of data you're using anyway.
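For anyone curious, the GPU implementation referred to here is the cuDNN-fused LSTM kernel. A rough illustration of the difference in plain tf.keras terms (assuming TensorFlow 1.x on a GPU build; this is not the Rasa code, just a sketch):

import tensorflow as tf

inputs = tf.keras.Input(shape=(32, 64))  # (timesteps, features)

# Standard LSTM: each timestep is a separate small op, so on a GPU most of
# the time goes into kernel-launch overhead rather than useful math.
standard = tf.keras.layers.LSTM(128)(inputs)

# CuDNNLSTM: the whole recurrence is fused into a single cuDNN kernel,
# which is where the GPU speedup for LSTMs actually comes from.
fused = tf.keras.layers.CuDNNLSTM(128)(inputs)

model = tf.keras.Model(inputs, [standard, fused])
model.summary()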
phojnacki
(Przemysław Hojnacki)
April 29, 2019, 10:19am
5
@pax I'm seeing similar results (GPU vs. pure CPU):
CUDA:
Training finished. NLU training took 2133.248753786087 s.
no gpu:
Training finished. NLU training took 2188.424062728882 s.
Have you achieved any better speedup?
Best.
lingvisa
(Lingvisa)
June 4, 2019, 4:38pm
6
This is my testing on the same data set and same settings for NLU training:
pipeline:
  - name: "SpacyNLP"
  - name: "SpacyTokenizer"
  - name: "CountVectorsFeaturizer"
  - name: "EmbeddingIntentClassifier"
CPU time: 170 seconds
GPU time: 110 seconds
So the GPU speeds it up by around 55% (170 s / 110 s ≈ 1.55x).
GPU info:
2019-06-04 09:29:57.918971: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:01:00.0
totalMemory: 10.73GiB freeMemory: 10.23GiB
2019-06-04 09:29:57.919520: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 1 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:02:00.0
totalMemory: 10.73GiB freeMemory: 10.53GiB
2019-06-04 09:29:57.919630: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0, 1
2019-06-04 09:29:57.923027: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-06-04 09:29:57.923035: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0 1
2019-06-04 09:29:57.923038: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N Y
2019-06-04 09:29:57.923040: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 1: Y N
2019-06-04 09:29:57.923238: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9953 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5)
2019-06-04 09:29:57.923435: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10247 MB memory) -> physical GPU (device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:02:00.0, compute capability: 7.5)
In short, a CPU has a very small number of cores, each of which can do different things and can handle very complex logic. A GPU has thousands of cores that operate in lockstep but can only handle simple logic. Therefore the overall processing throughput of a GPU can be massively higher. But moving logic from the CPU to the GPU isn’t easy.
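As a rough (hypothetical) illustration of that throughput difference, here is a large matrix multiplication pinned to each device, again assuming TensorFlow 1.x with a GPU build; a dense matmul maps well onto thousands of simple GPU cores, so the GPU run should finish much faster:

import time
import tensorflow as tf

def timed_matmul(device):
    tf.reset_default_graph()
    with tf.device(device):
        a = tf.random_normal([4000, 4000])
        b = tf.random_normal([4000, 4000])
        c = tf.reduce_sum(tf.matmul(a, b))
    with tf.Session() as sess:
        sess.run(c)                      # warm-up (kernel setup, transfers)
        start = time.time()
        sess.run(c)
        return time.time() - start

print("CPU:", timed_matmul("/cpu:0"))
print("GPU:", timed_matmul("/gpu:0"))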