Error while using Tensorflow GPU 1.14.0

Hello,

Could anyone here help me out with an error that I’m facing after upgrading Rasa to 1.3.3?

The error is most likely caused by TF-GPU, because I don’t get any errors when I uninstall tf-gpu and run the ‘rasa train’ command.

I’ve re-installed TF-GPU 1.14.0, CUDA and cuDNN libraries.

CUDA - 10.0
cuDNN - 7.6.0 for CUDA 10.0
Python version - 3.7.4
Windows 10

2019-09-15 21:26:34 INFO     rasa.nlu.model  - Starting to train component WhitespaceTokenizer
2019-09-15 21:26:34 INFO     rasa.nlu.model  - Finished training component.
2019-09-15 21:26:34 INFO     rasa.nlu.model  - Starting to train component RegexFeaturizer
2019-09-15 21:27:01 INFO     rasa.nlu.model  - Finished training component.
2019-09-15 21:27:01 INFO     rasa.nlu.model  - Starting to train component CRFEntityExtractor
2019-09-15 21:27:12 INFO     rasa.nlu.model  - Finished training component.
2019-09-15 21:27:12 INFO     rasa.nlu.model  - Starting to train component EntitySynonymMapper
2019-09-15 21:27:12 INFO     rasa.nlu.model  - Finished training component.
2019-09-15 21:27:12 INFO     rasa.nlu.model  - Starting to train component CountVectorsFeaturizer
2019-09-15 21:27:12 INFO     rasa.nlu.model  - Finished training component.
2019-09-15 21:27:12 INFO     rasa.nlu.model  - Starting to train component CountVectorsFeaturizer
2019-09-15 21:27:13 INFO     rasa.nlu.model  - Finished training component.
2019-09-15 21:27:13 INFO     rasa.nlu.model  - Starting to train component EmbeddingIntentClassifier
2019-09-15 21:27:14.710581: F tensorflow/stream_executor/lib/statusor.cc:34] Attempting to fetch value instead of handling error Internal: failed to get device attribute 13 for device 0: CUDA_ERROR_UNKNOWN: unknown error

TensorFlow-GPU works for other ML-related operations on this machine; in multiple other cases it doesn’t throw any exception like the one it throws here.

Please help… I’ve been stuck on this for the past 3-4 days now.

@xames3 are you sure tensorflow-gpu is properly configured on your machine?

Hello and thanks @akelad for looking into this issue.

I’m quite sure that tensorflow-gpu is configured correctly on my system. If I just run rasa train nlu --fixed-model-name <my-model-name>, it generates the NLU model correctly without any hiccups using the GPU.

Similarly, if I run training for only the core model, it works too (the error TypeError: Object of type MaxHistoryTrackerFeaturizer is not JSON serializable is worked around with a temporary solution provided here).

But when I run rasa train --fixed-model-name <my-model-name>, the training starts correctly (core training works fine), and NLU training works fine up to the point of training the EmbeddingIntentClassifier.

After EmbeddingIntentClassifier, it throws this error:
Attempting to fetch value instead of handling error Internal: failed to get device attribute 13 for device 0: CUDA_ERROR_UNKNOWN: unknown error

I looked into this error here and here but found no fixes yet. I’m not sure if any other members are facing this issue, as I haven’t seen anyone report it either in the forums or in GitHub issues.

@Juste and @akelad, hope you could help me with this one…

it looks like it cannot handle two different tf sessions (nlu and core) in one process
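
If the crash really does come from two TensorFlow sessions sharing one process, one possible workaround is to train each model in its own process. The following is only a minimal sketch: it assumes the `rasa` CLI is on the PATH and that both `rasa train nlu` and `rasa train core` accept `--fixed-model-name` as used earlier in this thread. The helper names `build_train_commands` and `train_separately` and the `-nlu` / `-core` name suffixes are hypothetical.

```python
import subprocess


def build_train_commands(model_name):
    """Return one `rasa train` command per model type (hypothetical helper)."""
    return [
        ["rasa", "train", "nlu", "--fixed-model-name", f"{model_name}-nlu"],
        ["rasa", "train", "core", "--fixed-model-name", f"{model_name}-core"],
    ]


def train_separately(model_name, runner=subprocess.run):
    """Run each training step as a separate OS process.

    Each process creates and tears down its own TensorFlow session,
    so NLU and Core never share one CUDA context.
    """
    for command in build_train_commands(model_name):
        # check=True makes subprocess.run raise CalledProcessError
        # if a training step exits with a non-zero status.
        runner(command, check=True)
```

Calling `train_separately("my-model")` would then produce two separate model archives rather than one combined model, so this only helps if serving NLU and Core from separate models is acceptable.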

Yep, that’s correct. Sorry @Ghostvv, I was not able to respond to your comment as I was out for a while. Glad to see a newer version of Rasa (1.3.6) is out.

Hope that fixes this bug. :crossed_fingers:

Nah… the bug/issue still persists. :sleepy::expressionless:

this bug seems to be a TensorFlow problem, not a Rasa one

How much training data do you have? How much is cpu speed up?

we found some related GitHub issues: https://github.com/tensorflow/tensorflow/issues/28582 and "Call tf.Session() twice causes fatal error: failed to get device attribute 13 for device 0" (tensorflow/tensorflow issue #31795)

Yes, that’s correct. Rasa works flawlessly with the CPU version of TF 1.14.0.

Just a question if you don’t mind answering -

Do you guys use the CPU or GPU version of TF for development? Similarly, do you use Anaconda or the normal Python interpreter for testing, and which is preferable?

To be very honest @Ghostvv, it is not that much.

A maximum of 20-odd intents, and only 2-3 of them have more than 100 examples; the rest have roughly 50-60 examples each.

The CPU can handle that much load for now. Also, by speed up, did you mean the clock speed? It’s a 7th-gen i7 processor running at 2.8 GHz.

we use cpu, because with the amount of data people usually have, and because our algorithms are not very deep, gpu doesn’t provide any faster training
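
For anyone who wants to keep tensorflow-gpu installed for other work but train Rasa on the CPU, hiding the GPU from the training process is one option. This is a sketch under the assumption that setting the `CUDA_VISIBLE_DEVICES` environment variable to `-1` before TensorFlow initialises makes it fall back to the CPU; `cpu_only_env` is a hypothetical helper, not part of Rasa or TensorFlow.

```python
import os


def cpu_only_env(base_env=None):
    """Return a copy of the environment with all CUDA devices hidden."""
    env = dict(os.environ if base_env is None else base_env)
    # "-1" is not a valid device index, so CUDA exposes no GPUs
    # to the child process and TensorFlow should run on the CPU.
    env["CUDA_VISIBLE_DEVICES"] = "-1"
    return env
```

The result can be passed to a child process, for example `subprocess.run(["rasa", "train"], env=cpu_only_env(), check=True)`, which leaves the parent shell’s environment untouched.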

Any solution to this problem?

We are building a bot for an enterprise product. In testing, we see that a maximum of 50 concurrent users makes Rasa very slow. As we support a standalone setup, we want Rasa to handle more concurrent users so that they need to create fewer Rasa nodes (a manual effort).

So we tried tensorflow-gpu 1.14 and ended up with this error.

the error is related to CUDA; did you install the CUDA drivers correctly?

To test if your GPU is correctly configured, try this Python script:

from __future__ import absolute_import, division, print_function, unicode_literals
import tensorflow as tf

print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))

If it prints 0, something’s wrong. Usually when this happens, the error message will give you a clue.

Thank you for the input. I was also thinking that it must be a TensorFlow + GPU setup issue. The above code reports 1 GPU device for me.

The code below also returns the GPU device details. Maybe I will try to reinstall (CUDA + NVIDIA drivers) once again to see if it works.

tf.config.experimental.list_physical_devices('GPU')
2020-02-06 20:23:22.831000: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library nvcuda.dll
2020-02-06 20:23:23.653171: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: Quadro P2000 major: 6 minor: 1 memoryClockRate(GHz): 1.468
pciBusID: 0000:01:00.0
2020-02-06 20:23:23.688631: I tensorflow/stream_executor/platform/default/dlopen_checker_stub.cc:25] GPU libraries are statically linked, skip dlopen check.
2020-02-06 20:23:23.700465: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]