Hello.
I use quick-install to deploy Rasa X on-premises. I had to remove the cluster and reinstall it. But now, when I upload a model, it appears on the models page, but disappears when I leave the page. I can’t talk to the bot.
If I try to train or upload a second model, it fails, and the rasa-production pod enters a CrashLoopBackOff for no apparent reason:
2021-03-26 18:17:37.049277: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory
2021-03-26 18:17:37.049317: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-03-26 18:17:41.499110: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2021-03-26 18:17:41.499176: W tensorflow/stream_executor/cuda/cuda_driver.cc:312] failed call to cuInit: UNKNOWN ERROR (303)
2021-03-26 18:17:41.499219: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (rasa-rasa-production-994dfcf88-cx2r7): /proc/driver/nvidia/version does not exist
2021-03-26 18:18:17.000696: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-03-26 18:18:17.047030: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2399995000 Hz
2021-03-26 18:18:17.049631: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5074750 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-03-26 18:18:17.049663: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
followed by a CrashLoopBackOff for the rasa-worker pod:
2021-03-26 18:24:31.155515: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory
2021-03-26 18:24:31.155601: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-03-26 18:24:36.003054: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2021-03-26 18:24:36.003132: W tensorflow/stream_executor/cuda/cuda_driver.cc:312] failed call to cuInit: UNKNOWN ERROR (303)
2021-03-26 18:24:36.003162: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (rasa-rasa-worker-686f7cd875-cltnk): /proc/driver/nvidia/version does not exist
2021-03-26 18:25:14.909385: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-03-26 18:25:14.918618: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2399995000 Hz
2021-03-26 18:25:14.919102: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x483bf60 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-03-26 18:25:14.919131: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2021-03-26 18:25:32.592331: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 28073984 exceeds 10% of free system memory.
2021-03-26 18:25:32.622427: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 28073984 exceeds 10% of free system memory.
2021-03-26 18:25:32.633460: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 28073984 exceeds 10% of free system memory.
2021-03-26 18:25:32.672083: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 28000256 exceeds 10% of free system memory.
2021-03-26 18:25:32.694705: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 28000256 exceeds 10% of free system memory.
In the second set of logs, something is clearly wrong. But I had no problems uploading models before reinstalling Rasa X.
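To put a number on the allocation warnings in the rasa-worker log, here is a rough sketch of the arithmetic. It assumes the figure in the message is in bytes and that the warning fires when a single allocation exceeds 10% of free memory, as the message text itself says. The function name is made up for illustration.

```python
# Rough arithmetic on the "exceeds 10% of free system memory" warning.
# Assumptions: the value in the log is in bytes, and the 10% threshold
# is exactly what the message text states.

ALLOCATION_BYTES = 28073984  # value copied from the rasa-worker log
WARNING_PERCENT = 10         # taken from the wording of the warning
MIB = 1024 * 1024


def free_memory_upper_bound_bytes(allocation_bytes, percent=WARNING_PERCENT):
    """If allocation > percent% of free memory, then free memory < allocation * 100 / percent."""
    return allocation_bytes * 100 // percent


bound = free_memory_upper_bound_bytes(ALLOCATION_BYTES)
print(f"single allocation: {ALLOCATION_BYTES / MIB:.2f} MiB")
print(f"free memory at that moment was below: {bound / MIB:.2f} MiB")
```

If those assumptions hold, a roughly 27 MiB allocation was enough to trigger the warning, which would mean the pod saw well under 300 MiB of free memory at that point.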
The server also becomes really slow in general. When I type into the terminal, the characters show up after 10 seconds or more.
This is on versions 0.38.1 as well as 0.35.0. I have not tried other versions.
Result of vmstat -s when not doing anything:
8119936 K total memory
3330988 K used memory
1634264 K active memory
1987712 K inactive memory
3043400 K free memory
29820 K buffer memory
1715728 K swap cache
2097148 K total swap
2096996 K used swap
152 K free swap
Result of vmstat -s when uploading a model:
8119936 K total memory
7111748 K used memory
4154472 K active memory
2201856 K inactive memory
129808 K free memory
12540 K buffer memory
865840 K swap cache
2097148 K total swap
2097148 K used swap
0 K free swap
Is this enough? I ask because one documentation page lists cluster requirements that add up to over 9 GB of RAM, while another says it needs a minimum of 4 GB.
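To compare the two vmstat -s snapshots above more directly, here is a minimal sketch that parses the output and converts the key figures. The helper names (parse_vmstat, summarize) are only for illustration, and it assumes the "K" values are KiB.

```python
import re

# The two `vmstat -s` snapshots from this post, copied verbatim.
IDLE = """
8119936 K total memory
3330988 K used memory
1634264 K active memory
1987712 K inactive memory
3043400 K free memory
29820 K buffer memory
1715728 K swap cache
2097148 K total swap
2096996 K used swap
152 K free swap
"""

UPLOADING = """
8119936 K total memory
7111748 K used memory
4154472 K active memory
2201856 K inactive memory
129808 K free memory
12540 K buffer memory
865840 K swap cache
2097148 K total swap
2097148 K used swap
0 K free swap
"""


def parse_vmstat(text):
    """Turn lines like '8119936 K total memory' into {'total memory': 8119936}."""
    stats = {}
    for line in text.strip().splitlines():
        match = re.match(r"\s*(\d+) K (.+)", line)
        if match:
            stats[match.group(2).strip()] = int(match.group(1))
    return stats


def summarize(stats):
    """Convert the headline numbers to friendlier units (assuming K means KiB)."""
    return {
        "total_gib": round(stats["total memory"] / (1024 * 1024), 2),
        "free_mib": round(stats["free memory"] / 1024, 1),
        "swap_used_pct": round(100 * stats["used swap"] / stats["total swap"], 2),
    }


for label, raw in (("idle", IDLE), ("uploading", UPLOADING)):
    print(label, summarize(parse_vmstat(raw)))
```

In both snapshots the 2 GiB of swap is essentially full, even when nothing is running, and free memory drops from about 2.9 GiB to about 127 MiB during the upload. That would fit with the terminal lag, although it does not by itself prove that memory is the cause of the crashes.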
Update:
If I upload small models of ~30 MB, it works fine. But if I upload a 100 MB model after that, I have to delete and redeploy the whole thing. There seems to be a “limit” of about 150 MB in total.
If I manage to upload a 100 MB model and activate it, and it doesn’t disappear, then as soon as I talk to the bot, the rasa-worker pod crashes (Error):
2021-03-26 21:46:12 WARNING rasa.core.tracker_store - (psycopg2.OperationalError) FATAL: sorry, too many clients already
(Background on this error at: http://sqlalche.me/e/13/e3q8)
2021-03-26 21:46:17 WARNING rasa.core.tracker_store - (psycopg2.OperationalError) FATAL: sorry, too many clients already
(Background on this error at: http://sqlalche.me/e/13/e3q8)
2021-03-26 21:46:22 WARNING rasa.core.tracker_store - (psycopg2.OperationalError) FATAL: sorry, too many clients already
(Background on this error at: http://sqlalche.me/e/13/e3q8)
2021-03-26 21:46:28 WARNING rasa.core.tracker_store - (psycopg2.OperationalError) FATAL: sorry, too many clients already
(Background on this error at: http://sqlalche.me/e/13/e3q8)
2021-03-26 21:46:33 WARNING rasa.core.tracker_store - (psycopg2.OperationalError) FATAL: sorry, too many clients already
(Background on this error at: http://sqlalche.me/e/13/e3q8)
Too many clients? I’m the only one using it.