Rasa X not saving models and crashing

ChrisRahme · March 26, 2021, 6:31pm

Hello.

I use quick-install to deploy Rasa X on-premises. I had to remove the cluster and reinstall it. But now, when I upload a model, it appears on the models page but disappears as soon as I leave the page, and I can’t talk to the bot.

If I try to train or upload a second model, it fails, and the rasa-production pod enters a CrashLoopBackOff for no apparent reason:

2021-03-26 18:17:37.049277: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory
2021-03-26 18:17:37.049317: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-03-26 18:17:41.499110: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2021-03-26 18:17:41.499176: W tensorflow/stream_executor/cuda/cuda_driver.cc:312] failed call to cuInit: UNKNOWN ERROR (303)
2021-03-26 18:17:41.499219: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (rasa-rasa-production-994dfcf88-cx2r7): /proc/driver/nvidia/version does not exist
2021-03-26 18:18:17.000696: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-03-26 18:18:17.047030: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2399995000 Hz
2021-03-26 18:18:17.049631: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5074750 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-03-26 18:18:17.049663: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version

followed by a CrashLoopBackOff for the rasa-worker pod:

2021-03-26 18:24:31.155515: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory
2021-03-26 18:24:31.155601: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-03-26 18:24:36.003054: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2021-03-26 18:24:36.003132: W tensorflow/stream_executor/cuda/cuda_driver.cc:312] failed call to cuInit: UNKNOWN ERROR (303)
2021-03-26 18:24:36.003162: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (rasa-rasa-worker-686f7cd875-cltnk): /proc/driver/nvidia/version does not exist
2021-03-26 18:25:14.909385: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-03-26 18:25:14.918618: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2399995000 Hz
2021-03-26 18:25:14.919102: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x483bf60 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-03-26 18:25:14.919131: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2021-03-26 18:25:32.592331: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 28073984 exceeds 10% of free system memory.
2021-03-26 18:25:32.622427: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 28073984 exceeds 10% of free system memory.
2021-03-26 18:25:32.633460: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 28073984 exceeds 10% of free system memory.
2021-03-26 18:25:32.672083: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 28000256 exceeds 10% of free system memory.
2021-03-26 18:25:32.694705: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 28000256 exceeds 10% of free system memory.

In the second log there is clearly something wrong (the memory allocation warnings). But I had no problems uploading models before reinstalling Rasa X.

The server also becomes really slow in general. What I type into the terminal shows up after 10 seconds or more.

This is on versions 0.38.1 as well as 0.35.0. I have not tried other versions.

Result of vmstat -s when not doing anything:

      8119936 K total memory
      3330988 K used memory
      1634264 K active memory
      1987712 K inactive memory
      3043400 K free memory
        29820 K buffer memory
      1715728 K swap cache
      2097148 K total swap
      2096996 K used swap
          152 K free swap

Result of vmstat -s when uploading a model:

      8119936 K total memory
      7111748 K used memory
      4154472 K active memory
      2201856 K inactive memory
       129808 K free memory
        12540 K buffer memory
       865840 K swap cache
      2097148 K total swap
      2097148 K used swap
            0 K free swap

Is this enough memory? I ask because the cluster requirements on the Helm chart installation page add up to over 9 GB of RAM, while the quick-install page says it needs a minimum of 4 GB.
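
As a rough way to read the two vmstat samples above, here is a minimal Python sketch that turns `vmstat -s` output into percentages. The function names `parse_vmstat` and `memory_pressure` are made up for illustration, and only the lines relevant to the calculation are copied from the samples.

```python
def parse_vmstat(text):
    """Parse `vmstat -s` lines of the form '<number> K <label>' into {label: KiB}."""
    stats = {}
    for line in text.strip().splitlines():
        parts = line.split(maxsplit=2)
        if len(parts) == 3 and parts[1] == "K":
            stats[parts[2].strip()] = int(parts[0])
    return stats


def memory_pressure(stats):
    """Return free RAM and used swap as percentages of their totals."""
    return {
        "free_memory_pct": round(100 * stats["free memory"] / stats["total memory"], 1),
        "used_swap_pct": round(100 * stats["used swap"] / stats["total swap"], 1),
    }


# Figures copied from the two samples in this post (subset of lines).
IDLE = """
      8119936 K total memory
      3043400 K free memory
      2097148 K total swap
      2096996 K used swap
          152 K free swap
"""

UPLOADING = """
      8119936 K total memory
       129808 K free memory
      2097148 K total swap
      2097148 K used swap
            0 K free swap
"""

for label, sample in (("idle", IDLE), ("uploading", UPLOADING)):
    print(label, memory_pressure(parse_vmstat(sample)))
```

On these figures, swap is already essentially full while idle (152 K free), and free RAM drops from about 37% of the total to under 2% during the upload.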

Update:

If I upload small models of ~30 MB, it works fine. But if I upload a 100 MB model after it, I’ll need to delete and deploy the whole thing again. There’s a “limit” of about 150 MB.

If I manage to upload a 100 MB model and activate it, and it doesn’t disappear, as soon as I talk to the bot, the rasa-worker pod crashes (Error):

2021-03-26 21:46:12 WARNING  rasa.core.tracker_store  - (psycopg2.OperationalError) FATAL:  sorry, too many clients already

(Background on this error at: http://sqlalche.me/e/13/e3q8)
2021-03-26 21:46:17 WARNING  rasa.core.tracker_store  - (psycopg2.OperationalError) FATAL:  sorry, too many clients already

(Background on this error at: http://sqlalche.me/e/13/e3q8)
2021-03-26 21:46:22 WARNING  rasa.core.tracker_store  - (psycopg2.OperationalError) FATAL:  sorry, too many clients already

(Background on this error at: http://sqlalche.me/e/13/e3q8)
2021-03-26 21:46:28 WARNING  rasa.core.tracker_store  - (psycopg2.OperationalError) FATAL:  sorry, too many clients already

(Background on this error at: http://sqlalche.me/e/13/e3q8)
2021-03-26 21:46:33 WARNING  rasa.core.tracker_store  - (psycopg2.OperationalError) FATAL:  sorry, too many clients already

(Background on this error at: http://sqlalche.me/e/13/e3q8)

Too many clients? I’m the only one using it.

mloubser · April 12, 2021, 11:31am

The number of clients allowed is configured on the SQL database itself - it should be set to at least 60, since it can hit up to 50 + any manual/other connections

mloubser · April 12, 2021, 11:32am

How have you defined prod/worker resource limits in your values? You should follow the guidelines from the method you used - your first link is to a helm chart installation, the second to a quick-installation. If you’re going into prod, you should use the helm chart’s resource recommendations as the guide.

ChrisRahme · April 12, 2021, 11:53am

Thanks a lot for the reply. I have since completely reset the server and everything works fine.

The number of clients allowed is configured on the SQL database itself - it should be set to at least 60, since it can hit up to 50 + any manual/other connections

The weird thing is that I was the only one using it, so how can there be too many clients?

Anyway, how do I do that? The chatbot needs to talk to hundreds of clients at once.

How have you defined prod/worker resource limits in your values? You should follow the guidelines from the method you used - your first link is to a helm chart installation, the second to a quick-installation. If you’re going into prod, you should use the helm chart’s resource recommendations as the guide.

I used the quick-install method, and have not modified the values. I can still do helm upgrade, so it’s the same, right?

Even though I passed the Advanced Deployment Workshop, I’m not really confident in my Helm/Kubernetes skills. Is there a way to get the values.yml used there so I can compare and build my own?

mloubser · April 12, 2021, 12:52pm

Ah, these are SQL clients, not per user, so don’t worry about the hundreds. This is more likely to be a thing if you’re using some external database, not the default one. If you’re using quick-install, then yeah, it would be weird if that got reset.
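
To illustrate why the count depends on services rather than chat users, here is a minimal Python sketch of a connection budget. It assumes each service holds one SQLAlchemy engine with the default QueuePool settings (`pool_size=5`, `max_overflow=10`); which Rasa processes actually open engines, and with what pool settings, is an assumption here, and the `EnginePool` class is purely illustrative.

```python
from dataclasses import dataclass


@dataclass
class EnginePool:
    """One SQLAlchemy engine's connection pool (illustrative, not a Rasa class)."""
    owner: str
    pool_size: int = 5      # SQLAlchemy QueuePool default
    max_overflow: int = 10  # SQLAlchemy QueuePool default

    @property
    def peak(self):
        # A pool can open at most pool_size + max_overflow connections.
        return self.pool_size + self.max_overflow


def peak_connections(pools, manual=0):
    """Worst-case number of database connections, plus manual sessions."""
    return sum(pool.peak for pool in pools) + manual


# Hypothetical deployment: the list of engine owners is an assumption.
pools = [EnginePool("rasa-production"), EnginePool("rasa-worker"), EnginePool("rasa-x")]

limit = 60  # the minimum suggested earlier in this thread
needed = peak_connections(pools, manual=5)
print(f"peak connections: {needed}, limit: {limit}, fits: {needed <= limit}")
```

The number of people chatting with the bot does not appear anywhere in the calculation; only the number of pools and their sizes do.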

Arjaan · April 12, 2021, 1:45pm

@ChrisRahme, this is the values.yml that was used during the advanced deployment workshop.

# debugMode enables / disables the debug mode for Rasa and Rasa X
debugMode: true
# custom action server
app:
    # from microk8s built-in registry
    name: "localhost:32000/deployment-workshop-bot-2-action-server"
    tag: "0.0.1"
nginx:
  service:
    # connect LoadBalancer directly to VMs' internal IP
    # You get this value with: $ hostname -I
    externalIPs: [10.150.0.8]
# rasax specific settings
rasax:
    # initialUser is the user which is created upon the initial start of Rasa X
    initialUser:
        # password for the Rasa X user
        password: "workshop"
    # passwordSalt Rasa X uses to salt the user passwords
    passwordSalt: "<safe credential>"
    # token Rasa X accepts as authentication token from other Rasa services
    token: "<safe credential>"
    # jwtSecret which is used to sign the jwtTokens of the users
    jwtSecret: "<safe credential>"
    # tag refers to the Rasa X image tag
    tag: "0.32.2"
# rasa: Settings common for all Rasa containers
rasa:
    # token Rasa accepts as authentication token from other Rasa services
    token: "<safe credential>"
    # tag refers to the Rasa image tag
    tag: "1.10.14-full"
    versions:
        # rasaProduction is the container which serves the production environment
        rasaProduction:
            # replicaCount of the Rasa Production container
            replicaCount: 1
# RabbitMQ specific settings
rabbitmq:
    # rabbitmq settings of the subchart
    rabbitmq:
        # password which is used for the authentication
        password: "<safe credential>"
# global settings of the used subcharts
global:
    # postgresql: global settings of the postgresql subchart
    postgresql:
        # postgresqlPassword is the password which is used when the postgresqlUsername equals "postgres"
        postgresqlPassword: "<safe credential>"
    # redis: global settings of the redis subchart
    redis:
        # password to use in case no external secret was provided
        password: "<safe credential>"
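
For comparing a values file like the one above against your own, one option is a small recursive diff. This is a minimal sketch that works on already-parsed dictionaries; loading the files themselves would need a YAML parser such as PyYAML's `yaml.safe_load`, which is not shown. The `mine` dictionary is hypothetical, and `workshop` is only a small excerpt of the file above.

```python
def flatten(values, prefix=""):
    """Flatten nested dicts into {'a.b.c': value} for easy comparison."""
    flat = {}
    for key, val in values.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(val, dict):
            flat.update(flatten(val, path))
        else:
            flat[path] = val
    return flat


def diff_values(mine, reference):
    """Return {path: (my_value, reference_value)} for every differing key."""
    a, b = flatten(mine), flatten(reference)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(a.keys() | b.keys())
        if a.get(key) != b.get(key)
    }


# Excerpt of the workshop values above.
workshop = {
    "debugMode": True,
    "rasax": {"tag": "0.32.2"},
    "rasa": {"tag": "1.10.14-full", "versions": {"rasaProduction": {"replicaCount": 1}}},
}

# Hypothetical values for your own deployment.
mine = {
    "rasax": {"tag": "0.38.1"},
    "rasa": {"versions": {"rasaProduction": {"replicaCount": 1}}},
}

for path, (my_value, ref_value) in diff_values(mine, workshop).items():
    print(f"{path}: mine={my_value!r} workshop={ref_value!r}")
```

Keys missing on one side show up as `None`, which makes it easy to spot settings the workshop file defines and yours does not.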

ChrisRahme · April 12, 2021, 6:53pm

Oh yeah, I’m using an external MySQL database to store conversations (defined in endpoints.yml). I’ll check out how to change the number of clients in case it happens again.

Thanks for the help!

ChrisRahme · April 12, 2021, 6:53pm

Thanks a lot @Arjaan!
