Training and model upload failed in Rasa-X

Versions: Rasa 2.8.15, Rasa SDK 2.8.3, Rasa X latest

Please provide more info:

  • Does training work if you do rasa train?
  • What’s your method of deployment?

@ChrisRahme yes, it works with `rasa train`.

I have deployed rasa-x to the AWS EKS Cluster via the Helm chart.

I also connected it to a Git repository, but the training data was not visible in Rasa X. So I tried to upload it from my local machine, but I am not able to upload or train the model.

Okay, please check the logs of the rasa-worker pod after training, using `kubectl logs <pod_name>`.
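A small convenience sketch for that (an assumption on my part: it relies on the standard Helm chart label `app.kubernetes.io/component=rasa-worker`, which matches the labels in the `describe` output later in this thread; replace `rasahr` with your namespace):

```shell
# Look up the worker pod by its Helm label instead of copying the
# generated pod name, then follow its logs while training runs:
WORKER=$(kubectl -n rasahr get pods \
  -l app.kubernetes.io/component=rasa-worker \
  -o jsonpath='{.items[0].metadata.name}')
kubectl -n rasahr logs -f "$WORKER"
```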

@ChrisRahme My training data is synced with the Git repository and trains successfully on my local machine, but training fails in Rasa X. The logs of the rasa-worker pod are empty.

PS C:\WINDOWS\system32> kubectl --namespace rasahr logs rasa-x-1641746175-rasa-worker-6dcc468fd-f5kqm
PS C:\WINDOWS\system32>

@ChrisRahme should I check the logs of any other pods?

PS C:\WINDOWS\system32> kubectl --namespace rasahr get pods
NAME                                              READY   STATUS    RESTARTS   AGE
rasa-x-1641746175-app-7cc446595b-dwbfp            1/1     Running   0          17h
rasa-x-1641746175-db-migration-service-0          1/1     Running   0          17h
rasa-x-1641746175-duckling-67b647db54-wxn66       1/1     Running   0          17h
rasa-x-1641746175-event-service-9844b68db-vvjv8   1/1     Running   0          17h
rasa-x-1641746175-nginx-ccc677589-mrhhh           1/1     Running   0          17h
rasa-x-1641746175-postgresql-0                    1/1     Running   0          17h
rasa-x-1641746175-rabbit-0                        1/1     Running   0          17h
rasa-x-1641746175-rasa-worker-6dcc468fd-f5kqm     1/1     Running   0          17h
rasa-x-1641746175-rasa-x-5c85fbbd49-nwxvw         1/1     Running   0          17h
rasa-x-1641746175-redis-master-0                  1/1     Running   0          17h

@abhishekrathi please check your internet connection; if it is slow or unstable, training can fail. I have read the whole conversation between you and Chris, and everything seems fine to me: all your pods are running, and you are confident everything was done as per the documentation. Good luck!


Thanks for the quick response, but I have tried a different connection and the internet speed is also good. Everything went perfectly up to connecting the Git repository, but it gets stuck at model training at the end. I am not able to find the issue.

@abhishekrathi check your internet speed. Or are you using a VM?

Internet download speed is 109 Mbps and upload speed is 36 Mbps.

Not using a VM. I deployed Rasa X to the AWS EKS cluster with the Helm chart, following the process below:

  1. Requirements (AWS Requirements)
  2. Installation (AWS Installation)
  3. Installation (Helm Chart Installation)
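For reference, the Helm-chart deployment from those steps looks roughly like this (a hedged sketch: the repo URL is the public rasa-x-helm chart repository, `rasahr` is the namespace used in this thread, and `values.yml` is your own values file):

```shell
# Add the Rasa X Helm chart repository and install into the namespace:
helm repo add rasa-x https://rasahq.github.io/rasa-x-helm
helm repo update
helm --namespace rasahr install --values values.yml rasa-x rasa-x/rasa-x
```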

The “Training failed” message appears immediately when I click to train the model; it does not even process for one second.
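Since the failure is instantaneous, the error may be logged by the rasa-x container (which serves the UI and dispatches training) rather than the worker. A hedged suggestion, with the pod name taken from the `get pods` output above (adjust to your release):

```shell
# Follow the Rasa X pod's logs while clicking "Train" in the UI:
kubectl --namespace rasahr logs -f rasa-x-1641746175-rasa-x-5c85fbbd49-nwxvw
```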

List of services running on AWS:

PS C:\WINDOWS\system32> kubectl --namespace rasahr get service
NAME                                              TYPE           CLUSTER-IP       EXTERNAL-IP                                                                PORT(S)                                 AGE
rasa-x-1641746175-app                             ClusterIP      10.100.182.236   <none>                                                                     5055/TCP,80/TCP                         21h
rasa-x-1641746175-db-migration-service-headless   ClusterIP      None             <none>                                                                     8000/TCP                                21h
rasa-x-1641746175-duckling                        ClusterIP      10.100.84.56     <none>                                                                     8000/TCP                                21h
rasa-x-1641746175-nginx                           LoadBalancer   10.100.15.95     a7a102abe80f24583b6247273bedc2a4-1408708577.ap-south-1.elb.amazonaws.com   8000:32128/TCP                          21h
rasa-x-1641746175-postgresql                      ClusterIP      10.100.231.180   <none>                                                                     5432/TCP                                21h
rasa-x-1641746175-postgresql-headless             ClusterIP      None             <none>                                                                     5432/TCP                                21h
rasa-x-1641746175-rabbit                          ClusterIP      10.100.136.197   <none>                                                                     4369/TCP,5672/TCP,25672/TCP,15672/TCP   21h
rasa-x-1641746175-rabbit-headless                 ClusterIP      None             <none>                                                                     4369/TCP,5672/TCP,25672/TCP,15672/TCP   21h
rasa-x-1641746175-rasa-worker                     ClusterIP      10.100.113.220   <none>                                                                     5005/TCP                                21h
rasa-x-1641746175-rasa-x                          ClusterIP      10.100.235.253   <none>                                                                     5002/TCP                                21h
rasa-x-1641746175-redis-headless                  ClusterIP      None             <none>                                                                     6379/TCP                                21h
rasa-x-1641746175-redis-master                    ClusterIP      10.100.179.33    <none>                                                                     6379/TCP

For now, just the worker. And do it right after training fails.

If nothing shows try adding --previous to the command.

PS C:\WINDOWS\system32> kubectl --namespace rasahr logs rasa-x-1641746175-rasa-worker-6dcc468fd-f5kqm --previous
Error from server (BadRequest): previous terminated container "rasa-x" in pod "rasa-x-1641746175-rasa-worker-6dcc468fd-f5kqm" not found
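The `--previous` flag only works if the container has restarted at least once; with `RESTARTS` at 0 (as in the `get pods` output above) there is no terminated container to read, hence the error. A couple of hedged alternatives:

```shell
# The pod has two containers (init-db and rasa-x, per the describe
# output); target the app container explicitly:
kubectl --namespace rasahr logs \
  rasa-x-1641746175-rasa-worker-6dcc468fd-f5kqm -c rasa-x

# Recent cluster events can also reveal failures that never reach
# container logs:
kubectl --namespace rasahr get events --sort-by=.metadata.creationTimestamp
```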

With `kubectl describe pods`:

PS C:\WINDOWS\system32> kubectl --namespace rasahr describe pods rasa-x-1641746175-rasa-worker-6dcc468fd-f5kqm
Name:         rasa-x-1641746175-rasa-worker-6dcc468fd-f5kqm
Namespace:    rasahr
Priority:     0
Node:         ip-192-168-19-85.ap-south-1.compute.internal/192.168.19.85
Start Time:   Sun, 09 Jan 2022 22:06:17 +0530
Labels:       app.kubernetes.io/component=rasa-worker
              app.kubernetes.io/instance=rasa-x-1641746175
              app.kubernetes.io/name=rasa-x
              pod-template-hash=6dcc468fd
Annotations:  checksum/rasa-config: 4d99db7beb15d9c7065c913c33e6d17c813cd846c037ba4e710c4a145d54fb48
              checksum/rasa-secret: 52dda2e77938832263b5699996ee0f7054ed130539453b9e7bbde03272b6411f
              kubernetes.io/psp: eks.privileged
Status:       Running
IP:           192.168.31.75
IPs:
  IP:           192.168.31.75
Controlled By:  ReplicaSet/rasa-x-1641746175-rasa-worker-6dcc468fd
Init Containers:
  init-db:
    Container ID:  docker://0b19460978031276d5af9553489d91a1c6762f5adca4c0f526e36d331606dda8
    Image:         rasa/rasa:2.8.15-full
    Image ID:      docker-pullable://rasa/[email protected]:c6cdf4218b1017abbfcca70df9c842602e2398a3c1191962a7c7eb3d4e6e974b
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
      -c
      until [[ "$(curl -s http://rasa-x-1641746175-db-migration-service-headless:8000 | grep -c completed)" == "1" ]]; do STATUS=$(curl -s http://rasa-x-1641746175-db-migration-service-headless:8000); if [[ -n "$STATUS" ]];then echo $STATUS; fi; sleep 5; done;
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Sun, 09 Jan 2022 22:07:38 +0530
      Finished:     Sun, 09 Jan 2022 22:13:46 +0530
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:         <none>
Containers:
  rasa-x:
    Container ID:  docker://b75e06c42f8a3dfcb9334d30ac6c06f7d64e9f9766496278a23659c5389ea450
    Image:         rasa/rasa:2.8.15-full
    Image ID:      docker-pullable://rasa/[email protected]:c6cdf4218b1017abbfcca70df9c842602e2398a3c1191962a7c7eb3d4e6e974b
    Port:          5005/TCP
    Host Port:     0/TCP
    Args:
      x
      --no-prompt
      --production
      --config-endpoint
      http://rasa-x-1641746175-rasa-x.rasahr.svc:5002/api/config?token=$(RASA_X_TOKEN)
      --port
      5005
      --jwt-method
      HS256
      --jwt-secret
      $(JWT_SECRET)
      --auth-token
      $(RASA_TOKEN)
      --cors
      *
    State:          Running
      Started:      Sun, 09 Jan 2022 22:13:49 +0530
    Ready:          True
    Restart Count:  0
    Liveness:       http-get http://:http/ delay=10s timeout=1s period=10s #success=1 #failure=10
    Environment:
      MPLCONFIGDIR:            /tmp/.matplotlib
      DB_PASSWORD:             <set to the key 'postgresql-password' in secret 'rasa-x-1641746175-postgresql'>  Optional: false
      DB_DATABASE:             worker_tracker
      RASA_X_TOKEN:            <set to the key 'rasaXToken' in secret 'rasa-x-1641746175-rasa'>           Optional: false
      RASA_TOKEN:              <set to the key 'rasaToken' in secret 'rasa-x-1641746175-rasa'>            Optional: false
      JWT_SECRET:              <set to the key 'jwtSecret' in secret 'rasa-x-1641746175-rasa'>            Optional: false
      REDIS_PASSWORD:          <set to the key 'redis-password' in secret 'rasa-x-1641746175-redis'>      Optional: false
      RABBITMQ_PASSWORD:       <set to the key 'rabbitmq-password' in secret 'rasa-x-1641746175-rabbit'>  Optional: false
      RABBITMQ_QUEUE:          rasa_production_events
      RASA_ENVIRONMENT:        worker
      RASA_MODEL_SERVER:       http://rasa-x-1641746175-rasa-x.rasahr.svc:5002/api/projects/default/models/tags/production
      RASA_DUCKLING_HTTP_URL:  http://rasa-x-1641746175-duckling.rasahr.svc:8000
    Mounts:
      /.config from config-dir (rw)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  config-dir:
    Type:        EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:   <unset>
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:          <none>

This is possibly related to the issue I still have here