Rasa X says training failed even though it didn't

Version: Rasa X 0.37.1, Rasa-X Helm chart installation

rasax:
    # tag refers to the Rasa X image tag
    tag: "0.37.1"
# rasa: Settings common to all Rasa containers
rasa:
    # tag refers to the Rasa image tag
    tag: "2.3.1-full"
    additionalChannelCredentials:
        rest:

When I set a model to train over the UI (Training → Update Model → Train Model), I get an error message telling me training failed, and nothing else:

[screenshot: “Training failed” error message]

But if I check the logs in the production container, I see that the training keeps on going.

If I check the Rasa X Models section after a while, I see a new model there. If I activate it, it works perfectly even though it “failed”.

Same with me.

Rasa X 0.39.3 and Rasa 2.4.3-full, OpenShift installation

  1. Rasa X, training reports “failed”
  2. After a few minutes model shows up in model tab
  3. I activate the new trained model (that was reported failed!)
  4. Conversations perform as I was expecting

So, is there some threshold for this failure signal to appear, or is it a bug?

Note: on the CLI on the Linux server, training a model with the same training data and config gives me no training failure. I only get one warning when validating the data, i.e. the retrieval utterance has no matching response (known bug, see “Response validation gives incorrect warning for sub-intent responses” · Issue #8070 · RasaHQ/rasa · GitHub)

2021-05-31 10:57:34 INFO rasa.validator - Validating utterances...
/srv/shared/huijh03/venv_rasa2.3/lib64/python3.6/site-packages/rasa/shared/utils/io.py:93: UserWarning: The action 'utter_ict_faq' is used in the stories, but is not a valid utterance action. Please make sure the action is listed in your domain and there is a template defined with its name. More info at Actions
Project validation completed with errors.


I also get a warning about a retrieval intent:

UserWarning: Action 'utter_faq' is listed as a response action in the domain file, but there is no matching response defined. Please check your domain.

This is probably related then.
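For anyone comparing their domain against this warning: in Rasa 2.x, responses for a retrieval intent are keyed per sub-intent (`utter_<intent>/<sub_intent>`), and the validator in the linked issue flags the bare retrieval action even when those keys exist. A minimal sketch of the expected domain layout, with made-up intent and response names:

responses:
    # sub-intent responses for a retrieval intent "faq";
    # there is intentionally no bare "utter_faq" entry
    utter_faq/opening_hours:
        - text: "We are open from 9 to 5."
    utter_faq/contact:
        - text: "You can reach us at support@example.com."

If your domain looks like this and the bot answers correctly at runtime, the warning should be the known false positive from Issue #8070 rather than a real problem.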

Does anybody who read these posts have a comment, tip, or reply?

I’ve upgraded Rasa X to 0.40 and Rasa Open Source to 2.6, and the problem got worse. Now I get the “Training failed” message, but the model doesn’t appear in the models list after a while. I did some digging and found the following errors and logs.

  1. In the browser dev tools:
Request URL:
    https://chatbot.url/api/projects/default/models/jobs
Request Method:
    POST
Status Code:
    504 Gateway Time-out
Remote Address:
    10.10.43.190:443
Referrer Policy:
    strict-origin-when-cross-origin
  2. In the ingress-controller logs:
2021/06/30 14:16:52 [error] 5667#5667: *59921060 upstream timed out (110: Operation timed out) while reading response header from upstream, client: 10.10.43.109, server: chatbot.url, request: "POST /api/projects/default/models/jobs HTTP/1.1", upstream: "http://10.42.1.136:8080/api/projects/default/models/jobs", host: "chatbot.url", referrer: "https://chatbot.url/models"

10.10.43.109 - - [30/Jun/2021:14:16:52 +0000] "POST /api/projects/default/models/jobs HTTP/1.1" 504 562 "https://chatbot.url/models" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36" 6349 60.002 [chatbot-train-chatbot-train-rasa-x-nginx-8000] [] 10.42.1.136:8080 0 60.001 504 cd84db6a183ada7f94d9d83d6fca1859
  3. In the rasa-x-nginx logs:
10.42.2.0 - - [30/Jun/2021:14:34:02 +0000] "POST /api/projects/default/models/jobs HTTP/1.1" 500 148 "https://chatbot.url/models" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"

Note that at the time the ingress-controller logs a 504 (Gateway Timeout) there has not been a response from rasa-x-nginx yet.

The 500 (Internal Server Error) logged by the rasa-x-nginx is probably unrelated. It only started happening recently and happens when the rasa-x-worker pod crashes (17 minutes later). I will open a separate post about this.
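A detail worth noting: the `60.002` request time in the ingress-controller log line matches the NGINX ingress controller's default read timeout of 60 seconds. The controller waits that long for response headers from the upstream before returning a 504, regardless of whether the upstream request (here, the training job) is still running. Roughly the nginx directives the controller generates per location (the values shown are the ingress-nginx defaults as I understand them, not something from my config):

proxy_connect_timeout 5s;
proxy_send_timeout    60s;
proxy_read_timeout    60s;   # training takes longer than this, hence the 504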

The ingress was throwing the timeout; a simple annotation fixed the 504:

ingress:
    hosts:
        [...]
    annotations:
        nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
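If long-running requests can also occur in the other direction, the matching send timeout can be raised alongside it (both annotation names come from the ingress-nginx documentation; "3600" is just a generous upper bound, not a recommended value):

ingress:
    annotations:
        nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
        nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"

After changing the values file, re-run `helm upgrade` with it so the ingress object gets patched.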