1.3 Training Crash during EmbeddingIntentClassifier

Model training is crashing on me under 1.3.9 during the EmbeddingIntentClassifier phase. Here are the last couple of lines:

2019-10-14 15:40:44 INFO     rasa.nlu.model  - Starting to train component CountVectorsFeaturizer
2019-10-14 15:40:45 INFO     rasa.nlu.model  - Finished training component.
2019-10-14 15:40:45 INFO     rasa.nlu.model  - Starting to train component EmbeddingIntentClassifier
Epochs:   0%|          | 1/300 [00:14<1:12:17, 14.51s/it, loss=62.778, acc=0.237]kyc-rasax >

If I run the training under 1.2.11, there is no failure.

I changed the config from:

pipeline: supervised_embeddings

to:

pipeline:
 - name: "WhitespaceTokenizer"
 - name: "RegexFeaturizer"
 - name: "CRFEntityExtractor"
 - name: "EntitySynonymMapper"
 - name: "CountVectorsFeaturizer"
 - name: "EmbeddingIntentClassifier"
   num_neg: 300

But the training still crashes.


I’ll try to reproduce this. I’m assuming there’s nothing too crazy in the training set either, right?

So this basic config from the Rasa Demo bot seems to be OK; I’m going to adjust it to your settings now and see what I get.

language: en
pipeline:
- name: WhitespaceTokenizer
- name: CRFEntityExtractor
- name: CountVectorsFeaturizer
  OOV_token: oov
  token_pattern: (?u)\b\w+\b
- name: EmbeddingIntentClassifier
  epochs: 50
- name: DucklingHTTPExtractor
  url: http://localhost:8000
  dimensions:
  - email
  - number
  - amount-of-money
- name: EntitySynonymMapper

policies:
- epochs: 50
  max_history: 6
  name: KerasPolicy
- max_history: 6
  name: AugmentedMemoizationPolicy
- core_threshold: 0.3
  name: TwoStageFallbackPolicy
  nlu_threshold: 0.8
- name: FormPolicy
- name: MappingPolicy

Thanks, Brian. I tried your pipeline and got the same failure with the EmbeddingIntentClassifier. I have no problems using 1.3.9 with a smaller bot of mine (19 unique intents), but with this larger bot (279 intents) it fails.

I’m stuck on 1.2.11 until I can figure this out. It may be time for a GitHub issue.

I’ve opened issue #4616. I also think this could be related to the hyperparameter changes between 1.2 and 1.3 discussed in #4540.

What is the error?

No actual error. It sits at the 0% message for 20 seconds and stops.

I’ve played around with the hyperparameters (num_neg and batch_strategy: sequence) a little bit and got it to 2% once, but no further; the overrides I tried are sketched below.

Epochs:   0%|          | 0/300 [00:00<?, ?it/s]

Here’s a pastebin of the full training session with verbose logging.
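
Roughly what those overrides looked like in config.yml (the exact values varied between runs, so treat this as an illustration rather than a recommendation):

pipeline:
 # ...other components unchanged from the pipeline above...
 - name: "EmbeddingIntentClassifier"
   # fewer negative examples per training example than the 300 I started with
   num_neg: 20
   # feed batches in sequence order instead of the default balanced strategy
   batch_strategy: sequence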

Looking through the logs, I just noticed a difference between the 1.2.11 and 1.3.9 output.

Here is my training command:

#export RASA_VERS=1.2.11-full
export RASA_VERS=1.3.9-full
docker run -v $(pwd):/app rasa/rasa:${RASA_VERS} train --config /app/data/config.yml --out /app/models --domain /app/data/domain.yml --data /app/data/training /app/data/stories -vv

The NLU training data loader reports the format as 'unk' in 1.3.9 vs. 'md' in 1.2.11.

1.2.11 messages related to the training data format:

2019-10-16 14:37:06 DEBUG    rasa.nlu.training_data.loading  - Training data format of '/app/data/training/xcen.md' is 'md'.
2019-10-16 14:37:06 DEBUG    rasa.nlu.training_data.loading  - Training data format of '/app/data/training/axl.md' is 'md'.
2019-10-16 14:37:06 DEBUG    rasa.nlu.training_data.loading  - Training data format of '/app/data/training/main.md' is 'md'.
2019-10-16 14:37:06 DEBUG    rasa.nlu.training_data.loading  - Training data format of '/app/data/training/glossary.md' is 'md'.
2019-10-16 14:37:06 DEBUG    rasa.nlu.training_data.loading  - Training data format of '/app/data/training/axl_faq.md' is 'md'.
2019-10-16 14:37:06 DEBUG    rasa.nlu.training_data.loading  - Training data format of '/app/data/training/faq.md' is 'md'.

1.3.9 messages:

2019-10-16 14:19:31 DEBUG    rasa.nlu.training_data.loading  - Training data format of '/app/data/stories/glossary.md' is 'unk'.
2019-10-16 14:19:31 DEBUG    rasa.nlu.training_data.loading  - Training data format of '/app/data/stories/faq.md' is 'unk'.
2019-10-16 14:19:31 DEBUG    rasa.nlu.training_data.loading  - Training data format of '/app/data/stories/main.md' is 'unk'.
2019-10-16 14:19:31 DEBUG    rasa.nlu.training_data.loading  - Training data format of '/app/data/stories/axl_faq.md' is 'unk'.
2019-10-16 14:19:31 DEBUG    rasa.nlu.training_data.loading  - Training data format of '/app/data/stories/xcen.md' is 'unk'.
2019-10-16 14:19:31 DEBUG    rasa.nlu.training_data.loading  - Training data format of '/app/data/stories/axl.md' is 'unk'.

By "stops", do you mean it runs out of memory?

I don’t see any error message related to memory. My machine has 12 GB of memory, and my Docker engine had a default allocation of 2 GB. I increased that to 6 GB, restarted the Docker engine, ran the training again, and it worked. Thanks!

I also do training on a separate CI/CD system that doesn’t have this much memory. Is there a set of parameters that will make the EmbeddingIntentClassifier behave the same as it does in 1.2.11? I’ve tried the batch_strategy: sequence option, but it still requires more memory.
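
To make the question concrete, this is the kind of override I’ve been experimenting with to keep memory down (illustrative values only; none of these have reproduced the 1.2.11 memory profile for me yet):

pipeline:
 # ...other components unchanged...
 - name: "EmbeddingIntentClassifier"
   # smaller batch sizes to lower peak memory
   batch_size: [32, 64]
   # fewer negative samples per example
   num_neg: 10
   batch_strategy: sequence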

I saw your post under issue #4540 about the char-level count vectorizer. Is there a hyperparameter to disable or configure it?

Yes, please take a look at the NLU pipeline docs (Choosing a Pipeline): you need to remove the char-level count vectorizer from the config.
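
For reference, as far as I can tell from the docs, the 1.3 supervised_embeddings shortcut expands to roughly the pipeline below (double-check against your exact Rasa version). The second CountVectorsFeaturizer, the char_wb one, is the component to drop if memory is the issue:

language: "en"
pipeline:
- name: "WhitespaceTokenizer"
- name: "RegexFeaturizer"
- name: "CRFEntityExtractor"
- name: "EntitySynonymMapper"
# word-level count vectors
- name: "CountVectorsFeaturizer"
# char-level count vectors: this is the one to remove
- name: "CountVectorsFeaturizer"
  analyzer: "char_wb"
  min_ngram: 1
  max_ngram: 4
- name: "EmbeddingIntentClassifier"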
