Cannot train a model on a Kubernetes cluster when the training data is around 47k examples

Dear All,

I installed Rasa on a Kubernetes cluster.

I’m trying to train an NLU file with around 47K examples, but I can’t. I don’t see any errors in rasa-x or rasa-worker; training runs for around 30 minutes and finishes without a trained model.

Please advise: is this caused by too much NLU data, or by something else?

Hi @nhha1602, are you training this in the rasa x UI? I doubt the problem is too much data. Can you see your data, configs & domain in the UI and do they all look correct?

Hi, I’m using Rasa for Vietnamese; my configuration is below. I tried training manually and found it runs out of memory (my server has 32 GB). The process was killed when it started to train CRFEntityExtractor.

I used the following packages for Vietnamese.

Please advise.

pip3 install spacy==2.2.3
pip3 install pyvi
pip3 install https://github.com/trungtv/vi_spacy/raw/master/packages/vi_spacy_model-0.2.1/dist/vi_spacy_model-0.2.1.tar.gz
language: vi_spacy_model
pipeline:
  - name: SpacyNLP
  - name: SpacyTokenizer
  - name: SpacyFeaturizer
  - name: RegexFeaturizer
  - name: CRFEntityExtractor
    features:
      - - low
        - title
        - upper
      - - bias
        - low
        - prefix5
        - prefix2
        - suffix5
        - suffix3
        - suffix2
        - upper
        - title
        - digit
        - pattern
      - - low
        - title
        - upper
  - name: EntitySynonymMapper
  - name: SklearnIntentClassifier

policies:
  - name: MemoizationPolicy
  - name: KerasPolicy
  - name: MappingPolicy
  - name: FallbackPolicy
    nlu_threshold: 0.4
    core_threshold: 0.3
    fallback_action_name: action_default_fallback

Log of rasa train:

Training NLU model...
2020-03-03 15:33:11 INFO     rasa.nlu.utils.spacy_utils  - Trying to load spacy model with name 'vi_spacy_model'

2020-03-03 15:33:11 INFO     rasa.nlu.components  - Added 'SpacyNLP' to component cache. Key 'SpacyNLP-vi_spacy_model'.

2020-03-03 15:36:22 INFO     rasa.nlu.training_data.training_data  - Training data stats:

        - intent examples: 700243 (8 distinct intents)
        - Found intents: 'support_business_system', 'greet', 'goodbye', 'confirm_yes', 'thank', 'confirm_no', 'more_support', 'change_acct_system'
        - Number of response examples: 0 (0 distinct response)
        - entity examples: 699814 (17 distinct entities)
        - found entities: 'ent_business', 'evr_syscitad', 'func_confirm_yes', 'func_unlock', 'evr_systfr', 'evr_syst24', 'func_support', 'func_confirm_no', 'func_change', 'ent_acct', 'func_exten', 'func_rspass', 'evr_sysswift', 'func_create', 'more_support', 'evr_sysway4', 'func_disable'

2020-03-03 15:37:09 INFO     rasa.nlu.model  - Starting to train component SpacyNLP

2020-03-03 15:44:18 INFO     rasa.nlu.model  - Finished training component.

2020-03-03 15:44:18 INFO     rasa.nlu.model  - Starting to train component SpacyTokenizer

2020-03-03 15:44:52 INFO     rasa.nlu.model  - Finished training component.

2020-03-03 15:44:52 INFO     rasa.nlu.model  - Starting to train component SpacyFeaturizer

2020-03-03 15:45:23 INFO     rasa.nlu.model  - Finished training component.

2020-03-03 15:45:23 INFO     rasa.nlu.model  - Starting to train component RegexFeaturizer

2020-03-03 15:45:23 INFO     rasa.nlu.model  - Finished training component.

2020-03-03 15:45:23 INFO     rasa.nlu.model  - Starting to train component CRFEntityExtractor

Killed

Thanks for the logs. It seems like the language model is loading fine. I can test it locally with your config and the language model and everything runs. Did you start noticing the problem as the amount of your data increased? What happens if you try the same thing with a small dataset?

Hi,

It is fine with a small dataset. I generated the data with Chatito; its output has 11M records in total, and I took about 70% of those. Chatito created a JSON file, which I converted to an md file before running rasa train.

So I think the issue is with my pipeline. Please help review my config and comment on the pipeline; if it is suitable for my language (Vietnamese), then I will try to increase my memory.

Thanks.

I’m not familiar with Vietnamese, so I can’t comment on the features for the CRFEntityExtractor (you’d probably know better than me whether those make sense), but using a Vietnamese-specific language model for tokenization/featurization is a good start. Note that Rasa 1.8 also introduces some other options for language models.
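For example, Rasa 1.8 added the DIETClassifier, which handles both intent classification and entity extraction in a single component, so it could replace SklearnIntentClassifier and CRFEntityExtractor in your pipeline. A sketch only — I haven’t tested this for Vietnamese, and the epoch count is just illustrative:

```yaml
language: vi_spacy_model
pipeline:
  - name: SpacyNLP
  - name: SpacyTokenizer
  - name: SpacyFeaturizer
  - name: RegexFeaturizer
  - name: DIETClassifier
    epochs: 100
  - name: EntitySynonymMapper
```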

What are you setting the memory to?

Hi,

I figured it out: I just reduced the number of examples in nlu.md and it worked. I also upgraded to Rasa 1.8. Please advise on options for language models.
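In case it helps others, one way to thin the data is to downsample the Chatito JSON output per intent before converting it to Markdown. A sketch only — the function name and the per-intent cap are mine, not from Rasa; it assumes the Rasa 1.x JSON format with `rasa_nlu_data` → `common_examples`, which is what Chatito’s Rasa adapter emits:

```python
import json
import random

def downsample_nlu(nlu_json, per_intent, seed=42):
    """Keep at most `per_intent` randomly chosen examples per intent
    from a Rasa 1.x NLU JSON dict (rasa_nlu_data -> common_examples)."""
    rng = random.Random(seed)
    examples = nlu_json["rasa_nlu_data"]["common_examples"]
    # Group examples by their intent label.
    by_intent = {}
    for ex in examples:
        by_intent.setdefault(ex["intent"], []).append(ex)
    # Sample down any intent that exceeds the cap; keep smaller intents whole.
    kept = []
    for exs in by_intent.values():
        kept.extend(rng.sample(exs, per_intent) if len(exs) > per_intent else exs)
    return {"rasa_nlu_data": {"common_examples": kept}}

# Usage: load the Chatito output, downsample, write back, then convert to md.
# with open("nlu.json") as f:
#     data = json.load(f)
# slim = downsample_nlu(data, per_intent=2000)
# with open("nlu_small.json", "w") as f:
#     json.dump(slim, f, ensure_ascii=False, indent=2)
```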

My memory is set to 16 GB.

Thanks and regards.

Glad you got it working. I’d recommend taking a look at your options here (you’ll have to check what works for Vietnamese), comparing pipelines that use different components, and seeing which gives you the best results. This will depend on your data, so the best approach is simply to experiment and see what works.