Slow Training Chinese NLU Model

Hi guys, I am training a Rasa NLU model in Chinese, and the training time is very long. I think it is due to sentence parsing. The training data contains 600 examples with 10 intents and 11 labels, and training takes more than 15 hours to finish. Is there any way to speed up this process? We need to update the model regularly. Thanks!

You can try ‘spacy’ instead of ‘mitie’ in your model configuration file.

@howlanderson you might have some insight on this as well?

Really? But I think spaCy doesn’t support Chinese.

@Qianyue I have a project that builds spaCy models for Chinese at https://github.com/howl-anderson/Chinese_models_for_SpaCy. I hope it can solve your problem.

@akelad OK

Can you give more details about how you train your Chinese NLU model? The shell command or script contents would be very helpful. Usually it doesn’t take much time, but if you are building a MITIE model, that will take a very long time. So if you can provide more details about how you train the model, we can locate the problem more quickly.

Thank you!! I will definitely try this method. Did you use this model to build a Rasa NLU model and make a dialogue bot?

I am using an existing MITIE model for testing, but the Rasa NLU training time is still long with the data size I described above. And we will add much more training data later.

I just followed the official guide to build the model, using the pre-configured MITIE pipeline but replacing the tokenizer with 'tokenizer_jieba'. The pipeline is below (the training step itself is sketched after it):

language: "zh"
pipeline:
- name: "nlp_mitie"
  model: "data/total_word_feature_extractor_zh.dat"
- name: "tokenizer_jieba"
- name: "ner_mitie"
- name: "ner_synonyms"
- name: "intent_entity_featurizer_regex"
- name: "intent_featurizer_mitie"
- name: "intent_classifier_sklearn"
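
The training step itself is basically the standard script from the official guide; something like this (a sketch assuming the pre-1.0 rasa_nlu package, with placeholder file paths):

from rasa_nlu import config
from rasa_nlu.model import Trainer
from rasa_nlu.training_data import load_data

# Load the NLU examples and the pipeline configuration shown above.
training_data = load_data("data/nlu_data.json")   # placeholder path to the training data
trainer = Trainer(config.load("config.yml"))      # the config file containing the pipeline above
trainer.train(training_data)                      # this is the step that takes many hours
model_directory = trainer.persist("./models")     # directory where the trained model is saved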

That’s weird. Can you provide more details about your computer: OS, CPU, memory, hard disk type (SSD?)? And what versions of Python and Rasa NLU are you using?

Sure! I am using a MacBook Pro 2017 with a 3.1 GHz Intel Core i5 CPU and 8 GB of memory. Do you know the average training time for the Jieba-MITIE model? Thanks!

I didn’t record the training time, but I remember it was pretty short. Can you train your model on the Weather (Chinese version) dataset from https://github.com/howl-anderson/NLU_benchmark_dataset, so we can use the same dataset and find out why?

Thanks!! I just tried this dataset; the training time was 1396 seconds. I am now wondering whether the problem might be the distribution of my training data.

I still need more details. Please add my WeChat account: here-we-meet; using IM to communicate is a better choice for this case.

Hi, did you figure out the cause? I am also facing the same issue here.

It seems Qianyue Zhang hasn’t contacted me yet. @libindavis I need more details; please add my WeChat account: here-we-meet, so I can take a closer look at your case.

Hi Anderson, after a full day of debugging and searching Google, I think the cause is that MITIE runs slowly when training for entity extraction. I used 'ner_mitie' for NER, and in Part II it takes a very long time (at least 10 hours and still not finished) to train with just 558 examples for 16 labels. But when I tried training 58 examples for 2 labels, training completed very quickly (less than 1 minute). So I think the root cause is ner_mitie.
Do you know if there is an alternative pipeline component for doing NER on Chinese entities? ner_mitie is not a good candidate considering its long training time.
Thanks in advance. The training log is below:

2019-08-06 23:38:25-0700 [-] Prefix dict has been built succesfully.
Training to recognize 16 labels: 'GE', 'prenatal', 'sick_leave', 'parental', 'benefits', 'anual_vacation', 'employ_cert', 'insurance', 'extra_insurance_people', 'extra_insurance', 'bereavement_leave', 'red_envelope', 'bereavement_envelope', 'acronym', 'GearStore', 'location'
Part I: train segmenter
words in dictionary: 200000
num features: 271
now do training
C:           20
epsilon:     0.01
num threads: 1
cache size:  5
max iterations: 2000
loss per missed segment:  3
C: 20   loss: 3         0.974638
C: 35   loss: 3         0.976449
C: 20   loss: 4.5       0.976449
C: 5   loss: 3  0.969203
C: 20   loss: 1.5       0.971014
C: 35   loss: 4.5       0.978261
C: 35   loss: 5.25      0.976449
C: 38   loss: 4.65      0.976449
C: 31.4985   loss: 4.36577      0.976449
C: 35.5866   loss: 4.36195      0.976449
C: 34.0102   loss: 4.65925      0.976449
C: 34.7478   loss: 4.50824      0.976449
best C: 35
best loss: 4.5
num feats in chunker model: 4095
train: precision, recall, f1-score: 0.991007 0.996383 0.993688
Part I: elapsed time: 244 seconds.

Part II: train segment classifier
now do training
num training samples: 558
C: 200   f-score: 0.99163
C: 400   f-score: 0.99163
C: 300   f-score: 0.99163
C: 100   f-score: 0.99163
C: 0.01   f-score: 0.954282
C: 600   f-score: 0.99163
C: 1400   f-score: 0.99163
C: 3000   f-score: 0.99163

Hi @libindavis, I now use my own component package: rasa_contrib (https://github.com/howl-anderson/rasa_contrib), an add-on package for Rasa that provides some useful/powerful additional components, including some (almost) SOTA ones such as BiLSTM+CRF for NER and TextCNN for intent classification; a BERT-based component is coming. It is a good fit for users who have many training examples.
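
If you prefer to stay with the stock Rasa NLU components, the built-in CRF extractor (ner_crf) is another option in place of ner_mitie. A rough sketch of the same pipeline with it swapped in (untested for your data; the regex featurizer is moved before ner_crf here so that, as far as I know, the CRF can use its pattern feature):

language: "zh"
pipeline:
- name: "nlp_mitie"
  model: "data/total_word_feature_extractor_zh.dat"
- name: "tokenizer_jieba"
- name: "intent_entity_featurizer_regex"
- name: "ner_crf"
- name: "ner_synonyms"
- name: "intent_featurizer_mitie"
- name: "intent_classifier_sklearn"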

I found that in the source code of MitieEntityExtractor, the train method can take a parameter "num_threads". However, I found no way to set it from the command-line interface. Would it help to speed up training?
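
For context, the underlying MITIE trainer does expose that knob when you drive it directly with the mitie Python package; a minimal standalone sketch (the sentence, entity span, and label below are made-up placeholders, and the feature-extractor path is the one from this thread):

import mitie

# One annotated training sentence (already tokenized; placeholder example).
sample = mitie.ner_training_instance(["我", "想", "申请", "年假"])
sample.add_entity(range(3, 4), "anual_vacation")   # token 3 marks the entity (placeholder span/label)

trainer = mitie.ner_trainer("data/total_word_feature_extractor_zh.dat")
trainer.add(sample)
trainer.num_threads = 4            # the same knob the MitieEntityExtractor train method accepts
ner = trainer.train()              # the slow part; more threads should speed it up
ner.save_to_disk("ner_model.dat")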

Friend, I have exactly the same problem as you. Have you solved it on your side? I tried it, and the slowness is caused by this pipeline component, although the entities it extracts are indeed the most accurate. If there is another option that can replace it, please reply. Thanks! I forgot to write which pipeline component; it is this one: MitieEntityExtractor.