Slow Training Chinese NLU Model

Hi guys, I am training a Rasa NLU model in Chinese, and the training time is very long. I think it is due to sentence parsing. The training data contains 600 examples with 10 intents and 11 labels, and training takes more than 15 hours to finish. Is there any way to speed up this process? We need to update the model regularly. Thanks!

You can try ‘spacy’ instead of ‘mitie’ in your model configuration file.

@howlanderson you might have some insight on this as well?

Really? But I think spaCy doesn’t support Chinese.

@Qianyue I have a project that builds spaCy models for Chinese at https://github.com/howl-anderson/Chinese_models_for_SpaCy. I hope it can solve your problem.

@akelad OK

Can you give more details about how you train your Chinese NLU model? The shell command or script contents would be very helpful. Usually it doesn’t take much time, but if you are building a MITIE model, that will take a very long time. So if you can provide more details about how you train the model, we can locate the problem more quickly.

Thank you!! I will definitely try this method. Did you use this model to build a Rasa NLU model and make a dialogue bot?

I am using an existing MITIE model for testing, but the Rasa NLU training time is still long with the data size I described above. And we will add much more training data later.

I just followed the official guide to build the model, using the pre-configured MITIE pipeline but replacing the tokenizer with 'tokenizer_jieba'. The pipeline is below (the training step itself is sketched after it):

language: "zh"
pipeline:
- name: "nlp_mitie"
  model: "data/total_word_feature_extractor_zh.dat"
- name: "tokenizer_jieba"
- name: "ner_mitie"
- name: "ner_synonyms"
- name: "intent_entity_featurizer_regex"
- name: "intent_featurizer_mitie"
- name: "intent_classifier_sklearn"
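
The training step itself is basically the standard script from the official guide; something like this (a sketch assuming the pre-1.0 rasa_nlu package, with placeholder file paths):

from rasa_nlu import config
from rasa_nlu.model import Trainer
from rasa_nlu.training_data import load_data

# Load the NLU examples and the pipeline configuration shown above.
training_data = load_data("data/nlu_data.json")   # placeholder path to the training data
trainer = Trainer(config.load("config.yml"))      # the config file containing the pipeline above
trainer.train(training_data)                      # this is the step that takes many hours
model_directory = trainer.persist("./models")     # directory where the trained model is saved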

That’s weird. Can you provide more details about your computer: OS, CPU, memory, hard disk type (SSD?)? And what versions of Python and Rasa NLU are you using?

Sure! I am using a MacBook Pro 2017 with a 3.1 GHz Intel Core i5 CPU and 8 GB of memory. Do you know the average training time for the Jieba-MITIE model? Thanks!

I didn’t record the training time, but I remember it was pretty short. Can you train your model on the Weather (Chinese version) dataset from https://github.com/howl-anderson/NLU_benchmark_dataset, so we can use the same dataset and find out why?

Thanks!! I just tried this dataset; the training time was 1396 seconds. I am now wondering whether the problem might be the distribution of my training data.

I still need more details. Please add my WeChat account: here-we-meet; using IM to communicate is a better choice for this case.

Hi, did you figure out the cause? I am also facing the same issue here.

It seems Qianyue Zhang hasn’t contacted me yet. @libindavis I need more details; please add my WeChat account: here-we-meet, so I can take a closer look at your case.

Hi Anderson, after a full day of debugging and searching Google, I think the cause is that MITIE runs slowly when training for entity extraction. I used 'ner_mitie' for NER, and in Part II it takes a very long time (at least 10 hours and still not finished) to train with just 558 examples for 16 labels. But when I tried training 58 examples for 2 labels, training completed very quickly (less than 1 minute). So I think the root cause is ner_mitie.
Do you know if there is an alternative pipeline component for doing NER on Chinese entities? ner_mitie is not a good candidate considering its long training time.
Thanks in advance. The training log is below:

2019-08-06 23:38:25-0700 [-] Prefix dict has been built succesfully.
Training to recognize 16 labels: 'GE', 'prenatal', 'sick_leave', 'parental', 'benefits', 'anual_vacation', 'employ_cert', 'insurance', 'extra_insurance_people', 'extra_insurance', 'bereavement_leave', 'red_envelope', 'bereavement_envelope', 'acronym', 'GearStore', 'location'
Part I: train segmenter
words in dictionary: 200000
num features: 271
now do training
C:           20
epsilon:     0.01
num threads: 1
cache size:  5
max iterations: 2000
loss per missed segment:  3
C: 20   loss: 3         0.974638
C: 35   loss: 3         0.976449
C: 20   loss: 4.5       0.976449
C: 5   loss: 3  0.969203
C: 20   loss: 1.5       0.971014
C: 35   loss: 4.5       0.978261
C: 35   loss: 5.25      0.976449
C: 38   loss: 4.65      0.976449
C: 31.4985   loss: 4.36577      0.976449
C: 35.5866   loss: 4.36195      0.976449
C: 34.0102   loss: 4.65925      0.976449
C: 34.7478   loss: 4.50824      0.976449
best C: 35
best loss: 4.5
num feats in chunker model: 4095
train: precision, recall, f1-score: 0.991007 0.996383 0.993688
Part I: elapsed time: 244 seconds.

Part II: train segment classifier
now do training
num training samples: 558
C: 200   f-score: 0.99163
C: 400   f-score: 0.99163
C: 300   f-score: 0.99163
C: 100   f-score: 0.99163
C: 0.01   f-score: 0.954282
C: 600   f-score: 0.99163
C: 1400   f-score: 0.99163
C: 3000   f-score: 0.99163

Hi @libindavis, I now use my own component package: rasa_contrib (https://github.com/howl-anderson/rasa_contrib), an add-on package for Rasa that provides some useful/powerful additional components, including some (almost) SOTA ones such as BiLSTM+CRF for NER and TextCNN for intent classification; a BERT-based component is coming. It is a good fit for users who have many training examples.
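
If you prefer to stay with the stock Rasa NLU components, the built-in CRF extractor (ner_crf) is another option in place of ner_mitie. A rough sketch of the same pipeline with it swapped in (untested for your data; the regex featurizer is moved before ner_crf here so that, as far as I know, the CRF can use its pattern feature):

language: "zh"
pipeline:
- name: "nlp_mitie"
  model: "data/total_word_feature_extractor_zh.dat"
- name: "tokenizer_jieba"
- name: "intent_entity_featurizer_regex"
- name: "ner_crf"
- name: "ner_synonyms"
- name: "intent_featurizer_mitie"
- name: "intent_classifier_sklearn"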

I found that in the source code of MitieEntityExtractor, the train method can take a parameter "num_threads". However, I found no way to set it from the command-line interface. Would it help to speed up training?
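
For context, the underlying MITIE trainer does expose that knob when you drive it directly with the mitie Python package; a minimal standalone sketch (the sentence, entity span, and label below are made-up placeholders, and the feature-extractor path is the one from this thread):

import mitie

# One annotated training sentence (already tokenized; placeholder example).
sample = mitie.ner_training_instance(["我", "想", "申请", "年假"])
sample.add_entity(range(3, 4), "anual_vacation")   # token 3 marks the entity (placeholder span/label)

trainer = mitie.ner_trainer("data/total_word_feature_extractor_zh.dat")
trainer.add(sample)
trainer.num_threads = 4            # the same knob the MitieEntityExtractor train method accepts
ner = trainer.train()              # the slow part; more threads should speed it up
ner.save_to_disk("ner_model.dat")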

Friend, I have exactly the same problem as you. Have you solved it on your side? I tried it, and the slowness is caused by this pipeline component, although the entities it extracts are indeed the most accurate. If there is another option that can replace it, please reply. Thanks! I forgot to write which pipeline component; it is this one: MitieEntityExtractor.