Has anyone had success migrating a zh-CN Dialogflow agent to Rasa?
With the default config, the trained model appears to detect intents correctly, but it does not extract entities the way Dialogflow does.
The config I used is below:
language: zh-CN
pipeline:
- name: WhitespaceTokenizer
- name: RegexFeaturizer
- name: LexicalSyntacticFeaturizer
- name: CountVectorsFeaturizer
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: DIETClassifier
  epochs: 100
  constrain_similarities: true
- name: EntitySynonymMapper
- name: ResponseSelector
  epochs: 100
  constrain_similarities: true
- name: FallbackClassifier
  threshold: 0.3
  ambiguity_threshold: 0.1
Training does print the following warning:
/lib/python3.8/site-packages/rasa/shared/utils/io.py:97: UserWarning: Misaligned entity annotation in message '早上吃面条' with intent 'diabetestalk.agent.log_unknown_food'. Make sure the start and end values of entities ([(0, 2, '早上'), (2, 3, '吃'), (3, 5, '面条')]) in the training data match the token boundaries ([(0, 5, '早上吃面条')]). Common causes:
- entities include trailing whitespaces or punctuation
- the tokenizer gives an unexpected result, due to languages such as Chinese that don't use whitespace for word separation
More info at Training Data Format
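To make the warning concrete, here is a minimal sketch in plain Python of what it seems to be describing. It is not Rasa's actual implementation: the helper names `whitespace_token_spans` and `misaligned_entities` are hypothetical, and the only inputs are the message and entity offsets quoted in the warning above.

```python
import re


def whitespace_token_spans(text):
    """Return (start, end, token) for each run of non-whitespace characters."""
    return [(m.start(), m.end(), m.group()) for m in re.finditer(r"\S+", text)]


def misaligned_entities(text, entities):
    """Return the entities whose start or end is not on a token boundary."""
    tokens = whitespace_token_spans(text)
    starts = {start for start, _, _ in tokens}
    ends = {end for _, end, _ in tokens}
    return [e for e in entities if e[0] not in starts or e[1] not in ends]


# Message and entity annotations taken from the warning.
text = "早上吃面条"
entities = [(0, 2, "早上"), (2, 3, "吃"), (3, 5, "面条")]

# The sentence contains no spaces, so splitting on whitespace yields one token.
print(whitespace_token_spans(text))
# None of the three entities lines up with that single token's boundaries.
print(misaligned_entities(text, entities))
```

If the same words were separated by spaces, the token boundaries would match the annotations and the second call would return an empty list. That suggests the entity extractor needs a tokenizer that splits Chinese text into words rather than one that relies on whitespace.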
Instead of the default WhitespaceTokenizer, I also tried the Chinese tokenizer Jieba. My config is below:
language: "zh"
pipeline:
- name: "MitieNLP"
  model: "data/total_word_feature_extractor_zh.dat"
- name: "JiebaTokenizer"
- name: "MitieEntityExtractor"
- name: "EntitySynonymMapper"
- name: "RegexFeaturizer"
- name: "MitieFeaturizer"
- name: "SklearnIntentClassifier"
- name: ResponseSelector
  epochs: 100
  constrain_similarities: true
With this configuration, training does not seem to run properly: it finishes very quickly, and the resulting model can't detect any intents or entities.
Please help!