Migrating a Chinese Dialogflow agent to Rasa

Has anyone had success migrating a zh-CN Dialogflow agent to Rasa?

With the default config, the trained model appears to detect the intent correctly, but it does not extract entities the way Dialogflow does.

The config I used is below:

language: zh-CN
pipeline:
  - name: WhitespaceTokenizer
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: char_wb
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 100
    constrain_similarities: true
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 100
    constrain_similarities: true
  - name: FallbackClassifier
    threshold: 0.3
    ambiguity_threshold: 0.1

The training process does print the following warning:

/lib/python3.8/site-packages/rasa/shared/utils/io.py:97: UserWarning: Misaligned entity annotation in message '早上吃面条' with intent 'diabetestalk.agent.log_unknown_food'. Make sure the start and end values of entities ([(0, 2, '早上'), (2, 3, '吃'), (3, 5, '面条')]) in the training data match the token boundaries ([(0, 5, '早上吃面条')]). Common causes:

  1. entities include trailing whitespaces or punctuation
  2. the tokenizer gives an unexpected result, due to languages such as Chinese that don't use whitespace for word separation

More info at Training Data Format
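To make the warning concrete, here is a minimal Python sketch of the kind of boundary check it describes. This is a simplified illustration, not Rasa's actual implementation: the helper names `token_spans` and `misaligned` are hypothetical, and the word-level segmentation in the second case is written by hand as an assumption about what a Chinese tokenizer might produce, not real JiebaTokenizer output.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, text)


def token_spans(text: str, tokens: List[str]) -> List[Span]:
    """Locate each token in the text and record its character offsets."""
    spans = []
    pos = 0
    for tok in tokens:
        start = text.index(tok, pos)
        end = start + len(tok)
        spans.append((start, end, tok))
        pos = end
    return spans


def misaligned(entities: List[Span], tokens: List[Span]) -> List[Span]:
    """Return entities whose start or end does not fall on a token boundary."""
    starts = {start for start, _, _ in tokens}
    ends = {end for _, end, _ in tokens}
    return [e for e in entities if e[0] not in starts or e[1] not in ends]


text = "早上吃面条"
entities = [(0, 2, "早上"), (2, 3, "吃"), (3, 5, "面条")]

# Splitting on whitespace leaves the whole sentence as a single token,
# so none of the annotated entities line up with a token boundary.
whitespace_tokens = token_spans(text, text.split())
bad_with_whitespace = misaligned(entities, whitespace_tokens)

# Hand-written word segmentation (an assumption for illustration only).
word_tokens = token_spans(text, ["早上", "吃", "面条"])
bad_with_words = misaligned(entities, word_tokens)

print(whitespace_tokens)
print(bad_with_whitespace)
print(bad_with_words)
```

With whitespace splitting, the only token covers offsets 0 to 5, which matches the token boundaries reported in the warning, so every entity is flagged. With a word-level segmentation that agrees with the annotations, nothing is flagged.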

Instead of the default whitespace tokenizer, I also tried the Chinese tokenizer Jieba. Here is that config:

language: "zh"
pipeline:
  - name: "MitieNLP"
    model: "data/total_word_feature_extractor_zh.dat"
  - name: "JiebaTokenizer"
  - name: "MitieEntityExtractor"
  - name: "EntitySynonymMapper"
  - name: "RegexFeaturizer"
  - name: "MitieFeaturizer"
  - name: "SklearnIntentClassifier"
  - name: ResponseSelector
    epochs: 100
    constrain_similarities: true

With this configuration, training does not seem to run properly at all: it finishes very quickly, and the resulting model can't detect any intents or entities.

Please help!

You can use an empty project to verify your new config. If it works fine there, you need to check your corpus.

Thanks for the suggestion. When creating a new project, the default sample data is set up in English, yet the config I'm trying to run needs to work for Chinese. Any idea how to create a new project in Chinese?

If the config works for English, you only need to change the corpus to Chinese. You can refer to the project https://github.com/Dustyposa/rasa_ch_faq, which uses BERT.
