Config for FAQ Bot in Chinese

Hello,

My latest project is a chatbot in Chinese that should only answer simple FAQs. I already have the first intents and stories together, but I am still missing the right config. I found some blog articles and repos (https://github.com/crownpku/Rasa_NLU_Chi and the Rasa Blog post "Non-English Tools for Rasa NLU"), but I couldn't get them to work: for example, I do not have the file data/total_word_feature_extractor_zh.dat, and I ran into other problems as well.

Could someone share a simple config for a FAQ bot in Chinese?

Rasa version:

Rasa Version      :         3.0.6
Minimum Compatible Version: 3.0.0
Rasa SDK Version  :         3.0.4
Rasa X Version    :         None
Python Version    :         3.7.2
Operating System  :         Darwin-20.6.0-x86_64-i386-64bit
Python Path       :         /Users/theresa/.pyenv/versions/3.7.2/bin/python3.7

Thanks! Theresa

@threxx - I am not sure you will find an optimized pipeline for Chinese that works out of the box, so you will have to fine-tune it for your data. Here's a simple one that worked for me for intent classification.

P.S. This is for Simplified Chinese.

# Configuration for Rasa NLU.
# https://rasa.com/docs/rasa/nlu/components/
language: zh

pipeline:
  - name: JiebaTokenizer
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
    OOV_token: "oov"
  - name: CountVectorsFeaturizer
    analyzer: char_wb
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 100
  - name: duckling_extractor.duckling.DucklingEntityExtractor
    # url of the running duckling server
    url: "http://localhost:8000"
    # dimensions to extract
    dimensions:
      [
        "time",
        "number",
        "amount-of-money",
        "distance",
        "sys-number",
        "sys-currency",
      ]
    # if not set the default timezone of Duckling is going to be used
    # needed to calculate dates from relative expressions like "tomorrow"
    timezone: "Europe/Berlin"
    # Timeout for receiving response from http url of the running duckling server
    # if not set the default timeout of duckling http url is set to 3 seconds.
    timeout: 3
    locale: "zh_ZH"
  - name: FallbackClassifier
    threshold: 0.3
    ambiguity_threshold: 0.1
policies:
# # No configuration for policies was provided. The following default policies were used to train your model.
# # If you'd like to customize them, uncomment and adjust the policies.
# # See https://rasa.com/docs/rasa/policies for more information.
#   - name: MemoizationPolicy
#   - name: TEDPolicy
#     max_history: 5
#     epochs: 100
#     constrain_similarities: true
#   - name: RulePolicy
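
For a pure FAQ bot, the Duckling step above is optional, since it only extracts entities and needs a separately running Duckling server. Here is a minimal sketch of a trimmed-down variant, assuming the FAQs are modeled as a single retrieval intent. The intent name `faq` is hypothetical, and the ResponseSelector step is an addition that is not part of the config above. JiebaTokenizer also needs the jieba Python package installed.

```yaml
# Sketch only: trimmed variant of the pipeline above for FAQ-style intent classification.
recipe: default.v1
language: zh

pipeline:
  - name: JiebaTokenizer
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: char_wb
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 100
  # Assumption: all FAQs are grouped under a retrieval intent named "faq" (hypothetical name).
  # Drop this step if you use one plain intent per question instead.
  - name: ResponseSelector
    epochs: 100
    retrieval_intent: faq
  - name: FallbackClassifier
    threshold: 0.3
    ambiguity_threshold: 0.1

policies:
  # RulePolicy alone is usually enough when each FAQ maps to a fixed answer via rules.
  - name: RulePolicy
```

The epochs and thresholds are carried over from the config above and will likely need tuning on your own data.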

@souvikg10 thanks for your quick response. I just tested it with my few intents and it seems to work.

I found this really useful config for Chinese:

recipe: default.v1

language: zh

pipeline:
- name: "MitieNLP"
  model: "/home/nlu/data/total_word_feature_extractor_zh.dat"
- name: "JiebaTokenizer"
  dictionary_path: "dict/"
  intent_tokenization_flag: false
  intent_split_symbol: "_"
  token_pattern: null  # YAML null; a bare None would be read as the string "None"
- name: "MitieFeaturizer" 
  pooling: "mean"
- name: "RegexFeaturizer"
- name: "CountVectorsFeaturizer"
- name: "CountVectorsFeaturizer"
  analyzer: "word"
  min_ngram: 1
  max_ngram: 4
- name: "CRFEntityExtractor"
  BILOU_flag: true
  "features": [
    ["low", "title", "upper"],
    ["bias","low","prefix5","prefix2","suffix5","suffix3","suffix2","upper","title","digit","pattern" ],
    ["low", "title", "upper"],
  ]
  max_iterations: 50
- name: "RegexEntityExtractor"
  case_sensitive: false
  use_lookup_tables: true
  use_regexes: true
  use_word_boundaries: true
- name: "SklearnIntentClassifier"
- name: "EntitySynonymMapper" 

policies:
- name: AugmentedMemoizationPolicy
- name: RulePolicy
  core_fallback_threshold: 0.3
  core_fallback_action_name: "action_default_fallback"
  enable_fallback_prediction: true
- name: TEDPolicy
  epochs: 100