@threxx - I'm not sure you'll find an optimized pipeline for Chinese that works out of the box, so you will have to fine-tune it for your data. Here's a simple one that worked for me for intent classification.
P.S. This is for Simplified Chinese.
# Configuration for Rasa NLU.
# https://rasa.com/docs/rasa/nlu/components/
language: zh
pipeline:
- name: JiebaTokenizer
- name: RegexFeaturizer
- name: LexicalSyntacticFeaturizer
- name: CountVectorsFeaturizer
  OOV_token: "oov"
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: DIETClassifier
  epochs: 100
- name: duckling_extractor.duckling.DucklingEntityExtractor
  # URL of the running Duckling server
  url: "http://localhost:8000"
  # dimensions to extract
  dimensions:
    [
      "time",
      "number",
      "amount-of-money",
      "distance",
      "sys-number",
      "sys-currency",
    ]
  # if not set, Duckling's default timezone is used
  # needed to calculate dates from relative expressions like "tomorrow"
  timezone: "Europe/Berlin"
  # timeout in seconds for the response from the Duckling server
  # if not set, the default timeout of 3 seconds is used
  timeout: 3
  # language_REGION locale passed to Duckling; zh_CN selects Simplified Chinese
  locale: "zh_CN"
- name: FallbackClassifier
  threshold: 0.3
  ambiguity_threshold: 0.1
policies:
# # No configuration for policies was provided. The following default policies were used to train your model.
# # If you'd like to customize them, uncomment and adjust the policies.
# # See https://rasa.com/docs/rasa/policies for more information.
#   - name: MemoizationPolicy
#   - name: TEDPolicy
#     max_history: 5
#     epochs: 100
#     constrain_similarities: true
#   - name: RulePolicy
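
In case the two FallbackClassifier numbers above are unclear, here is roughly how I understand them to behave, going by the Rasa docs: the fallback intent is predicted when the top intent's confidence is below `threshold`, or when the top two intents are closer together than `ambiguity_threshold`. This is only a minimal Python sketch of that rule, not Rasa's actual code. The function name `should_fall_back`, the handling of an empty list, and the example confidences are all made up for illustration.

```python
from typing import Sequence


def should_fall_back(
    confidences: Sequence[float],
    threshold: float = 0.3,
    ambiguity_threshold: float = 0.1,
) -> bool:
    """Hypothetical helper sketching the documented fallback rule."""
    # Rank intent confidences from most to least likely.
    ranked = sorted(confidences, reverse=True)
    if not ranked:
        # Assumption for this sketch: with no predictions, fall back.
        return True
    # Rule 1: the best intent is not confident enough.
    if ranked[0] < threshold:
        return True
    # Rule 2: the two best intents are too close to tell apart.
    if len(ranked) >= 2 and ranked[0] - ranked[1] < ambiguity_threshold:
        return True
    return False


if __name__ == "__main__":
    print(should_fall_back([0.85, 0.10, 0.05]))  # False: confident and unambiguous
    print(should_fall_back([0.25, 0.20]))        # True: top confidence below 0.3
    print(should_fall_back([0.45, 0.40]))        # True: top two within 0.1
```

If you see too many fallbacks on your own data, these two values are the first thing I would tune.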