Chinese Pipeline Suggestion

Any Chinese users here? I built my pipeline with the configuration below:

language: "zh"

pipeline:
  - name: "JiebaTokenizer"
    dictionary_path: "data/dict"
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: char_wb
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 100
    constrain_similarities: true
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 100
    constrain_similarities: true
  - name: FallbackClassifier
    threshold: 0.7
    ambiguity_threshold: 0.1

I have run it, but the results are not good enough: some sentences are not recognized. I tried replacing CountVectorsFeaturizer with BERT, but that did not improve things. I would appreciate any suggestions for configuring a Chinese pipeline.
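
For anyone comparing notes, a BERT-based variant of the pipeline above might look roughly like the sketch below. It assumes Rasa's LanguageModelFeaturizer loaded with the `bert-base-chinese` weights from Hugging Face. The choice of weights, and keeping the character n-gram featurizer alongside the dense features, are illustrative assumptions, not necessarily the exact setup tried above.

```yaml
# Sketch only: a BERT-based variant of the pipeline above.
language: "zh"

pipeline:
  - name: JiebaTokenizer
    dictionary_path: "data/dict"
  # Assumption: dense features from a pretrained Chinese BERT model.
  - name: LanguageModelFeaturizer
    model_name: "bert"
    model_weights: "bert-base-chinese"
  # Kept from the original config: sparse character n-gram features.
  - name: CountVectorsFeaturizer
    analyzer: char_wb
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 100
    constrain_similarities: true
  - name: FallbackClassifier
    threshold: 0.7
    ambiguity_threshold: 0.1
```

LanguageModelFeaturizer needs the optional transformers dependency (for example via `pip install rasa[transformers]`), and the weights are downloaded the first time the model is trained.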

Hi @ILG2021, just to check … have you seen our blog post on non-English NLU pipelines? It has a small section specific to Chinese.

Yes, I have read it.

Do the components listed in that blogpost help you?

I have run it, but the results are not good enough: some sentences are not recognized.

Just to confirm the expectation, what makes you say the performance isn’t “good enough”? After all, we’re running a statistical machine learning algorithm. It’s likely that it won’t be able to predict everything perfectly.