Chinese Pipeline Suggestion

ILG2021 · December 2, 2021, 12:21pm

Are there any Chinese users here? I built my pipeline with the configuration below:

language: "zh"

pipeline:
  - name: "JiebaTokenizer"
    dictionary_path: "data/dict"
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: char_wb
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 100
    constrain_similarities: true
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 100
    constrain_similarities: true
  - name: FallbackClassifier
    threshold: 0.7
    ambiguity_threshold: 0.1
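
For readers unfamiliar with the second CountVectorsFeaturizer entry: Rasa's CountVectorsFeaturizer is built on scikit-learn's CountVectorizer, and `analyzer: char_wb` with `min_ngram: 1` / `max_ngram: 4` produces character n-grams that stay inside token boundaries. The sketch below is a plain-Python approximation of that behaviour, not Rasa's or scikit-learn's actual code. The function name `char_wb_ngrams` and the example sentence are hypothetical, and it assumes the Jieba tokens are joined with spaces before vectorizing.

```python
def char_wb_ngrams(text, min_n=1, max_n=4):
    """Approximate char_wb analysis: character n-grams within each
    whitespace-separated token, with the token padded by one space
    on each side."""
    ngrams = []
    for word in text.split():
        padded = f" {word} "
        for n in range(min_n, max_n + 1):
            if n > len(padded):
                # Token is shorter than n; no longer n-grams exist.
                break
            for start in range(len(padded) - n + 1):
                ngrams.append(padded[start:start + n])
    return ngrams


# Hypothetical tokenizer output for "我想订机票", joined with spaces.
tokens = "我 想 订 机票"
grams = char_wb_ngrams(tokens)

# N-grams never span two tokens, so "订机" is not produced.
print("机票" in grams, "订机" in grams)
```

One consequence for Chinese: because most tokens are only one or two characters long, the character n-grams add little beyond the word-level features, so segmentation quality (and the custom dictionary in `data/dict`) likely matters more than the n-gram range.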

I have run it, but the results are not good enough: some sentences are not recognized. I tried replacing CountVectorsFeaturizer with BERT, but that did not improve things. I would appreciate any suggestions for configuring a Chinese pipeline.
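
One thing that may be worth checking with this config is the FallbackClassifier. As documented, it predicts `nlu_fallback` when no intent reaches `threshold`, or when the two highest-ranked intents are closer than `ambiguity_threshold`. With `threshold: 0.7`, a correct but moderately confident prediction would still surface as "not recognized". The sketch below only illustrates that decision rule; `apply_fallback` and the intent names are hypothetical, not Rasa's implementation.

```python
def apply_fallback(ranking, threshold=0.7, ambiguity_threshold=0.1):
    """ranking: list of (intent, confidence), sorted by confidence descending.

    Returns the intent that would remain after the fallback rule is applied.
    """
    top_intent, top_conf = ranking[0]
    if top_conf < threshold:
        # No intent is confident enough.
        return "nlu_fallback"
    if len(ranking) > 1 and top_conf - ranking[1][1] < ambiguity_threshold:
        # The two best intents are too close to call.
        return "nlu_fallback"
    return top_intent


# Hypothetical rankings from the intent classifier.
low_confidence = [("book_flight", 0.65), ("greet", 0.20)]
confident = [("book_flight", 0.90), ("greet", 0.05)]

print(apply_fallback(low_confidence))  # top intent is right, but below 0.7
print(apply_fallback(confident))
```

If many of the failing sentences fall into the first case, lowering `threshold` or adding training examples for those intents may help more than swapping featurizers.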

koaning · December 6, 2021, 12:42pm

Hi @ILG2021, just to check … have you seen our blog post on Non-English NLU pipelines? It has a small section that’s specific to Chinese.

ILG2021 · December 6, 2021, 1:31pm

Yes, I have read it.

koaning · December 6, 2021, 3:24pm

Do the components listed in that blogpost help you?

"I have run it, it seems not good enough, some sentences can not be recognized"

Just to confirm the expectation, what makes you say the performance isn’t “good enough”? After all, we’re running a statistical machine learning algorithm. It’s likely that it won’t be able to predict everything perfectly.
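
To make "good enough" concrete, it can help to measure how often each intent is recognized on held-out examples rather than judging from a few failed sentences. The sketch below assumes you already have (true intent, predicted intent) pairs from a test set; the data and the `per_intent_recall` helper are hypothetical. Rasa's `rasa test nlu` command produces comparable reports.

```python
from collections import Counter


def per_intent_recall(pairs):
    """pairs: iterable of (true_intent, predicted_intent).

    Returns the fraction of examples of each true intent
    that were predicted correctly.
    """
    pairs = list(pairs)
    total = Counter(true for true, _ in pairs)
    correct = Counter(true for true, pred in pairs if true == pred)
    return {intent: correct[intent] / total[intent] for intent in total}


# Hypothetical held-out results: (true intent, predicted intent).
results = [
    ("book_flight", "book_flight"),
    ("book_flight", "nlu_fallback"),
    ("greet", "greet"),
    ("greet", "greet"),
]
recall = per_intent_recall(results)

for intent, value in sorted(recall.items()):
    print(f"{intent}: {value:.2f}")
```

Intents with low recall point to where more or better training examples are needed, which is usually a more actionable answer than an overall impression.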
