How to set the pipeline config for another language

I already tested my code on the splot task, and it works well with English. Now I have changed my data to Thai, then retrained and retested, but the default pipeline settings no longer give accurate results. I think the cause is tokenization, so I tried changing the tokenization method in the pipeline as follows:

language: th

pipeline:
  - name: "SpacyNLP"
    model: "xx_ent_wiki_sm"
  - name: "SpacyTokenizer"

The spaCy library and the spaCy model "xx_ent_wiki_sm" are both installed, but the results are still inaccurate. I have three questions:

  1. Is the configuration above correct?
  2. Are there any examples of a custom tokenizer, featurizer, and classifier implemented in my own .py file?
  3. What are the pipeline's default settings (tokenizer, featurizer, classifier model), i.e. when nothing is specified?
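
To show why I suspect tokenization, here is a minimal sketch. It assumes the tokenizer in use splits on whitespace, which is my assumption and not something I have confirmed from the pipeline. The function `whitespace_tokenize` and the sample sentences are hypothetical, made up only for illustration. Thai is written without spaces between words, so a whitespace split returns the whole sentence as a single token:

```python
from typing import List


def whitespace_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in for a tokenizer that splits on whitespace only."""
    return text.split()


english = "I love Thai food"
thai = "ฉันรักอาหารไทย"  # roughly the same sentence, written without spaces

english_tokens = whitespace_tokenize(english)
thai_tokens = whitespace_tokenize(thai)

# English is separated into words; the Thai sentence stays in one piece.
print(len(english_tokens), english_tokens)
print(len(thai_tokens), thai_tokens)
```

If this is what happens inside the pipeline, every Thai sentence would reach the featurizer as one unseen "word", which would explain the drop in accuracy.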

Thanks for your reply.
