Adding a tokenizer to a predefined pipeline (for languages like Chinese)

I am trying to build a Chinese chatbot with the ‘tensorflow_embedding’ predefined pipeline. Since ‘tokenizer_whitespace’ won’t work for Chinese, I want to try ‘tensorflow_embedding’ with ‘tokenizer_jieba’. So:

  1. Is there a way to use a specific component with a specific predefined pipeline?
  2. If not, where can I find the components used in the ‘tensorflow_embedding’ pipeline, so that I can manually modify the config file to use ‘tokenizer_jieba’?

Check out the docs for the JiebaTokenizer here.
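
As far as I know, you can’t swap a single component into a pipeline template by name; instead, you list the template’s components explicitly in your config file and replace the tokenizer there. Below is a minimal sketch of such a config, assuming the component list given in the Rasa NLU docs for the ‘tensorflow_embedding’ template (exact names may differ between versions), with ‘tokenizer_jieba’ in place of ‘tokenizer_whitespace’:

```yaml
# config.yml — the ‘tensorflow_embedding’ template spelled out component by
# component, so the tokenizer can be replaced. Component names assume the
# Rasa NLU docs for this template and may differ in other versions.
language: "zh"                                      # tokenizer_jieba only supports Chinese

pipeline:
- name: "tokenizer_jieba"                           # replaces tokenizer_whitespace
- name: "ner_crf"                                   # CRF entity extraction
- name: "ner_synonyms"                              # maps entity synonyms to canonical values
- name: "intent_featurizer_count_vectors"           # bag-of-words features over the tokens
- name: "intent_classifier_tensorflow_embedding"    # the embedding intent classifier
```

Note that jieba itself has to be installed separately (`pip install jieba`) before training with this config.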
