Tokenizer for languages without spaces, such as Japanese

I’m adding Japanese support to my bot, which means I have to clone my Rasa bot to make a new one in Japanese. I’m using the default pipeline “supervised_embeddings”, which uses the WhitespaceTokenizer. This tokenizer cannot be used with languages that have no spaces, such as Japanese and Chinese. So I wonder if there is a tokenizer from Rasa that I can use to deal with this kind of language.

I read in the docs that the JiebaTokenizer supports Chinese. Can it be used for Japanese too, or do I have to make a custom tokenizer? If the latter is true, does anyone know a reliable tokenizer for Japanese that can be easily integrated with Rasa? Thank you.

A quick Google search showed that spaCy has a Japanese tokenizer.

I even found a guide, Setting up Japanese NLP with spaCy and MeCab, for the setup.

@IgNoRaNt23 Awesome! Although it seems like I will have to install various old versions of dependencies (as the guide suggests), which will probably cause some problems given my limited experience and knowledge. Nonetheless, it is definitely a potential solution to my problem.

I have one more question: after installing all the required components so that I can use spaCy for Japanese, do I have to create a custom component for the tokenizer, or can I just use pretrained_embeddings_spacy like this:

```yaml
language: "ja"

pipeline: "pretrained_embeddings_spacy"
```

Thank you for your help.

You will need a custom tokenizer.
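For reference, a custom tokenizer component in Rasa 1.x is a class that subclasses `rasa.nlu.tokenizers.tokenizer.Tokenizer` and produces a list of `Token` objects (surface text plus character offset) in its `train()`/`process()` methods. The sketch below shows only that shape, without depending on Rasa itself: the `Token` dataclass and `JapaneseTokenizer` class are illustrative stand-ins, and the naive one-character-per-token splitter is a placeholder you would replace with a real Japanese morphological analyzer such as MeCab, Janome, or SudachiPy.

```python
# Sketch of the shape a custom Rasa tokenizer takes. Names here are
# illustrative; a real component would subclass Rasa's Tokenizer and set
# message.set("tokens", ...) inside train()/process().
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    # Mirrors the fields Rasa's Token carries: surface text + char offset.
    text: str
    start: int


class JapaneseTokenizer:
    """Stand-in for a custom tokenizer component."""

    def tokenize(self, text: str) -> List[Token]:
        # Placeholder segmentation: one token per character, skipping
        # whitespace. Swap in a real analyzer (e.g. Janome) here.
        tokens = []
        for offset, ch in enumerate(text):
            if not ch.isspace():
                tokens.append(Token(ch, offset))
        return tokens


tokens = JapaneseTokenizer().tokenize("こんにちは")
print([(t.text, t.start) for t in tokens])
# → [('こ', 0), ('ん', 1), ('に', 2), ('ち', 3), ('は', 4)]
```

The key point is the offsets: Rasa's featurizers expect each token to know where it starts in the original string, which a dictionary-based analyzer gives you by accumulating the lengths of the segmented words.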
