Tokenizer for languages without spaces, such as Japanese

Hello,

I’m adding Japanese support to my bot, which means I have to clone my Rasa bot to make a new one in Japanese. I’m using the default "supervised_embeddings" pipeline, which uses the WhitespaceTokenizer. This tokenizer cannot be used with languages that have no spaces, such as Japanese or Chinese, so I’m wondering whether there is a tokenizer from Rasa that I can use to deal with this kind of language.

I read in the docs that the JiebaTokenizer supports Chinese. Can it be used for Japanese too, or do I have to make a custom tokenizer? If the latter, does anyone know of a reliable tokenizer for Japanese that can be easily integrated with Rasa? Thank you.

A quick Google search showed that spaCy has a Japanese tokenizer.

I even found a guide, "Setting up Japanese NLP with spaCy and MeCab", for the setup.
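
Once it is set up, a quick check along these lines should tell you whether Japanese tokenization is working. This is just a sketch: depending on the spaCy version, Japanese support is backed by MeCab (as in the guide) or by SudachiPy, so the exact install requirements differ.

```python
import spacy

# Create a blank Japanese pipeline and tokenize a sample sentence.
# Requires spaCy's Japanese dependencies (MeCab or SudachiPy, depending on version).
nlp = spacy.blank("ja")
doc = nlp("東京で美味しいラーメンを食べたい")
print([token.text for token in doc])
# You should see a list of word-level tokens rather than one unbroken string.
```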


@IgNoRaNt23 Awesome! Although it seems like I will have to install various old versions of the dependencies (as the guide suggests), which will probably cause some problems given my limited experience and knowledge. Nonetheless, it is definitely a potential solution to my problem.

I have one more question: after installing all the required components so that I can use spaCy for Japanese, do I have to create a custom component for the tokenizer, or can I just use the pretrained_embeddings_spacy pipeline like this?

language: "jp"

pipeline: "pretrained_embeddings_spacy"

Thank you for your help.

You will need a custom tokenizer.


Hi, did you eventually find a way to deal with the Japanese tokenizer? Did you use spaCy in the end? Would love to know how your project progressed!

Hello @Ducati1098s,

Yes, I did use a custom tokenizer just as IgNoRaNt23 suggested, and it worked pretty well (at least for my domain; I’m not sure how it will perform on a larger one). It produced results very similar to those of this online tokenizer: https://www.atilika.org/, so you can try playing with that to see if it fits your needs.

This is the tokenizer that I integrated as a custom tokenizer for Rasa. At the time I was not able to use it on Windows, so I had to run and test Rasa with Docker instead (which was annoying). According to the description, it can now run on Windows: mecab-python3 · PyPI

P.S.: I didn’t use spaCy; I kept the supervised_embeddings pipeline and replaced the WhitespaceTokenizer with my CustomTokenizer.
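
For anyone who needs a starting point, a minimal sketch of that kind of MeCab-based tokenizer could look something like this. It assumes a Rasa 1.x-style tokenizer API and mecab-python3 with a standard dictionary installed; the class name is just an example, not my exact code, and the custom-component interface changes between Rasa versions, so check the docs for the version you are on.

```python
from typing import Any, Dict, List, Optional, Text

import MeCab

from rasa.nlu.tokenizers.tokenizer import Token, Tokenizer
from rasa.nlu.training_data import Message


class JapaneseTokenizer(Tokenizer):
    """Example MeCab-based tokenizer meant to replace WhitespaceTokenizer."""

    def __init__(self, component_config: Optional[Dict[Text, Any]] = None) -> None:
        super().__init__(component_config)
        # "-Owakati" makes MeCab output space-separated surface forms.
        self._tagger = MeCab.Tagger("-Owakati")

    def tokenize(self, message: Message, attribute: Text) -> List[Token]:
        text = message.get(attribute)
        words = self._tagger.parse(text).strip().split()

        # Rebuild character offsets so downstream entity extraction keeps working.
        tokens, offset = [], 0
        for word in words:
            start = text.find(word, offset)
            if start == -1:  # fall back if MeCab altered the surface form
                start = offset
            tokens.append(Token(word, start))
            offset = start + len(word)
        return tokens
```

You then reference it in config.yml by module path, e.g. `- name: "custom_components.JapaneseTokenizer"` (a hypothetical module name), in place of WhitespaceTokenizer.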


@fuih I will check out what’s been built! I’ll also experiment and plan to post more about Japanese language use and the quirks of languages with no whitespace, different character sets, etc. Thanks for the reply!

---- Update ----

So I accessed the link and saw the tokenizer applied to search-engine use!

Right now I’m specifically looking to build a standard bot, but focused on the insurance and finance domain. There’s specific jargon that truncates longer kanji compounds into shorter forms, e.g. 団体信用生命保険 → 団信.

So I’m thinking that we’d need a custom tokenizer as well… but I plan to test and learn.
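
As a first test, I figure I can run both forms through MeCab directly to see how a stock dictionary segments them. A rough sketch, assuming mecab-python3 and a standard dictionary such as unidic-lite are installed:

```python
import MeCab

# "-Owakati" produces space-separated tokens.
tagger = MeCab.Tagger("-Owakati")

# Compare how the full term and the in-domain abbreviation get segmented.
for text in ["団体信用生命保険", "団信"]:
    print(text, "->", tagger.parse(text).strip())
```

If the abbreviation gets split awkwardly, a MeCab user dictionary with the domain terms is probably the way to go.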

It was great to learn about the company too.

@Ducati1098s, we are also working on Japanese. We are trying out the pipeline below:

```yaml
language: ja

pipeline:
  - name: HFTransformersNLP
    model_name: bert
    model_weights: cl-tohoku/bert-base-japanese
    cache_dir: /tmp
  - name: LanguageModelTokenizer
    intent_tokenization_flag: false
    intent_split_symbol: _
  - name: LanguageModelFeaturizer
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: char_wb
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 100
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 100
  - name: FallbackClassifier
    threshold: 0.3
    ambiguity_threshold: 0.1
```
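
As a side note, you can inspect the tokenization this model uses outside of Rasa with the transformers library, which is a quick way to check how your domain phrases get split. This assumes transformers plus the fugashi and ipadic packages that the Japanese BERT tokenizer depends on are installed.

```python
from transformers import AutoTokenizer

# Loads the tokenizer shipped with the cl-tohoku weights:
# MeCab-based word segmentation followed by WordPiece.
tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese")
print(tokenizer.tokenize("保険の見積もりをお願いします"))
```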

Were you able to figure out a good pipeline?