Tokenizer for languages without spaces, such as Japanese

Hello,

I’m adding Japanese support to my bot, which means I have to clone my Rasa bot to make a new one in Japanese. I’m using the default "supervised_embeddings" pipeline, which uses the WhitespaceTokenizer. This tokenizer cannot be used with languages that have no spaces, such as Japanese or Chinese, so I’m wondering whether there is a tokenizer from Rasa that I can use to deal with this kind of language.

I read in the docs that the JiebaTokenizer supports Chinese. Can it be used for Japanese too, or do I have to make a custom tokenizer? If the latter, does anyone know of a reliable tokenizer for Japanese that can be easily integrated with Rasa? Thank you.

A quick Google search showed that spaCy has a Japanese tokenizer.

I even found a guide, "Setting up Japanese NLP with spaCy and MeCab", for the setup.
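
Once it is set up, a quick check along these lines should tell you whether Japanese tokenization is working. This is just a sketch: depending on the spaCy version, Japanese support is backed by MeCab (as in the guide) or by SudachiPy, so the exact install requirements differ.

```python
import spacy

# Create a blank Japanese pipeline and tokenize a sample sentence.
# Requires spaCy's Japanese dependencies (MeCab or SudachiPy, depending on version).
nlp = spacy.blank("ja")
doc = nlp("東京で美味しいラーメンを食べたい")
print([token.text for token in doc])
# You should see a list of word-level tokens rather than one unbroken string.
```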


@IgNoRaNt23 Awesome! Although it seems like I will have to install various old versions of the dependencies (as the guide suggests), which will probably cause some problems given my limited experience and knowledge. Nonetheless, it is definitely a potential solution to my problem.

I have one more question: after installing all the required components so that I can use spaCy for Japanese, do I have to create a custom component for the tokenizer, or can I just use the pretrained_embeddings_spacy pipeline like this?

language: "jp"

pipeline: "pretrained_embeddings_spacy"

Thank you for your help.

You will need a custom tokenizer.


Hi, did you eventually find a way to deal with the Japanese tokenizer? Did you use spaCy in the end? Would love to know how your project progressed!

Hello @Ducati1098s,

Yes, I did use a custom tokenizer just as IgNoRaNt23 suggested, and it worked pretty well (at least for my domain; I’m not sure how it will perform on a larger one). It produced results very similar to those of this online tokenizer: https://www.atilika.org/, so you can try playing with that to see if it fits your needs.

This is the tokenizer that I integrated as a custom tokenizer for Rasa. At the time I was not able to use it on Windows, so I had to run and test Rasa with Docker instead (which was annoying). According to the description, it can now run on Windows: mecab-python3 · PyPI

P.S.: I didn’t use spaCy; I kept the supervised_embeddings pipeline and replaced the WhitespaceTokenizer with my CustomTokenizer.
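
For anyone who needs a starting point, a minimal sketch of that kind of MeCab-based tokenizer could look something like this. It assumes a Rasa 1.x-style tokenizer API and mecab-python3 with a standard dictionary installed; the class name is just an example, not my exact code, and the custom-component interface changes between Rasa versions, so check the docs for the version you are on.

```python
from typing import Any, Dict, List, Optional, Text

import MeCab

from rasa.nlu.tokenizers.tokenizer import Token, Tokenizer
from rasa.nlu.training_data import Message


class JapaneseTokenizer(Tokenizer):
    """Example MeCab-based tokenizer meant to replace WhitespaceTokenizer."""

    def __init__(self, component_config: Optional[Dict[Text, Any]] = None) -> None:
        super().__init__(component_config)
        # "-Owakati" makes MeCab output space-separated surface forms.
        self._tagger = MeCab.Tagger("-Owakati")

    def tokenize(self, message: Message, attribute: Text) -> List[Token]:
        text = message.get(attribute)
        words = self._tagger.parse(text).strip().split()

        # Rebuild character offsets so downstream entity extraction keeps working.
        tokens, offset = [], 0
        for word in words:
            start = text.find(word, offset)
            if start == -1:  # fall back if MeCab altered the surface form
                start = offset
            tokens.append(Token(word, start))
            offset = start + len(word)
        return tokens
```

You then reference it in config.yml by module path, e.g. `- name: "custom_components.JapaneseTokenizer"` (a hypothetical module name), in place of WhitespaceTokenizer.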


@fuih I will check out what’s been built! I’ll also experiment and plan to post more about Japanese language use and the quirks of languages with no whitespace, different character sets, etc. Thanks for the reply!

---- Update ----

So I accessed the link and saw the tokenizer applied to search-engine use!

Right now I’m specifically looking to build a standard bot, but focused on the insurance and finance domain. There’s specific jargon that truncates longer kanji compounds into shorter forms, e.g. 団体信用生命保険 → 団信.

So I’m thinking that we’d need a custom tokenizer as well… but I plan to test and learn.
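
As a first test, I figure I can run both forms through MeCab directly to see how a stock dictionary segments them. A rough sketch, assuming mecab-python3 and a standard dictionary such as unidic-lite are installed:

```python
import MeCab

# "-Owakati" produces space-separated tokens.
tagger = MeCab.Tagger("-Owakati")

# Compare how the full term and the in-domain abbreviation get segmented.
for text in ["団体信用生命保険", "団信"]:
    print(text, "->", tagger.parse(text).strip())
```

If the abbreviation gets split awkwardly, a MeCab user dictionary with the domain terms is probably the way to go.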

It was great to learn about the company too.

@Ducati1098s, we are also working on Japanese. We are trying out the pipeline below:

```yaml
language: ja

pipeline:
  - name: HFTransformersNLP
    model_name: bert
    model_weights: cl-tohoku/bert-base-japanese
    cache_dir: /tmp
  - name: LanguageModelTokenizer
    intent_tokenization_flag: false
    intent_split_symbol: _
  - name: LanguageModelFeaturizer
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: char_wb
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 100
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 100
  - name: FallbackClassifier
    threshold: 0.3
    ambiguity_threshold: 0.1
```
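
As a side note, you can inspect the tokenization this model uses outside of Rasa with the transformers library, which is a quick way to check how your domain phrases get split. This assumes transformers plus the fugashi and ipadic packages that the Japanese BERT tokenizer depends on are installed.

```python
from transformers import AutoTokenizer

# Loads the tokenizer shipped with the cl-tohoku weights:
# MeCab-based word segmentation followed by WordPiece.
tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese")
print(tokenizer.tokenize("保険の見積もりをお願いします"))
```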

Were you able to figure out a good pipeline?