Rasa for Japanese language

Hello, I need to build a chatbot for the Japanese language and, based on previous related posts on the Rasa forum, I tried using mecab-python3.

The code shared in the mahbubcseju GitHub repo gave errors with Rasa 2.0, and the custom tokenizer needed editing. Here is the custom tokenizer code that worked for me for intents with no entities:

from typing import Any, Text, List

from rasa.nlu.components import Component
from rasa.nlu.config import RasaNLUModelConfig
from rasa.shared.nlu.training_data.training_data import TrainingData
from rasa.shared.nlu.training_data.message import Message
from rasa.nlu.tokenizers.tokenizer import Token, Tokenizer

import MeCab

class JapaneseTokenizer(Tokenizer, Component):

    provides = ["tokens"]

    def train(
        self, training_data: TrainingData, config: RasaNLUModelConfig, 
        **kwargs: Any,
    ) -> None:
        for example in training_data.training_examples:
            # Message.get returns None when the attribute is missing
            text_string = example.get("text") or ""
            example.set("tokens", self.tokenize(text_string))

    def process(self, message: Message, **kwargs: Any) -> None:
        # no-op: tokens are only attached during training
        pass
        
    @staticmethod
    def tokenize(text: Text) -> List[Token]:
        mt = MeCab.Tagger("-Owakati")
        # -Owakati: surface forms joined by single spaces
        parsed = mt.parse(text)
        words = parsed.split()
        running_offset = 0
        tokens = []
        for word in words:
            # wakati output keeps surface forms intact, so each word
            # can be located in the original text to get its offset
            word_offset = text.index(word, running_offset)
            running_offset = word_offset + len(word)
            tokens.append(Token(word, word_offset))
        return tokens
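
For context, the -Owakati flag makes MeCab emit just the surface forms separated by single spaces, which is what the offset lookup above relies on. A quick sanity check (assuming mecab-python3 with a default dictionary such as ipadic installed):

import MeCab

tagger = MeCab.Tagger("-Owakati")
print(tagger.parse("私は東京に住んでいます"))
# 私 は 東京 に 住ん で い ます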

config.yml used:

language: ja
pipeline:
  - name: JapaneseTokenizer
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: word
    min_ngram: 1
    max_ngram: 4
  - name: CRFEntityExtractor
  - name: KeywordIntentClassifier
  - name: EntitySynonymMapper
  - name: FallbackClassifier
    threshold: 0.3
    ambiguity_threshold: 0.1
policies:
  - name: MemoizationPolicy
  - name: TEDPolicy
    max_history: 5
    epochs: 100
    constrain_similarities: true
  - name: RulePolicy
    core_fallback_threshold: 0.4
    core_fallback_action_name: "action_default_fallback"
    enable_fallback_prediction: True
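
A note if you're copying this config: Rasa 2.x looks up custom components by their importable module path, so the tokenizer entry may need the full path rather than the bare class name. Assuming the tokenizer code above is saved as japanese_tokenizer.py in the project root, that entry would be:

pipeline:
  - name: japanese_tokenizer.JapaneseTokenizer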

But when I add entities to my training data, it starts throwing errors:

File "/Users/mac/.../lib/python3.7/site-packages/rasa/nlu/extractors/extractor.py", line 451, in check_correct_entity_annotations
    token_start_positions = [t.start for t in example.get(TOKENS_NAMES[TEXT])]
TypeError: 'NoneType' object is not iterable
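
For reference, Rasa 2.x's built-in tokenizers implement tokenize(message, attribute) and let the base Tokenizer class handle train() and process(), which is where tokens actually get attached to each message. My guess is that the train()/process() overrides above bypass that, so example.get(TOKENS_NAMES[TEXT]) comes back None when entity annotations are validated. An untested sketch of the same MeCab tokenizer written against that 2.x interface:

import MeCab
from typing import Any, Dict, List, Text

from rasa.nlu.tokenizers.tokenizer import Token, Tokenizer
from rasa.shared.nlu.training_data.message import Message


class JapaneseTokenizer(Tokenizer):

    defaults = {
        # Flag to check whether to split intents
        "intent_tokenization_flag": False,
        # Symbol on which intent should be split
        "intent_split_symbol": "_",
    }

    def __init__(self, component_config: Dict[Text, Any] = None) -> None:
        super().__init__(component_config)
        # -Owakati: surface forms joined by single spaces
        self.tagger = MeCab.Tagger("-Owakati")

    @classmethod
    def required_packages(cls) -> List[Text]:
        return ["MeCab"]

    def tokenize(self, message: Message, attribute: Text) -> List[Token]:
        text = message.get(attribute)
        words = self.tagger.parse(text).split()
        # the base-class helper computes each token's start offset
        return self._convert_words_to_tokens(words, text)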

Would appreciate any ideas or suggestions on what needs to be corrected, or pointers to any blogs or documentation for a Japanese tokenizer with Rasa 2.0.

@Vin Hello (Kon'nichiwa 🙂) Please refer to this blog: https://towardsdatascience.com/multi-lingual-chatbot-using-rasa-and-custom-tokenizer-7aeb2346e36b

OR

I hope this will help you!!


@nik202
Thanks for the blog links. I had tried both of them but was getting some errors. On closer inspection, the errors with the 1st blog were due to its Rasa 1.0-format code, which I edited out; after trying again, it's working.

The 2nd blog, which I had posted this query about, is still unresolved (I'll happily take the other option that works).

Thanks (Arigatō gozaimashita 🙂) again.

PS: The 1st one, the SudachiPy blog, worked for me using the edits below.

japanese_tokenizer.py

from typing import Any, Dict, List, Text

from rasa.shared.nlu.training_data.message import Message
from rasa.nlu.tokenizers.tokenizer import Token, Tokenizer
from rasa.nlu.constants import TOKENS_NAMES, MESSAGE_ATTRIBUTES

# class SudachiTokenizer(Tokenizer):  # original class name in the blog
class JapaneseTokenizer(Tokenizer):
    provides = [TOKENS_NAMES[attribute] for attribute in MESSAGE_ATTRIBUTES]

    defaults = {
        # Flag to check whether to split intents
        "intent_tokenization_flag": False,
        # Symbol on which intent should be split
        "intent_split_symbol": "_",
    }

    def __init__(self, component_config: Dict[Text, Any] = None) -> None:
        super().__init__(component_config)

        from sudachipy import dictionary
        from sudachipy import tokenizer

        # needs a SudachiPy dictionary package installed (e.g. sudachidict_core)
        self.tokenizer_obj = dictionary.Dictionary().create()
        # SplitMode.A = shortest token units
        self.mode = tokenizer.Tokenizer.SplitMode.A

    @classmethod
    def required_packages(cls) -> List[Text]:
        return ["sudachipy"]

    # def tokenize(self, text: Text) -> List[Token]:  # Rasa 1.x signature in the blog
    def tokenize(self, message: Message, attribute: Text) -> List[Token]:
        text = message.get(attribute)
        # surface() returns each token exactly as it appears in the text
        words = [m.surface() for m in self.tokenizer_obj.tokenize(text, self.mode)]

        return self._convert_words_to_tokens(words, text)
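
One dependency note for anyone reproducing this: dictionary.Dictionary() needs a SudachiPy dictionary package installed alongside sudachipy itself, e.g.:

pip install sudachipy sudachidict_core

SplitMode.A gives the shortest token units; B and C produce progressively longer ones, so the mode is worth experimenting with for entity extraction.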

@Vin Congrats, and glad both of my suggestions (links) helped you achieve your goal. Yeah, the code obviously needs some updates 🙂 Good luck!

@Vin Thanks for sharing. Could you please let me know which Rasa version worked with the code you've given?

@Krishnasubedi I used the following - Rasa Version: 2.6.2, Minimum Compatible Version: 2.6.0, Rasa SDK Version: 2.6.0, Rasa X Version: 0.41.1, Python Version: 3.7.6

@Vin Vinamra Thanks a lot for your kindness.