Hello
First of all, I’m a Korean user and English is not my first language, so there may be several typos. Please bear with me.
I’m trying to build a chatbot with Rasa 3.2 and the spaCy model “ko_core_news_lg” so that users can ask about course information or the policies of our in-company classes. Intent classification for Korean works, but I have a problem with the course-name entity.
The course names contain many special characters such as “(”, “)”, “-”, “[” and “]”, which are used to separate course categories (e.g. “(first optional locate) RPA training for first user(A360)”, “(second optional locate) RPA training for first user(A360)”, “(first optional locate) RPA training for expert user(A360)”).
My spaCy tokenizer tokenizes the text without the special characters, so the tokens do not contain them.
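To illustrate the kind of tokenization I am after, here is a minimal sketch that uses only Python’s `re` module. The function name `tokenize_keep_specials` and the pattern are my own assumptions for illustration; they are not part of Rasa or spaCy. Each special character becomes its own token, and every token keeps its start offset in the original text.

```python
import re
from typing import List, Tuple

# Hypothetical pattern: each listed special character is its own token;
# any other run of non-space characters is one token.
TOKEN_PATTERN = re.compile(r"[()\[\]\-]|[^\s()\[\]\-]+")


def tokenize_keep_specials(text: str) -> List[Tuple[str, int]]:
    """Return (token_text, start_offset) pairs, keeping special characters."""
    return [(m.group(), m.start()) for m in TOKEN_PATTERN.finditer(text)]
```

For example, `tokenize_keep_specials("user(A360)")` gives `[("user", 0), ("(", 4), ("A360", 5), (")", 9)]`, so the brackets survive as tokens instead of being dropped.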
I got some help from ChatGPT, which suggested a custom tokenizer:
    import re
    from typing import List

    import spacy
    from rasa.engine.graph import ExecutionContext, GraphComponent
    from rasa.engine.recipes.default_recipe import DefaultV1Recipe
    from rasa.engine.storage.resource import Resource
    from rasa.nlu.tokenizers.tokenizer import Token, Tokenizer
    from rasa.shared.nlu.constants import TEXT, TEXT_TOKENS
    from rasa.shared.nlu.training_data.message import Message


    @DefaultV1Recipe.register(
        [DefaultV1Recipe.ComponentType.MESSAGE_TOKENIZER], is_trainable=False
    )
    class CustomSpecializeCharTokenizer(Tokenizer, GraphComponent):
        @staticmethod
        def get_default_config() -> dict:
            return {
                "case_sensitive": False,
                "spacy_model": "ko_core_news_lg",
            }

        def __init__(self, config: dict, name: str, resource: Resource) -> None:
            super().__init__(config)
            # Load the spaCy model inside this component.
            self.nlp = spacy.load(config["spacy_model"])
            self.case_sensitive = config["case_sensitive"]

        def process(self, message: Message) -> None:
            text = message.get(TEXT)
            if not self.case_sensitive:
                text = text.lower()
            # TODO: add a regex for the special characters to build the tokens
            # Use the tokenizer of the spaCy pipeline loaded in __init__
            # (the posted code called self.tokenizer, which is never defined).
            tokens = self.nlp.tokenizer(text)
            # t.idx is the character offset of the token in the text.
            tokens_list = [Token(t.text, t.idx) for t in tokens]
            message.set(TEXT_TOKENS, tokens_list)

        @classmethod
        def create(
            cls,
            config: dict,
            name: str,
            resource: Resource,
            execution_context: ExecutionContext,
        ) -> GraphComponent:
            return cls(config, name, resource)
ChatGPT said that with a regex this can produce tokens for the full entity names, like the names in my lookup table,
but it has several problems.
It causes SpacyFeaturizer to stop working. I think my custom tokenizer contains the spaCy tokenizer, but the featurizer does not recognize SpacyTokenizer when it is wrapped inside my custom tokenizer.
Is there anything wrong with my code, or is there a better suggestion? I need the special characters because the full course name has to be captured exactly in order to run the RPA after a request comes in.
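To make the requirement concrete, the end result I need could be sketched outside Rasa as a plain, case-sensitive lookup of the full course names. This is only an illustration under my own assumptions: `COURSE_NAMES`, `build_course_pattern` and `find_course` are hypothetical names, and the list reuses the example names from above in place of a real lookup table.

```python
import re
from typing import List, Optional, Pattern

# Example names from this post; a real lookup table would be much longer.
COURSE_NAMES: List[str] = [
    "(first optional locate) RPA training for first user(A360)",
    "(second optional locate) RPA training for first user(A360)",
    "(first optional locate) RPA training for expert user(A360)",
]


def build_course_pattern(names: List[str]) -> Pattern[str]:
    """Build one regex that matches any course name literally."""
    # re.escape makes "(", ")", "-", "[" and "]" match as plain characters.
    # Longer names are tried first so they win over shorter ones.
    ordered = sorted(names, key=len, reverse=True)
    return re.compile("|".join(re.escape(name) for name in ordered))


def find_course(text: str, pattern: Pattern[str]) -> Optional[str]:
    """Return the first full course name found in the text, or None."""
    match = pattern.search(text)
    return match.group() if match else None
```

With `pattern = build_course_pattern(COURSE_NAMES)`, calling `find_course(user_text, pattern)` returns the exact course name including its brackets, which is the string the RPA step needs.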
Please help. Thanks to all.