Need help extracting entities that contain special characters

Hello

First of all, I’m a Korean user and English is not my first language.

There may be several typos, so please bear with me.

I’m trying to build a chatbot with Rasa 3.2 using the spaCy model “ko_core_news_lg”, so that users can ask about course information or the policies of our in-company classes. Intent classification for Korean is working, but I have a problem with the course name entity.

The course names contain many special characters such as “(”, “)”, “-”, “[”, “]”, which are used to separate course categories (e.g. “(first optional locate) RPA training for first user(A360)”, “(second optional locate) RPA training for first user(A360)”, “(first optional locate) RPA training for expert user(A360)”).

My spaCy tokenizer drops these special characters, so the resulting tokens do not contain them.
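To show what I am hoping for, here is a minimal sketch in plain Python (it is not the spaCy tokenizer and not Rasa code). It keeps every punctuation character as its own token together with its character offset. The helper name `tokenize_keep_special` and the regex are my own assumptions, just for illustration.

```python
import re
from typing import List, Tuple

# Hypothetical pattern: either a run of word characters (Hangul and Latin
# letters, digits, underscore) or one single non-space, non-word character.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize_keep_special(text: str) -> List[Tuple[str, int]]:
    """Return (token_text, start_offset) pairs, keeping special characters."""
    return [(m.group(), m.start()) for m in _TOKEN_RE.finditer(text)]


if __name__ == "__main__":
    for token, start in tokenize_keep_special("RPA training for first user(A360)"):
        print(start, token)
```

With this, “(” and “)” survive as separate tokens, so the offsets of a full course name can still be reconstructed from the first and last token.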

I got help from ChatGPT, and it suggested a custom tokenizer:

import re
import spacy

from rasa.engine.recipes.default_recipe import DefaultV1Recipe
from rasa.engine.storage.resource import Resource
from rasa.engine.graph import GraphComponent, ExecutionContext
from rasa.shared.nlu.training_data.message import Message
from rasa.nlu.tokenizers.tokenizer import Token, Tokenizer
from rasa.shared.nlu.constants import TEXT, TEXT_TOKENS
from typing import List

@DefaultV1Recipe.register(
    [DefaultV1Recipe.ComponentType.MESSAGE_TOKENIZER], is_trainable=False
)
class CustomSpecializeCharTokenizer(Tokenizer, GraphComponent):
    @staticmethod
    def get_default_config() -> dict:
        return {
            "case_sensitive": False,
            "spacy_model": "ko_core_news_lg"
        }
    
    def __init__(self, config: dict, name: str, resource: Resource) -> None:
        super().__init__(config)
        self.nlp = spacy.load(config["spacy_model"])
        self.case_sensitive = config["case_sensitive"]
        
    def process(self, message: Message) -> None:
        text = message.get(TEXT)
        
        if not self.case_sensitive:
            text = text.lower()

        # TODO: add a regex for the special characters here so they become tokens

        # Tokenize with the spaCy pipeline that was loaded in __init__
        tokens = self.nlp.tokenizer(text)
        tokens_list = [Token(t.text,t.idx) for t in tokens]
        message.set(TEXT_TOKENS, tokens_list)
    
    @classmethod
    def create(
        cls,
        config: dict,
        name: str,
        resource: Resource,
        execution_context: ExecutionContext,
    ) -> GraphComponent:
        return cls(config, name, resource)

ChatGPT said that, with a suitable regex, this tokenizer could keep the full entity names together, the way they appear in my lookup table.
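As I understand the idea, it is something like the sketch below: escape every name from the lookup table and search for the full name in the user message. This is plain Python only, not a Rasa component. The function names `build_course_pattern` and `find_courses` and the `COURSES` list are hypothetical names I made up for this example.

```python
import re
from typing import List, Tuple

# Example entries, taken from the course names mentioned above.
COURSES = [
    "(first optional locate) RPA training for first user(A360)",
    "(second optional locate) RPA training for first user(A360)",
    "(first optional locate) RPA training for expert user(A360)",
]


def build_course_pattern(course_names: List[str]) -> re.Pattern:
    """Build one regex that matches any full course name literally."""
    # Longest names first, so a longer name wins over a name that is its prefix.
    ordered = sorted(course_names, key=len, reverse=True)
    # re.escape makes "(", ")", "[", "]" and "-" match as literal characters.
    return re.compile("|".join(re.escape(name) for name in ordered), re.IGNORECASE)


def find_courses(text: str, pattern: re.Pattern) -> List[Tuple[int, int, str]]:
    """Return (start, end, matched_text) for every course name found in text."""
    return [(m.start(), m.end(), m.group()) for m in pattern.finditer(text)]


if __name__ == "__main__":
    pattern = build_course_pattern(COURSES)
    question = "When is (first optional locate) RPA training for expert user(A360)?"
    print(find_courses(question, pattern))
```

The start and end offsets are what an entity needs, so in this sketch the special characters never have to pass through the tokenizer at all. I am not sure whether this is the right way to do it inside Rasa.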

However, it has several problems.

It causes SpacyFeaturizer to stop working. I thought my custom tokenizer wraps the spaCy tokenizer, but the featurizer does not seem to recognize it as a SpacyTokenizer when the spaCy tokenizer is only used inside my custom tokenizer.

Is there anything wrong with my code, or is there a better approach? I need to keep the special characters because the full, exact course name is required to run the RPA job after a request comes in.

Please help. Thanks to all.