Need help extracting entities with special characters

Zarrha · October 4, 2024, 4:11am

Hello

First of all, I’m a Korean user and English is not my first language.

There may be several typos, so please bear with me.

I’m trying to build a chatbot with Rasa 3.2 using the spaCy model “ko_core_news_lg”, so that users can ask about course information or the policies of in-company classes. Korean intent classification is working, but I have a problem with the course-name entity.

The course names contain many special characters such as “(”, “)”, “-”, “[”, “]”, which are used to separate course categories (e.g. “(first optional locate) RPA training for first user(A360)”, “(second optional locate) RPA training for first user(A360)”, “(first optional locate) RPA training for expert user(A360)”).

My spaCy tokenizer drops the special characters when it tokenizes, so the resulting tokens do not contain them.

I’m getting help from ChatGPT, and it suggested a custom tokenizer:

import re
import spacy

from rasa.engine.recipes.default_recipe import DefaultV1Recipe
from rasa.engine.storage.resource import Resource
from rasa.engine.graph import GraphComponent, ExecutionContext
from rasa.shared.nlu.training_data.message import Message
from rasa.nlu.tokenizers.tokenizer import Token, Tokenizer
from rasa.shared.nlu.constants import TEXT, TEXT_TOKENS
from typing import List

@DefaultV1Recipe.register(
    [DefaultV1Recipe.ComponentType.MESSAGE_TOKENIZER], is_trainable=False
)
class CustomSpecializeCharTokenizer(Tokenizer, GraphComponent):
    @staticmethod
    def get_default_config() -> dict:
        return {
            "case_sensitive": False,
            "spacy_model": "ko_core_news_lg"
        }
    
    def __init__(self, config: dict, name: str, resource: Resource) -> None:
        super().__init__(config)
        self.nlp = spacy.load(config["spacy_model"])
        self.case_sensitive = config["case_sensitive"]
        
    def process(self, message: Message) -> None:
        text = message.get(TEXT)
        
        if not self.case_sensitive:
            text = text.lower()

        # add a regex for the special characters here to build the tokens
        
        tokens = self.tokenizer(text)
        tokens_list = [Token(t.text, t.idx) for t in tokens]
        message.set(TEXT_TOKENS, tokens_list)
    
    @classmethod
    def create(cls, config: dict, name: str, resource: Resource, execution_context: ExecutionContext) -> GraphComponent:
        return cls(config, name, resource)

ChatGPT said this tokenizer could use a regex to tokenize the names so that the full entity names are kept, like the entries in my lookup table.
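A minimal sketch of what the regex step marked by the comment in the tokenizer above might look like, assuming plain Python `re` only. The helper name `split_with_specials` and the constant `SPECIAL_CHARS` are hypothetical and not part of Rasa or spaCy; the sketch only shows how brackets and hyphens could be kept as separate tokens together with their character offsets, which is the (text, start) pair a Rasa `Token` is built from in the code above.

```python
import re
from typing import List, Tuple

# Hypothetical set of characters to keep as standalone tokens
# (taken from the course-name examples quoted earlier).
SPECIAL_CHARS = "()[]-"

_escaped = re.escape(SPECIAL_CHARS)

# Match either a single special character, or a run of characters
# that are neither whitespace nor special characters.
_TOKEN_RE = re.compile(rf"[{_escaped}]|[^\s{_escaped}]+")


def split_with_specials(text: str) -> List[Tuple[str, int]]:
    """Return (token_text, start_offset) pairs, keeping special characters."""
    return [(m.group(), m.start()) for m in _TOKEN_RE.finditer(text)]


if __name__ == "__main__":
    example = "(first optional locate) RPA training for first user(A360)"
    for token_text, start in split_with_specials(example):
        print(start, repr(token_text))
```

Each pair could then be wrapped as `Token(token_text, start)` in place of the spaCy-based list. Whether the rest of the pipeline accepts tokens produced this way is a separate question that this sketch does not address.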

But the custom tokenizer has several problems.

It causes SpacyFeaturizer to stop working. I think my custom tokenizer includes the spaCy tokenizer, but the featurizer does not recognize SpacyTokenizer when it is wrapped inside my custom tokenizer.

Is there anything wrong with my code, or is there a better approach? I need the special characters because the full course name has to be captured exactly in order to run the RPA after a request comes in.

Please help. Thanks to all.
