CRFEntityExtractor with spell checker

I have a problem when using a spell checker: the token boundaries change, and the CRFEntityExtractor accuracy decreases. Can I adjust the entity boundaries after replacing the training text with the spell-checked text?

Problem example: token boundary error for token (المسافرين) at (start: 61, end: 71), with entity annotation {'start': 64, 'end': 75, 'value': 'المسافرين', 'entity': 'departure'}.

After running the spell checker on the training example, the token boundary for (المسافرين) is still (start: 61, end: 71), but the entity annotation no longer matches the correct token.

How can I solve this problem?

Our class for the spell checker:

```python
import logging
import re
from typing import Any, Dict, Optional, Text

from rasa.nlu.components import Component


class TextCleaning(Component):
    """Custom component that normalizes Arabic text before tokenization."""

    provides = ["text"]
    language_list = None

    def __init__(self, component_config=None):
        super(TextCleaning, self).__init__(component_config)
        # Diacritics and the tatweel character to strip from the text.
        self.arabic_diacritics = re.compile("""
                            ّ    | # Tashdid
                            َ    | # Fatha
                            ً    | # Tanwin Fath
                            ُ    | # Damma
                            ٌ    | # Tanwin Damm
                            ِ    | # Kasra
                            ٍ    | # Tanwin Kasr
                            ْ    | # Sukun
                            ـ     # Tatwil/Kashida
                            """, re.VERBOSE)

    def train(self, training_data, cfg, **kwargs):
        pass

    def process(self, message, **kwargs):
        # Clean the incoming message text in place.
        text = self.remove_diacritics(message.text)
        text = self.normalize_arabic(text)
        text = self.remove_repeating_char(text)
        message.text = text
        logging.debug("message.text in clean_text {}".format(message.text))

    def persist(self, file_name: Text, model_dir: Text) -> Optional[Dict[Text, Any]]:
        pass

    @classmethod
    def load(
        cls,
        meta: Dict[Text, Any],
        model_dir: Optional[Text] = None,
        model_metadata: Optional["Metadata"] = None,
        cached_component: Optional["Component"] = None,
        **kwargs: Any
    ) -> "Component":
        """Load this component from file."""
        if cached_component:
            return cached_component
        return cls(meta)

    def normalize_arabic(self, text):
        # Collapse common Arabic letter variants into one canonical form.
        text = re.sub("[إأآا]", "ا", text)
        text = re.sub("ى", "ي", text)
        text = re.sub("ؤ", "ء", text)
        text = re.sub("ئ", "ء", text)
        text = re.sub("ة", "ه", text)
        text = re.sub("گ", "ك", text)
        return text

    def remove_diacritics(self, text):
        return re.sub(self.arabic_diacritics, "", text)

    def remove_repeating_char(self, text):
        # Collapse runs of the same character into a single occurrence.
        return re.sub(r"(.)\1+", r"\1", text)
```

Could you also add your config.yml?

```yaml
pipeline:
  - name: nlu.Textblob_ar.Textblob_AR
  - name: CountVectorsFeaturizer
  - name: EmbeddingIntentClassifier
  - name: nlu.crf_entity_extractor.CRFEntityExtractor
    BILOU_flag: true
    features:
      - [low, title, upper]
      - [low, bias, prefix5, prefix2, suffix5, suffix3, suffix2, upper, title, digit, pattern]
      - [low, title, upper]
  - name: rasa_addons.nlu.components.gazette.Gazette
  - name: EntitySynonymMapper
```

Is that the entire config.yml file? Also could you wrap in ticks (```) such that the code renders nicely?

(The same `TextCleaning` code as above, re-posted.)

That reads like it might be part of the actions.py file, but not the config.yml file. Could you send the latter?

Thanks for your answer. I know that; I wrote this part in a custom component, not in config.yml. My config.yml:

```yaml
pipeline:
  - name: nlu.Textblob_ar.Textblob_AR
  - name: CountVectorsFeaturizer
  - name: EmbeddingIntentClassifier
  - name: nlu.crf_entity_extractor.CRFEntityExtractor
    BILOU_flag: true
    features:
      - [low, title, upper]
      - [low, bias, prefix5, prefix2, suffix5, suffix3, suffix2, upper, title, digit, pattern]
      - [low, title, upper]
  - name: rasa_addons.nlu.components.gazette.Gazette
  - name: EntitySynonymMapper

policies:
  - name: FallbackPolicy
  - name: AugmentedMemoizationPolicy
  - name: MemoizationPolicy

language: ar
```

I don’t recognize nlu.Textblob_ar.Textblob_AR. Is it from the Rasa codebase or from another library?

Is it something made with this library?

OK, nlu.Textblob_ar.Textblob_AR was made with that library for the Arabic language; it lives in my nlu package and does some preprocessing on our data.

Ahhh, sorry, I think only now do I understand the problem. You've written a custom component that reads in text and applies a spell check, after which the words change and the entity detection can no longer match the entities correctly.
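One way to deal with that is to recompute each annotated span against the corrected text instead of reusing the old offsets. Here is a minimal sketch (the helper names are mine, not Rasa API) that uses Python's `difflib` to map the character positions that survive the spell check from the original text onto the corrected text:

```python
import difflib
from typing import Dict, Optional


def char_index_map(original: str, corrected: str) -> Dict[int, int]:
    """Map each character position in `original` that survives the
    spell check to its new position in `corrected`."""
    mapping: Dict[int, int] = {}
    matcher = difflib.SequenceMatcher(None, original, corrected)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":  # only untouched characters keep a position
            for offset in range(i2 - i1):
                mapping[i1 + offset] = j1 + offset
    return mapping


def remap_entity(entity: dict, original: str, corrected: str) -> Optional[dict]:
    """Return a copy of `entity` with start/end recomputed against
    `corrected`, or None if the annotated span itself was rewritten."""
    mapping = char_index_map(original, corrected)
    start = mapping.get(entity["start"])
    last = mapping.get(entity["end"] - 1)  # `end` is exclusive
    if start is None or last is None:
        return None
    return {**entity, "start": start, "end": last + 1,
            "value": corrected[start:last + 1]}
```

With that helper, your component could also clean the training data and shift the annotations along with the text, so training and inference see the same token boundaries. A hedged sketch of a `train` override, assuming the Rasa 1.x `Message` API where annotations live under the `"entities"` key:

```python
def train(self, training_data, cfg, **kwargs):
    # Apply the same cleaning as `process` to every training example,
    # and shift the annotated entity spans along with the text.
    for example in training_data.training_examples:
        original = example.text
        corrected = self.remove_repeating_char(
            self.normalize_arabic(self.remove_diacritics(original))
        )
        remapped = []
        for entity in example.get("entities", []):
            new_entity = remap_entity(entity, original, corrected)
            if new_entity is not None:
                remapped.append(new_entity)
        example.text = corrected
        example.set("entities", remapped)
```

If the spell checker rewrites the annotated word itself, both of its edge characters disappear from the mapping and `remap_entity` returns None; in that case you would have to fall back to, for example, searching the corrected text for the corrected form of the entity value.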

Thinking out loud: you are currently using an EntitySynonymMapper, but I wonder what happens if you remove it? Does that work? In that case you might avoid the issue by not requiring correct spellings. Or does your system have correct spelling as a hard requirement?

It also seems that your pipeline contains both a CRFEntityExtractor and an EntitySynonymMapper. Reading the documentation, it seems you may want to be careful there: the EntitySynonymMapper modifies existing entities, meaning it may wrongly "correct" entities found by the CRFEntityExtractor.
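For context, this is roughly what a synonym block looks like in Rasa 1.x Markdown training data (the values below are made up); the EntitySynonymMapper replaces any extracted entity value listed in the body with the canonical value from the heading:

```
## synonym:cairo
- القاهرة
- القاهره
```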

If you're interested in experimenting, I might also recommend trying out the DIETClassifier. It detects both intents and entities.
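A minimal pipeline sketch with it, assuming Rasa 1.8 or later (DIETClassifier predicts intents and entities jointly, so it would replace both your EmbeddingIntentClassifier and CRFEntityExtractor; the epochs value is just an example):

```yaml
pipeline:
  - name: nlu.Textblob_ar.Textblob_AR
  - name: CountVectorsFeaturizer
  - name: DIETClassifier
    epochs: 100
```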

Can you share a basic example of entity extraction using regexes?

Check out this Rasa Masterclass tutorial by @Juste.

For your query, you can jump straight to timestamp 3:05.
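For a quick reference in the meantime: in Rasa 1.x, regexes support entity extraction through the RegexFeaturizer, which produces the `pattern` feature your CRF config already lists. You declare the regex in the NLU training data; the entity name and pattern below are made up:

```
## regex:flight_number
- [A-Z]{2}[0-9]{3,4}
```

Then make sure a RegexFeaturizer entry appears in the pipeline before the CRFEntityExtractor. Note that the regex match becomes a feature the CRF can learn from, not a hard extraction rule.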