CRFEntityExtractor with spell checker

I have a problem when using a spell checker: the token boundaries change, and the CRFEntityExtractor accuracy decreases. Can I adjust the entity boundaries after replacing the training text with the spell-checked text?

Problem example: token boundary error for token (المسافرين) at (start: 61, end: 71), with entity annotation {'start': 64, 'end': 75, 'value': 'المسافرين', 'entity': 'departure'}.

After running the spell checker on the training example, the token boundary for (المسافرين) is still (start: 61, end: 71), but the entity annotation no longer matches the correct token.

How can I solve this problem?

Our class for the spell checker:

```python
import logging
import re
from typing import Any, Dict, Optional, Text

from rasa.nlu.components import Component


class TextCleaning(Component):
    """Custom component that normalizes Arabic text before tokenization."""

    provides = ["text"]
    language_list = None

    def __init__(self, component_config=None):
        super(TextCleaning, self).__init__(component_config)
        # Diacritics and the tatweel character to strip from the text.
        self.arabic_diacritics = re.compile("""
                            ّ    | # Tashdid
                            َ    | # Fatha
                            ً    | # Tanwin Fath
                            ُ    | # Damma
                            ٌ    | # Tanwin Damm
                            ِ    | # Kasra
                            ٍ    | # Tanwin Kasr
                            ْ    | # Sukun
                            ـ     # Tatwil/Kashida
                            """, re.VERBOSE)

    def train(self, training_data, cfg, **kwargs):
        pass

    def process(self, message, **kwargs):
        # Clean the incoming message text in place.
        text = self.remove_diacritics(message.text)
        text = self.normalize_arabic(text)
        text = self.remove_repeating_char(text)
        message.text = text
        logging.debug("message.text in clean_text {}".format(message.text))

    def persist(self, file_name: Text, model_dir: Text) -> Optional[Dict[Text, Any]]:
        pass

    @classmethod
    def load(
        cls,
        meta: Dict[Text, Any],
        model_dir: Optional[Text] = None,
        model_metadata: Optional["Metadata"] = None,
        cached_component: Optional["Component"] = None,
        **kwargs: Any
    ) -> "Component":
        """Load this component from file."""
        if cached_component:
            return cached_component
        return cls(meta)

    def normalize_arabic(self, text):
        # Collapse common Arabic letter variants into one canonical form.
        text = re.sub("[إأآا]", "ا", text)
        text = re.sub("ى", "ي", text)
        text = re.sub("ؤ", "ء", text)
        text = re.sub("ئ", "ء", text)
        text = re.sub("ة", "ه", text)
        text = re.sub("گ", "ك", text)
        return text

    def remove_diacritics(self, text):
        return re.sub(self.arabic_diacritics, "", text)

    def remove_repeating_char(self, text):
        # Collapse runs of the same character into a single occurrence.
        return re.sub(r"(.)\1+", r"\1", text)
```

Could you also add your config.yml?

```yaml
pipeline:
  - name: nlu.Textblob_ar.Textblob_AR
  - name: CountVectorsFeaturizer
  - name: EmbeddingIntentClassifier
  - name: nlu.crf_entity_extractor.CRFEntityExtractor
    BILOU_flag: true
    features:
      - [low, title, upper]
      - [low, bias, prefix5, prefix2, suffix5, suffix3, suffix2, upper, title, digit, pattern]
      - [low, title, upper]
  - name: rasa_addons.nlu.components.gazette.Gazette
  - name: EntitySynonymMapper
```

Is that the entire config.yml file? Also could you wrap in ticks (```) such that the code renders nicely?

(The same `TextCleaning` code as above, re-posted.)

That reads like it might be part of the actions.py file, but not the config.yml file. Could you send the latter?

Thanks for your answer. I know that; I wrote this part in a custom component, not in config.yml. My config.yml:

```yaml
pipeline:
  - name: nlu.Textblob_ar.Textblob_AR
  - name: CountVectorsFeaturizer
  - name: EmbeddingIntentClassifier
  - name: nlu.crf_entity_extractor.CRFEntityExtractor
    BILOU_flag: true
    features:
      - [low, title, upper]
      - [low, bias, prefix5, prefix2, suffix5, suffix3, suffix2, upper, title, digit, pattern]
      - [low, title, upper]
  - name: rasa_addons.nlu.components.gazette.Gazette
  - name: EntitySynonymMapper

policies:
  - name: FallbackPolicy
  - name: AugmentedMemoizationPolicy
  - name: MemoizationPolicy

language: ar
```

I don’t recognize nlu.Textblob_ar.Textblob_AR. Is it from the Rasa codebase or from another library?

Is it something made with this library?

OK, nlu.Textblob_ar.Textblob_AR was made with that library for the Arabic language; it lives in my nlu package and does some preprocessing on our data.

Ahhh, sorry, I think only now do I understand the problem. You've written a custom component that reads in text and applies a spell check, after which the words change and the entity detection can no longer match the entities correctly.
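One way to deal with that is to recompute each annotated span against the corrected text instead of reusing the old offsets. Here is a minimal sketch (the helper names are mine, not Rasa API) that uses Python's `difflib` to map the character positions that survive the spell check from the original text onto the corrected text:

```python
import difflib
from typing import Dict, Optional


def char_index_map(original: str, corrected: str) -> Dict[int, int]:
    """Map each character position in `original` that survives the
    spell check to its new position in `corrected`."""
    mapping: Dict[int, int] = {}
    matcher = difflib.SequenceMatcher(None, original, corrected)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":  # only untouched characters keep a position
            for offset in range(i2 - i1):
                mapping[i1 + offset] = j1 + offset
    return mapping


def remap_entity(entity: dict, original: str, corrected: str) -> Optional[dict]:
    """Return a copy of `entity` with start/end recomputed against
    `corrected`, or None if the annotated span itself was rewritten."""
    mapping = char_index_map(original, corrected)
    start = mapping.get(entity["start"])
    last = mapping.get(entity["end"] - 1)  # `end` is exclusive
    if start is None or last is None:
        return None
    return {**entity, "start": start, "end": last + 1,
            "value": corrected[start:last + 1]}
```

With that helper, your component could also clean the training data and shift the annotations along with the text, so training and inference see the same token boundaries. A hedged sketch of a `train` override, assuming the Rasa 1.x `Message` API where annotations live under the `"entities"` key:

```python
def train(self, training_data, cfg, **kwargs):
    # Apply the same cleaning as `process` to every training example,
    # and shift the annotated entity spans along with the text.
    for example in training_data.training_examples:
        original = example.text
        corrected = self.remove_repeating_char(
            self.normalize_arabic(self.remove_diacritics(original))
        )
        remapped = []
        for entity in example.get("entities", []):
            new_entity = remap_entity(entity, original, corrected)
            if new_entity is not None:
                remapped.append(new_entity)
        example.text = corrected
        example.set("entities", remapped)
```

If the spell checker rewrites the annotated word itself, both of its edge characters disappear from the mapping and `remap_entity` returns None; in that case you would have to fall back to, for example, searching the corrected text for the corrected form of the entity value.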

Thinking out loud: you are currently using an EntitySynonymMapper, but I wonder what happens if you remove it? Does that work? In that case you might avoid the issue by not requiring correct spellings. Or does your system have correct spelling as a hard requirement?

It also seems that your pipeline contains both a CRFEntityExtractor and an EntitySynonymMapper. Reading the documentation, it seems you may want to be careful there: the EntitySynonymMapper modifies existing entities, meaning it may wrongly "correct" entities found by the CRFEntityExtractor.
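For context, this is roughly what a synonym block looks like in Rasa 1.x Markdown training data (the values below are made up); the EntitySynonymMapper replaces any extracted entity value listed in the body with the canonical value from the heading:

```
## synonym:cairo
- القاهرة
- القاهره
```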

If you're interested in experimenting, I might also recommend trying out the DIETClassifier. It detects both intents and entities.
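A minimal pipeline sketch with it, assuming Rasa 1.8 or later (DIETClassifier predicts intents and entities jointly, so it would replace both your EmbeddingIntentClassifier and CRFEntityExtractor; the epochs value is just an example):

```yaml
pipeline:
  - name: nlu.Textblob_ar.Textblob_AR
  - name: CountVectorsFeaturizer
  - name: DIETClassifier
    epochs: 100
```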

Can you share a basic example of entity extraction using regexes?

Check out this Rasa Masterclass tutorial by @Juste.

For your query, you can jump straight to timestamp 3:05.
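For a quick reference in the meantime: in Rasa 1.x, regexes support entity extraction through the RegexFeaturizer, which produces the `pattern` feature your CRF config already lists. You declare the regex in the NLU training data; the entity name and pattern below are made up:

```
## regex:flight_number
- [A-Z]{2}[0-9]{3,4}
```

Then make sure a RegexFeaturizer entry appears in the pipeline before the CRFEntityExtractor. Note that the regex match becomes a feature the CRF can learn from, not a hard extraction rule.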