I have a custom spell checker component at the start of my pipeline, but I just noticed that during evaluation I get the following errors:
2019-01-21 15:11:59 DEBUG __main__ - Token boundary error for token mari(32, 36) and entity {'start': 33, 'end': 37, 'value': 'mari', 'entity': 'famille'}
2019-01-21 15:11:59 DEBUG __main__ - Token boundary error for token pédicure(21, 30) and entity {'start': 21, 'end': 29, 'value': 'pédicure', 'entity': 'specialite_medecine_douce'}
2019-01-21 15:11:59 DEBUG __main__ - Token boundary error for token shiatsu(34, 41) and entity {'start': 35, 'end': 42, 'value': 'shiatsu', 'entity': 'specialite_medecine_douce'}
2019-01-21 15:11:59 DEBUG __main__ - Token boundary error for token homéopathie(14, 26) and entity {'start': 11, 'end': 24, 'value': "l'homéopathie", 'entity': 'specialite_medecine_douce'}
The main problem is that my spell checker doesn't handle elisions like d', l', s', and m', which are very common in French (and I can't add every elided word to my dictionary). So when the correction is applied and the tokenizer runs on the corrected text, the entity start/end offsets shift, and I get the warnings/errors above.
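The offset shift can be reproduced in a few lines. This is a sketch with a made-up example sentence and a naive whitespace tokenizer standing in for the real one; the entity offsets mirror the "l'homéopathie" case from the log above:

```python
# Annotated training example: offsets refer to the *raw* text.
raw = "je cherche l'homéopathie"
entity = {"start": 11, "end": 24, "value": "l'homéopathie",
          "entity": "specialite_medecine_douce"}

# Sanity check: the annotation is correct on the raw text.
assert raw[entity["start"]:entity["end"]] == entity["value"]

# A hypothetical "correction" that strips the elided article l',
# shortening the text by two characters.
corrected = raw.replace("l'", "")

def token_spans(text):
    """Naive whitespace tokenizer returning (token, start, end) spans."""
    spans, pos = [], 0
    for tok in text.split():
        start = text.index(tok, pos)
        spans.append((tok, start, start + len(tok)))
        pos = start + len(tok)
    return spans

spans = token_spans(corrected)
# The token no longer lines up with the annotated boundaries (11, 24):
print(spans[-1])  # → ('homéopathie', 11, 22)
```

Any correction that changes the text length makes all downstream offsets stale, which is exactly what the boundary check is complaining about.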
The CRF doesn't seem affected by this (it still finds the entities), but it bothers me.
I thought of a solution, but I don't know if I can really do this: train and evaluate without the spell checker, but use it in production. Could that work, or would I run into bugs since the pipelines wouldn't be the same?
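An alternative to splitting the pipeline would be to make the spell checker keep the annotations consistent itself: if the corrector records each edit it makes (position, replaced text, replacement), the entity offsets can be remapped after correction. This is only a sketch under that assumption; `apply_corrections` is a hypothetical helper, not part of any library:

```python
def apply_corrections(text, edits, entities):
    """Apply (start, old, new) edits left to right on `text` and shift
    entity start/end offsets by each edit's length delta.
    Edit positions refer to the original text."""
    offset = 0                       # cumulative length change so far
    out = text
    ents = [dict(e) for e in entities]
    for start, old, new in edits:
        pos = start + offset         # position in the partially edited text
        out = out[:pos] + new + out[pos + len(old):]
        delta = len(new) - len(old)
        for e in ents:
            # Shift any boundary that lies at or after the edited region.
            if e["start"] >= pos + len(old):
                e["start"] += delta
            if e["end"] >= pos + len(old):
                e["end"] += delta
        offset += delta
    return out, ents

# Dropping the elided l' shrinks the text, so the entity end moves from 24 to 22.
text, ents = apply_corrections(
    "je cherche l'homéopathie",
    [(11, "l'", "")],
    [{"start": 11, "end": 24}],
)
print(text, ents)  # → je cherche homéopathie [{'start': 11, 'end': 22}]
```

With remapped offsets, training, evaluation, and production could all run the same pipeline, so the mismatch question wouldn't arise.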