Token boundary errors with spell-checking

I have a custom spell checker component that sits at the start of my pipeline, but I just noticed that during evaluation I get the following errors:

2019-01-21 15:11:59 DEBUG    __main__  - Token boundary error for token mari(32, 36) and entity {'start': 33, 'end': 37, 'value': 'mari', 'entity': 'famille'}
2019-01-21 15:11:59 DEBUG    __main__  - Token boundary error for token pédicure(21, 30) and entity {'start': 21, 'end': 29, 'value': 'pédicure', 'entity': 'specialite_medecine_douce'}
2019-01-21 15:11:59 DEBUG    __main__  - Token boundary error for token shiatsu(34, 41) and entity {'start': 35, 'end': 42, 'value': 'shiatsu', 'entity': 'specialite_medecine_douce'}
2019-01-21 15:11:59 DEBUG    __main__  - Token boundary error for token homéopathie(14, 26) and entity {'start': 11, 'end': 24, 'value': "l'homéopathie", 'entity': 'specialite_medecine_douce'}

The main problem is that my spell checker doesn't handle elisions like d', l', s', m' and so on, which are really common in French (and I can't add every elided form to my dictionary). So once the correction is applied and the tokenizer runs, the entity start/end offsets shift and I get the warnings/errors above.
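One way around the dictionary problem, sketched below, is to split off the elided prefix before looking the word up, so only the base form needs to be in the dictionary. This is only a rough illustration: `known_words` and `correct_word` are hypothetical stand-ins for whatever the spell checker actually uses.

```python
import re

# Common French elided prefixes: c', d', j', l', m', n', s', t', qu'
ELISION = re.compile(r"^([cdjlmnst]|qu)['’](?=\w)", re.IGNORECASE)

def correct_with_elision(token, known_words, correct_word):
    """Correct a token while leaving any elided prefix (l', d', ...) untouched."""
    match = ELISION.match(token)
    prefix, rest = ("", token) if match is None else (token[:match.end()], token[match.end():])
    if rest.lower() in known_words:
        return token                       # already a known word, keep as-is
    return prefix + correct_word(rest)     # only the base form gets corrected

# e.g. "l'homéopathie" is split into "l'" + "homéopathie", so only
# "homéopathie" has to be in the dictionary.
```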

The CRF doesn't seem to be disturbed by this, since it still finds the entities, but it bothers me.

I thought of a solution, but I don't know if I can really do this: train and evaluate without the spell checker, but use it in production. Could that work, or will I run into bugs since it's not the same pipeline?
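One hedged sketch of that idea is to keep the spell checker in every pipeline but gate the correction behind a config flag, so training/evaluation and production only differ in that flag rather than in the pipeline structure. This is not the actual component: the interface assumed here is the old rasa_nlu custom Component API, and the class name, the `enabled` flag, and the `_correct()` helper are made up for illustration.

```python
from rasa_nlu.components import Component


class SpellChecker(Component):
    name = "spell_checker"
    provides = ["text"]
    defaults = {"enabled": True}   # set "enabled: false" in the training/eval config

    def train(self, training_data, cfg, **kwargs):
        # Deliberately a no-op so gold entity offsets in the training data
        # are never shifted by corrections.
        pass

    def process(self, message, **kwargs):
        if not self.component_config.get("enabled", True):
            return                           # leave text and offsets untouched
        message.text = self._correct(message.text)

    def _correct(self, text):
        # placeholder for the real spell-checking logic
        return text
```

Training and evaluation would then run with `enabled: false` and production with `enabled: true`, so both environments share the same pipeline structure.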

Hi huberrom. Did you get any solution to this issue?

I am trying to create a spell checker where my custom component takes tokens and, after processing, returns corrected tokens for the next component to work on. I think I will face the same issue you were facing 7 months ago.
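For illustration, here is a minimal, framework-agnostic sketch of that setup: correcting tokens while recomputing their character offsets, so the rebuilt text and the token boundaries stay consistent for the next component. The `correct_word` helper is hypothetical, and tokens are represented as plain `(text, start)` pairs rather than any particular Rasa Token class.

```python
def correct_tokens(tokens, correct_word):
    """tokens: list of (text, start) pairs from the original message.
    Returns (corrected tokens as (text, start, end) triples, rebuilt text)."""
    corrected, pieces = [], []
    cursor = prev_end = 0
    for text, start in tokens:
        gap = " " * (start - prev_end)        # keep the original spacing
        word = correct_word(text)             # the correction may change the length
        new_start = cursor + len(gap)
        corrected.append((word, new_start, new_start + len(word)))
        pieces.append(gap + word)
        cursor = new_start + len(word)
        prev_end = start + len(text)
    return corrected, "".join(pieces)
```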