I have a custom spell checker component at the start of my pipeline, but I just noticed that during evaluation I get the following errors:
2019-01-21 15:11:59 DEBUG __main__ - Token boundary error for token mari(32, 36) and entity {'start': 33, 'end': 37, 'value': 'mari', 'entity': 'famille'}
2019-01-21 15:11:59 DEBUG __main__ - Token boundary error for token pédicure(21, 30) and entity {'start': 21, 'end': 29, 'value': 'pédicure', 'entity': 'specialite_medecine_douce'}
2019-01-21 15:11:59 DEBUG __main__ - Token boundary error for token shiatsu(34, 41) and entity {'start': 35, 'end': 42, 'value': 'shiatsu', 'entity': 'specialite_medecine_douce'}
2019-01-21 15:11:59 DEBUG __main__ - Token boundary error for token homéopathie(14, 26) and entity {'start': 11, 'end': 24, 'value': "l'homéopathie", 'entity': 'specialite_medecine_douce'}
The main problem is that my spell checker doesn't handle elisions like d', l', s', and m', which are very common in French (and I can't add every elided word to my dictionary). So when the correction is applied and the tokenizer runs on the corrected text, the entity start/end offsets shift, and I get the warnings/errors above.
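The offset shift can be reproduced in a few lines. This is a sketch with a made-up example sentence and a naive whitespace tokenizer standing in for the real one; the entity offsets mirror the "l'homéopathie" case from the log above:

```python
# Annotated training example: offsets refer to the *raw* text.
raw = "je cherche l'homéopathie"
entity = {"start": 11, "end": 24, "value": "l'homéopathie",
          "entity": "specialite_medecine_douce"}

# Sanity check: the annotation is correct on the raw text.
assert raw[entity["start"]:entity["end"]] == entity["value"]

# A hypothetical "correction" that strips the elided article l',
# shortening the text by two characters.
corrected = raw.replace("l'", "")

def token_spans(text):
    """Naive whitespace tokenizer returning (token, start, end) spans."""
    spans, pos = [], 0
    for tok in text.split():
        start = text.index(tok, pos)
        spans.append((tok, start, start + len(tok)))
        pos = start + len(tok)
    return spans

spans = token_spans(corrected)
# The token no longer lines up with the annotated boundaries (11, 24):
print(spans[-1])  # → ('homéopathie', 11, 22)
```

Any correction that changes the text length makes all downstream offsets stale, which is exactly what the boundary check is complaining about.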
The CRF doesn't seem affected by this (it still finds the entities), but it bothers me.
I thought of a solution, but I don't know if I can really do this: train and evaluate without the spell checker, but use it in production. Could that work, or would I run into bugs since the pipelines wouldn't be the same?
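An alternative to splitting the pipeline would be to make the spell checker keep the annotations consistent itself: if the corrector records each edit it makes (position, replaced text, replacement), the entity offsets can be remapped after correction. This is only a sketch under that assumption; `apply_corrections` is a hypothetical helper, not part of any library:

```python
def apply_corrections(text, edits, entities):
    """Apply (start, old, new) edits left to right on `text` and shift
    entity start/end offsets by each edit's length delta.
    Edit positions refer to the original text."""
    offset = 0                       # cumulative length change so far
    out = text
    ents = [dict(e) for e in entities]
    for start, old, new in edits:
        pos = start + offset         # position in the partially edited text
        out = out[:pos] + new + out[pos + len(old):]
        delta = len(new) - len(old)
        for e in ents:
            # Shift any boundary that lies at or after the edited region.
            if e["start"] >= pos + len(old):
                e["start"] += delta
            if e["end"] >= pos + len(old):
                e["end"] += delta
        offset += delta
    return out, ents

# Dropping the elided l' shrinks the text, so the entity end moves from 24 to 22.
text, ents = apply_corrections(
    "je cherche l'homéopathie",
    [(11, "l'", "")],
    [{"start": 11, "end": 24}],
)
print(text, ents)  # → je cherche homéopathie [{'start': 11, 'end': 22}]
```

With remapped offsets, training, evaluation, and production could all run the same pipeline, so the mismatch question wouldn't arise.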