Hi guys! I keep getting this extractor error during NLU training:
rasa_nlu.extractors.crf_entity_extractor - Misaligned entity annotation in sentence ‘What was today’s Attendance?’. Make sure the start and end values of the annotated training examples end at token boundaries (e.g. don’t include trailing whitespaces or punctuation).
I have already tried removing extra whitespace and the like. Does this affect entity prediction or intent classification?
I think I understand the problem. It is related to tokenization: the start and end indices of the annotated entity do not fall exactly on token boundaries. You either need to change the tokenization so that the entity always begins and ends at the edge of a token, or change the entity's indices so that they also cover the characters that sit in the same token as the entity but are not annotated as part of it. Does that make sense?
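Example with a zip code (using whitespace tokenization):
good: I am from this zip code 93333
bad: I am from this zip code,93333
In the bad case there is no space after the comma, so "code,93333" is a single token and the entity 93333 starts in the middle of it.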
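If it helps, here is a rough way to check your own training examples for this. It is only a sketch that assumes plain whitespace tokenization; it is not the tokenizer Rasa actually uses, and the helper names token_spans and is_aligned are made up for illustration.

```python
import re


def token_spans(text):
    """Return (start, end) character offsets of whitespace-separated tokens.

    Assumption: tokens are maximal runs of non-whitespace characters.
    """
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def is_aligned(text, start, end):
    """True if start is the start of some token and end is the end of some token."""
    spans = token_spans(text)
    starts = {s for s, _ in spans}
    ends = {e for _, e in spans}
    return start in starts and end in ends


good = "I am from this zip code 93333"
bad = "I am from this zip code,93333"

for text in (good, bad):
    # Locate the annotated entity the same way a training file would: by offsets.
    start = text.index("93333")
    end = start + len("93333")
    print(repr(text), "aligned:", is_aligned(text, start, end))
```

With whitespace tokenization the first sentence reports the entity as aligned and the second does not, because the entity shares a token with "code,". Running a check like this over all annotated examples should point you at the ones that trigger the warning.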