Misaligned entity annotation

Hi guys! I keep getting this extractor error during NLU training.

rasa_nlu.extractors.crf_entity_extractor - Misaligned entity annotation in sentence ‘What was today’s Attendance?’. Make sure the start and end values of the annotated training examples end at token boundaries (e.g. don’t include trailing whitespaces or punctuation).

I tried to remove whitespaces and everything. Does this affect entity prediction or intent classification?

Can you attach a screenshot of your training data?

Here


Is there an update on this? I am still getting those when using the rasa nlu trainer to generate the training data for nlu training.


I think I understand the problem. It is related to tokenization and the fact that the indices of the entity do not fall exactly on token boundaries. You either need to change tokenization such that the entity will be marked on the edges of a token always, or you have to change the indices of the entity to include the character which is in the same token as the entity but not marked down as such. Does it make sense?

Example with a zip code (using whitespace tokenization):

good: I am from this zip code 93333

bad: I am from this zip code,93333
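To make the zip code example concrete, here is a minimal sketch that checks whether an entity's start and end offsets fall on token boundaries. It assumes a plain whitespace split as the tokenizer, which may differ from what your Rasa pipeline actually does; `token_boundaries` and `is_aligned` are hypothetical helper names, not part of Rasa.

```python
import re


def token_boundaries(text):
    """Collect start and end character offsets of whitespace-separated tokens.

    Assumption: tokens are maximal runs of non-whitespace characters.
    """
    starts, ends = set(), set()
    for match in re.finditer(r"\S+", text):
        starts.add(match.start())
        ends.add(match.end())
    return starts, ends


def is_aligned(text, start, end):
    """Return True if [start, end) begins and ends exactly on token boundaries."""
    starts, ends = token_boundaries(text)
    return start in starts and end in ends


good = "I am from this zip code 93333"
bad = "I am from this zip code,93333"

for text in (good, bad):
    start = text.index("93333")
    end = start + len("93333")
    print(repr(text), "aligned:", is_aligned(text, start, end))
```

In the bad sentence, "code,93333" is a single token under whitespace splitting, so the entity starts in the middle of a token and the check fails.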

I encountered this problem too, but I’m using jieba_tokenizer for Chinese. It got fixed after I put all my entities into jieba’s user dict.

I had this problem once. My issue was that I had included a “.” inside the entity annotation in the training file.

Like this:

What is the [employee id.](id_details) of [kumar saurav](name)

I removed the “.” after employee id and it worked perfectly.

The changed line was:

What is the [employee id](id_details) of [kumar saurav](name)
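If you have a lot of training examples, a small script can flag annotations like this one before training. The sketch below assumes the `[value](entity)` markdown format shown above and simply looks for entity values that start or end with whitespace or ASCII punctuation; `find_suspicious_entities` is a hypothetical helper, not a Rasa function.

```python
import re
import string

# Matches markdown-style annotations of the form [value](entity).
ENTITY_PATTERN = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")


def find_suspicious_entities(line):
    """Return (value, entity) pairs whose value has edge whitespace or punctuation."""
    suspicious = []
    for value, entity in ENTITY_PATTERN.findall(line):
        has_edge_space = value != value.strip()
        has_edge_punct = (
            value[0] in string.punctuation or value[-1] in string.punctuation
        )
        if has_edge_space or has_edge_punct:
            suspicious.append((value, entity))
    return suspicious


before = "What is the [employee id.](id_details) of [kumar saurav](name)"
after = "What is the [employee id](id_details) of [kumar saurav](name)"

print(find_suspicious_entities(before))  # [('employee id.', 'id_details')]
print(find_suspicious_entities(after))   # []
```

This only catches punctuation and whitespace at the edges of the annotated value. It would not catch the comma case from the zip code example, where the punctuation sits outside the annotation but inside the same token.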

I’m using JiebaTokenizer for Chinese too, but after I put my entities in jieba’s user_dict, the problem still exists. I’m on rasa 1.10.