I am getting the following warning while training the model:
“UserWarning: Misaligned entity annotation in message…” and, as a result, entities are not recognized correctly.
The problem disappears when I do not use stopwords. It seems that the entity alignment is checked against the token positions after stopword removal, or something similar. How can I work around this problem? Is some configuration needed?
Hello Vladimir,
When a user sends a message, it goes through the pipeline. The first step is a tokenizer. Inside the tokenizer, before the tokens are created, every stopword is removed from the message.
Entities are aligned with tokens based on the indices of the first and last characters in the input text. My guess is that in your custom component, after removing stopwords, you created a discrepancy between the tokens and the input text, leading to entity misalignment.
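To make the alignment idea concrete, here is a minimal sketch in plain Python. It does not use Rasa's actual classes or checks; `tokenize_with_offsets`, `is_aligned`, the stopword set, and the sample sentence are all hypothetical and only illustrate how computing token offsets on stopword-stripped text can break the match with an entity annotated against the original text.

```python
import re


def tokenize_with_offsets(text):
    """Return (token_text, start, end) tuples; offsets index into `text`."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def is_aligned(tokens, entity_start, entity_end):
    """Hypothetical check: the entity must begin at a token start
    and finish at a token end."""
    starts = {start for _, start, _ in tokens}
    ends = {end for _, _, end in tokens}
    return entity_start in starts and entity_end in ends


text = "book a flight to New York"

# The entity "New York" is annotated against the ORIGINAL text.
entity_start = text.index("New York")
entity_end = entity_start + len("New York")

# Case 1: tokens computed on the original text.
original_tokens = tokenize_with_offsets(text)
aligned_on_original = is_aligned(original_tokens, entity_start, entity_end)

# Case 2: stopwords are stripped from the text BEFORE tokenizing,
# so every later token shifts to the left.
stripped = " ".join(w for w in text.split() if w not in {"a", "to"})
stripped_tokens = tokenize_with_offsets(stripped)
aligned_on_stripped = is_aligned(stripped_tokens, entity_start, entity_end)

print(aligned_on_original, aligned_on_stripped)
```

In the second case the entity offsets still refer to the original string, while the token offsets refer to the shortened one, so the two no longer line up.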
Hello Vladimir, yes, this is the case, because after stopword removal the entity has a different position than it had initially. What I don’t understand is the following: how do we remove stopwords then? Do we remove them but still keep their positions, as “empty” tokens somehow?
I’m not sure here, sorry; I haven’t looked at the code for quite some time. But I think that as long as the offsets of the tokens correspond to the original indices in the text, it should work.
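A rough sketch of that idea, in plain Python rather than the real tokenizer interface: tokenize the original text first, record each token's start and end, and only then drop the stopwords. The names `tokenize_keep_offsets` and `STOPWORDS` are hypothetical, and a real custom component would have to wrap these tuples in whatever token objects the pipeline expects.

```python
import re

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"a", "to", "the"}


def tokenize_keep_offsets(text, stopwords=STOPWORDS):
    """Tokenize the ORIGINAL text, then filter out stopwords.

    Offsets are taken before filtering, so each surviving token still
    points at its position in the unmodified input.
    """
    tokens = []
    for match in re.finditer(r"\S+", text):
        if match.group().lower() in stopwords:
            continue  # skip the stopword, but do not shift later offsets
        tokens.append((match.group(), match.start(), match.end()))
    return tokens


text = "book a flight to New York"
tokens = tokenize_keep_offsets(text)

for token_text, start, end in tokens:
    # Every token can be recovered from the original text by slicing.
    print(token_text, start, end, text[start:end] == token_text)
```

One caveat under the same assumptions: if a stopword sits at the edge of an annotated entity (for example a leading "the"), dropping that token would still leave the entity boundary without a matching token, so such cases may need separate handling.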