In domains such as medicine, many entities consist of multiple tokens. For such entities, the meaning of the individual tokens can be very different from the meaning of the whole entity (multiple words).
When I run NLU, Rasa recognizes a few such entities, but in most cases it splits the entity into several single-token entities. Example: “Brake Pad” is split into “Brake” and “Pad”.
I think one reason could be the word vector model, which does not have a vector for the whole entity. In such a case the model builds a vector from subword-level embeddings. As these vectors do not carry much context-related information, NER does not do well.
I am thinking of creating a new vector model for these types of entities, so that multi-word entities get a better vector representation.
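To make the idea concrete, here is a minimal sketch of one possible preprocessing step: joining known multi-word entities into single tokens (for example "Brake_Pad") before training the vectors, so that the whole entity gets its own vector. This is only an illustration under my own assumptions. The function name `merge_phrases`, the `PHRASES` set, and the underscore separator are hypothetical and not part of Rasa or any vector library.

```python
from typing import List, Set, Tuple


def merge_phrases(
    tokens: List[str],
    phrases: Set[Tuple[str, ...]],
    sep: str = "_",
) -> List[str]:
    """Greedily join known multi-word phrases into single tokens.

    Matching is case-insensitive and prefers the longest phrase
    starting at the current position.
    """
    max_len = max((len(p) for p in phrases), default=1)
    lowered = [t.lower() for t in tokens]
    merged: List[str] = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate window first, down to two tokens.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            if tuple(lowered[i:i + n]) in phrases:
                merged.append(sep.join(tokens[i:i + n]))
                i += n
                break
        else:
            # No phrase starts here, so keep the single token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical phrase list; in practice it would come from a domain lexicon.
PHRASES = {("brake", "pad"), ("brake", "pad", "sensor")}

print(merge_phrases("Replace the Brake Pad today".split(), PHRASES))
# → ['Replace', 'the', 'Brake_Pad', 'today']
```

The same merging would presumably have to be applied at inference time as well, so that the tokens seen by NLU match the vocabulary of the new vector model.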
My question: if I have a new vector model, do I need to do anything else in Rasa to handle multi-word tokens? My thoughts (please correct me if you have another suggestion or comment):
- I think I will have to write a component to handle the new vector model.
- Do I need to do anything else for the DIET classifier?
Thank you, Abhishek