Handling multi-word entities

In domains such as medicine, many entities are composed of multiple tokens. For such entities, the meaning of the individual tokens can be very different from the meaning of the whole (multi-word) entity.

When I run NLU, Rasa recognizes a few of these entities, but in most cases it splits the entity into multiple single-token entities. Example: “Brake Pad” is split into “Brake” and “Pad”.

I think one of the reasons could be the word vector model, as it does not have a vector for the whole entity. In such a case, the model builds a vector from subword-level embeddings. As these vectors do not carry much context-related information, NER fails to do well.
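To make the concern concrete, here is a toy sketch of a phrase vector built purely from subword pieces. It assumes a fastText-style scheme (character n-grams, simple averaging); the hash-based `ngram_vector` is only a stand-in for learned embeddings, and none of the function names come from Rasa or any real library.

```python
import hashlib

DIM = 8  # tiny dimension, for illustration only


def ngram_vector(ngram: str) -> list[float]:
    """Deterministic stand-in for a learned n-gram embedding (assumption)."""
    digest = hashlib.sha256(ngram.encode("utf-8")).digest()
    return [(b / 255.0) * 2 - 1 for b in digest[:DIM]]


def char_ngrams(word: str, n: int = 3) -> list[str]:
    """Character n-grams of a word padded with boundary markers."""
    padded = f"<{word}>"
    if len(padded) <= n:
        return [padded]
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def average(vectors: list[list[float]]) -> list[float]:
    """Element-wise mean of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def word_vector(word: str) -> list[float]:
    """Word vector composed only from its character n-grams."""
    return average([ngram_vector(g) for g in char_ngrams(word.lower())])


def phrase_vector_from_tokens(phrase: str) -> list[float]:
    """Phrase vector composed only from its token vectors."""
    return average([word_vector(token) for token in phrase.split()])


# The phrase vector is fully determined by its parts, so word order and any
# phrase-specific meaning are lost: both calls return the same vector.
same = phrase_vector_from_tokens("brake pad") == phrase_vector_from_tokens("pad brake")
```

Under this scheme, nothing in the vector for “Brake Pad” reflects that the two words form one unit, which is the kind of missing information I suspect hurts the NER.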

I am thinking of creating a new vector model for these types of entities, so that multi-word entities have a better vector representation.
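One common way to get a dedicated vector for a phrase is to merge known multi-word entities into single tokens before training the embeddings. A minimal sketch of that preprocessing step, assuming a hand-made phrase list (the `merge_phrases` helper and the `_` joiner are hypothetical, not part of Rasa):

```python
def merge_phrases(tokens: list[str], phrases: list[str], joiner: str = "_") -> list[str]:
    """Greedy longest-match merge of known multi-word phrases into single tokens."""
    phrase_set = {tuple(p.lower().split()) for p in phrases}
    max_len = max((len(p) for p in phrase_set), default=1)
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate window first, down to two tokens.
        for size in range(min(max_len, len(tokens) - i), 1, -1):
            window = tuple(t.lower() for t in tokens[i:i + size])
            if window in phrase_set:
                merged.append(joiner.join(window))
                i += size
                break
        else:
            # No phrase starts here: keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged


sentence = "Replace the brake pad today".split()
print(merge_phrases(sentence, ["brake pad"]))
# ['Replace', 'the', 'brake_pad', 'today']
```

A corpus preprocessed this way would give `brake_pad` its own vector during training, but the same merging would then also have to happen at inference time, before the featurizer sees the text.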

My question is: if we have a new vector model, do I need to do something else in Rasa to handle multi-word entities? My thoughts (please correct me if you have another suggestion or comment):

  1. I think I will have to write a component to handle the new vector model.
  2. Do I need to do anything else for the DIET classifier?

Thank you, Abhishek

Hi @ashek1520

Are you using both CRF & DIET for entity extraction? If you are using DIET, ensure you have enough training examples with multiple word entities and then increase the epochs in the config file.

DIET requires more training examples and more epochs in this case.
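If writing enough examples by hand is tedious, a small script can expand templates into annotated examples. This is only a sketch: it uses Rasa's `[text](entity)` annotation syntax, but the entity name `part`, the templates, and the helper names are made up for illustration. The number of epochs is a separate change, made via the `epochs` setting of `DIETClassifier` in the pipeline config.

```python
from itertools import product


def annotate(value: str, entity: str) -> str:
    """Wrap an entity value in Rasa's inline annotation syntax."""
    return f"[{value}]({entity})"


def build_examples(templates: list[str], values: list[str], entity: str) -> list[str]:
    """Fill every template with every (possibly multi-word) entity value."""
    return [
        template.format(entity=annotate(value, entity))
        for template, value in product(templates, values)
    ]


templates = ["I need a new {entity}", "how much does a {entity} cost"]
values = ["brake pad", "timing belt"]  # hypothetical multi-word entities
examples = build_examples(templates, values, "part")

for example in examples:
    print(f"- {example}")
# - I need a new [brake pad](part)
# - I need a new [timing belt](part)
# - how much does a [brake pad](part) cost
# - how much does a [timing belt](part) cost
```

Generated examples tend to be repetitive, so it is worth mixing them with real user messages rather than relying on templates alone.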

Yes, I am using both DIET and CRF. OK, I will try adding more examples so that DIET can handle multi-word entities. Does the accuracy depend on which language model (word vector model) we use, and on whether that model has vectors for such ‘multiple word entities’?