Best way to distinguish between close entities

I have 2 entities:

  1. One is a list of countries, about 100 or so, in the entity country, e.g. USA, India, etc.
  2. The other is a list of companies, about 50,000 names, in the entity company.

The challenge is that quite a few company names contain a country name, e.g. Google India, ABC CORP CHINA INC, etc.

Now entity identification often splits Google India into Google > Company and India > Country, or even picks up India on its own as a company name.

What is the best way to separately identify these 2 entities with reasonable accuracy?

Hey Dip,

I can’t give you a definite solution but there are several things you could try.

Given that one of the entities is country, you could use Duckling to extract that one and use DIET to extract company names (meaning that you would only annotate company names in your NLU data, not country names). Of course, DIET can still make mistakes and this will heavily depend on the annotated entity examples you train it on… Strictly speaking, there’s never a guarantee that the model will pick up an entity correctly if it hasn’t seen it during training.

Let me know if you have further questions. I’m also very curious if the Duckling/DIET “division of labour” trick will help you, please share your insights :)

Hi Sam,

Thanks for the suggestion. Actually there are a few things:

  1. The number of company examples is around 2.5 lakh (250,000), which I have as a lookup table for the model to train on.
  2. Out of these 2.5 lakh, around 1 lakh (100,000) account names contain the country name in some form, e.g. Google Canada is actually the name of the company.
  3. The country names number around 100, again present as a lookup table.

So the model now tends to treat names like Canada as a company name rather than a country, simply because there are far more company names containing Canada than there are country names.

Added to that is the fact that users will often not give the company name in the exact format, e.g. if the name is ABC USA CORP LTD, someone just types ABC USA LTD.

Also, by the way, did you mean spaCy to extract company names, and not Duckling? Because spaCy only recognizes popular ORG names.

Hi Dip,

Just eyeballing the numbers (which, as I understand, stand for the number of unique names), there are indeed many more company names than country names, and I’d expect DIET to pick up country names as company names if it sees companies way more often during training. Still, in this case, the issue is not that there are relatively few countries, but that there are relatively few intent examples containing country names. Perhaps you could try to address this by generating more examples with countries but not with companies? (Or by including some of these examples more than once in your training data.)
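As a rough illustration of the "more examples with countries but not with companies" idea, here is a minimal Python sketch that fills a few sentence templates with country names and writes them in Rasa's inline `[value](entity)` annotation. The templates, the country list, and the function names are all made up for this example; the `copies` parameter stands in for including the same example more than once.

```python
import random

# Hypothetical templates and country names; replace them with your own data.
TEMPLATES = [
    "show me accounts in {country}",
    "which offices do we have in {country}?",
    "I am based in {country}",
]
COUNTRIES = ["India", "Canada", "USA"]


def annotate(value, entity):
    """Wrap a value in the inline annotation format: [value](entity)."""
    return f"[{value}]({entity})"


def country_only_examples(templates, countries, copies=1, seed=0):
    """Build annotated examples that contain a country and no company.

    `copies` repeats every example, which mimics including the same
    example more than once in the training data.
    """
    examples = [
        template.format(country=annotate(country, "country"))
        for template in templates
        for country in countries
    ] * copies
    # Shuffle with a fixed seed so the output is reproducible.
    random.Random(seed).shuffle(examples)
    return examples


if __name__ == "__main__":
    for line in country_only_examples(TEMPLATES, COUNTRIES):
        print(f"- {line}")
```

The printed lines can be pasted under the relevant intent in the NLU data. How many extra examples are needed is something you would have to find out by retraining and evaluating.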

Going back to the original idea I proposed (using different entity extractors for the two types): I meant extracting company names with DIET, not with Duckling or spaCy. Country (which is the simpler type) would be extracted using an off-the-shelf extractor. Last time I mentioned Duckling, but that’s my bad. Could you possibly try using spaCy for the country name and DIET for company names? I think it’s easier than generating additional training examples or duplicating them in training.
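If two extractors are used side by side, their outputs can overlap, e.g. one returns "Google India" as a company while the other returns "India" as a country. One possible way to reconcile them is to keep the longest span wherever spans overlap. The sketch below is only an assumption about how that post-processing could look (for instance in a custom component or action); the entity dicts, the `span` helper, and the sample outputs are hypothetical and are not produced by a real pipeline here.

```python
def span(text, value, entity):
    """Build an entity dict for the first occurrence of `value` in `text`."""
    start = text.index(value)
    return {"start": start, "end": start + len(value), "entity": entity, "value": value}


def resolve_overlaps(entities):
    """Keep the longest span wherever extracted entities overlap.

    Each entity is a dict with "start", "end", "entity" and "value" keys
    (character offsets, end exclusive).
    """
    # Longest spans first; ties are broken by position in the text.
    by_length = sorted(entities, key=lambda e: (-(e["end"] - e["start"]), e["start"]))
    kept = []
    for candidate in by_length:
        overlaps = any(
            candidate["start"] < other["end"] and other["start"] < candidate["end"]
            for other in kept
        )
        if not overlaps:
            kept.append(candidate)
    # Return the surviving entities in reading order.
    return sorted(kept, key=lambda e: e["start"])


text = "compare Google India with our sales in Canada"
# Hypothetical outputs of the company extractor and the country extractor.
company_entities = [span(text, "Google India", "company")]
country_entities = [span(text, "India", "country"), span(text, "Canada", "country")]

if __name__ == "__main__":
    for entity in resolve_overlaps(company_entities + country_entities):
        print(entity["entity"], "->", entity["value"])
```

With this rule "India" inside "Google India" is dropped in favour of the longer company span, while a standalone "Canada" is still kept as a country. It only helps when the company extractor actually finds the full name, so it complements the training data rather than replacing it.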