I can’t give you a definite solution but there are several things you could try.
Given that one of the entities is country, you could use Duckling to extract that one and use DIET to extract company names (meaning that you would only annotate company names in your NLU data, not country names). Of course, DIET can still make mistakes and this will heavily depend on the annotated entity examples you train it on… Strictly speaking, there’s never a guarantee that the model will pick up an entity correctly if it hasn’t seen it during training.
Let me know if you have further questions. I’m also very curious whether the Duckling/DIET “division of labour” trick will help you, so please share your insights.
Thanks for the suggestion. Actually there are a few things:
The number of company examples is around 2.5 lakh (250,000), which I have as a lookup table for the model to train on.
Out of these 2.5 lakh, around 1 lakh account names contain a country name in some form, e.g. Google Canada is actually the name of a company.
There are around 100 country names, again provided as a lookup table.
So the model now tends to treat names like Canada as a company name rather than a country, simply because there are far more company names containing Canada than there are country names.
Added to this is the fact that users will often not give the company name in its exact format, e.g. if the name is ABC USA CORP LTD, someone just types ABC USA LTD.
Also, by the way, did you mean spaCy to extract company names and not Duckling? Because spaCy only recognizes popular ORG names.
Just eyeballing the numbers (which, as I understand, stand for the number of unique names), there are indeed many more company names than country names, and I’d expect DIET to pick up country names as company names if it sees companies way more often during training. Still, in this case, the issue is not that there are relatively few countries, but that there are relatively few intent examples containing country names. Perhaps you could try to address this by generating more examples with countries but not with companies? (Or by including some of these examples more than once in your training data.)
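To make the "more examples with countries but not with companies" idea concrete, here is a minimal sketch of how such examples could be generated. The templates and the country list are made up for illustration; only the `[value](entity)` annotation style is the usual Rasa NLU format. The `repeats` parameter covers the alternative of including the same examples more than once.

```python
import random
from typing import List

# Hypothetical templates: each has a country slot and deliberately no company slot.
TEMPLATES = [
    "show me accounts in {country}",
    "which customers are based in {country}?",
    "I need the report for {country}",
]

# Hypothetical subset of the ~100 countries from the lookup table.
COUNTRIES = ["Canada", "India", "Germany"]


def country_only_examples(
    countries: List[str],
    templates: List[str],
    repeats: int = 1,
    seed: int = 0,
) -> List[str]:
    """Build annotated examples that contain a country entity but no company.

    `repeats` duplicates every example, which is the simple oversampling
    option mentioned above.
    """
    examples = [
        template.format(country=f"[{country}](country)")
        for country in countries
        for template in templates
    ]
    examples = examples * repeats
    # Shuffle with a fixed seed so the output is reproducible.
    random.Random(seed).shuffle(examples)
    return examples


examples = country_only_examples(COUNTRIES, TEMPLATES, repeats=2)
for line in examples[:3]:
    # Printed as list items so they can be pasted under an intent's examples.
    print(f"- {line}")
```

How many extra examples are needed is something you would have to find out by retraining and checking the predictions on held-out messages.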
Going back to the original idea I proposed (using different entity extractors for the two types): I meant extracting company names with DIET, not with Duckling or spaCy. Country (which is the simpler type) would be extracted using an off-the-shelf extractor. Last time I mentioned Duckling but that’s my bad. Could you possibly try using spaCy for the country name and DIET for company names? I think it’s easier than generating additional training examples or duplicating them in training.
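Purely as an illustration of the division of labour, the sketch below combines the output of two extractors in plain Python. It is not Rasa code: the entity dictionaries, the labels `company` and `country`, and the rule that drops a country found inside a company span (e.g. the "Canada" in "Google Canada") are all assumptions made for the example. In a real pipeline the two extractors would simply both be listed in the pipeline configuration.

```python
from typing import Any, Dict, List

Entity = Dict[str, Any]  # assumed keys: "entity", "start", "end", "value"


def keep_type(entities: List[Entity], entity_type: str) -> List[Entity]:
    """Keep only the entity type a given extractor is responsible for."""
    return [e for e in entities if e["entity"] == entity_type]


def contains(outer: Entity, inner: Entity) -> bool:
    """True if the span of `inner` lies completely inside the span of `outer`."""
    return outer["start"] <= inner["start"] and inner["end"] <= outer["end"]


def combine(learned: List[Entity], off_the_shelf: List[Entity]) -> List[Entity]:
    """Take companies from the learned extractor and countries from the other.

    Assumption: a country that sits inside a company span is part of the
    company name, so it is dropped.
    """
    companies = keep_type(learned, "company")
    countries = [
        c
        for c in keep_type(off_the_shelf, "country")
        if not any(contains(company, c) for company in companies)
    ]
    return sorted(companies + countries, key=lambda e: e["start"])


text = "Does Google Canada ship to Canada?"

# Hand-written stand-ins for what the two extractors might return.
company_start = text.index("Google Canada")
learned = [
    {"entity": "company", "start": company_start,
     "end": company_start + len("Google Canada"), "value": "Google Canada"},
]
first = text.index("Canada")
second = text.index("Canada", first + 1)
off_the_shelf = [
    {"entity": "country", "start": first, "end": first + 6, "value": "Canada"},
    {"entity": "country", "start": second, "end": second + 6, "value": "Canada"},
]

result = combine(learned, off_the_shelf)
for entity in result:
    print(entity["entity"], entity["value"], entity["start"], entity["end"])
```

Whether the nested country should really be dropped depends on what your bot needs, so treat that rule as one possible choice rather than a recommendation.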