Based on this, it seems that examples of the entities in your lookup table need to be present in the NLU training data to teach the model that the lookup table matters when classifying entities. My question is: must every intent contain every example entity?
If that’s the case, then I don’t understand the purpose of the lookup table, since every example would already be in every intent. Unless the lookup table itself has no influence on intent prediction, meaning you could get away with putting all the example entities in one intent and giving all the other intents the same single example entity value. Is this also the case?
The lookup table is processed as a large regular expression. As long as you’ve provided enough training data for the model to recognize an unseen word as an entity (due to, e.g., the words around it), the feature from the lookup table will be used.
To answer your question, any intent that uses that entity should have enough entity examples to make sure that unseen values are still extracted as entities. For example, say you have a lookup file of country names. You do not need to include all ~200 of them in every intent that extracts country, but you should have ~20 examples or so, ideally with varied entity values, so that you can teach your bot that “my home country is x” should extract x as a country. The same goes for any other intent – you should provide varied examples there too, so that, e.g., “i want to send it to x” also has x extracted for that intent.
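As a concrete illustration, here is a minimal sketch of what that training data could look like in the Markdown NLU format used by older Rasa versions. The intent names, the `country` entity, and the lookup file path are invented for this example; the point is that each intent that extracts the entity gets several varied annotated examples, while the lookup table carries the full list of values.

```md
## intent:query_home_country
- my home country is [France](country)
- my home country is [Japan](country)
- i grew up in [Brazil](country), which is my home country

## intent:send_package
- i want to send it to [Germany](country)
- please send this package to [Canada](country)
- can you ship it to [Kenya](country)

## lookup:country
data/lookups/countries.txt
```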
I’ll leave this link here as an additional reference. The passage below mentions something important about the entity-extraction aspect of lookup tables.
> Regular expressions and lookup tables are adding additional features to ner_crf which mark whether a word was matched by a regular expression or lookup table entry. As it is one feature of many, the component ner_crf can still ignore an entity although it was matched, however in general ner_crf develops a bias for these features. Note that this can also stop the conditional random field from generalizing: if all entity examples in your training data are matched by a regular expression, the conditional random field will learn to focus on the regular expression feature and ignore the other features. If you then have a message with a certain entity which is not matched by the regular expression, ner_crf will probably not be able to detect it. Especially the use of lookup tables makes ner_crf prone to overfitting.
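For context, the “additional features” mentioned above come from the regex featurizer that runs before ner_crf and marks tokens matched by a regex or lookup-table entry; ner_crf then consumes that mark through its `pattern` feature. A rough pipeline sketch is below – the component names reflect older Rasa NLU versions where the extractor was still called ner_crf, so treat it as an assumption rather than a drop-in config.

```yaml
# Hypothetical config.yml sketch for an older Rasa NLU pipeline.
# intent_entity_featurizer_regex flags tokens matched by a regex or
# lookup-table entry; ner_crf sees that flag via the "pattern" feature.
language: "en"
pipeline:
- name: "tokenizer_whitespace"
- name: "intent_entity_featurizer_regex"
- name: "ner_crf"
  features: [["low", "title"],
             ["bias", "low", "prefix2", "suffix2", "digit", "pattern"],
             ["low", "title"]]
- name: "intent_featurizer_count_vectors"
- name: "intent_classifier_tensorflow_embedding"
```

Because `pattern` is only one feature among many, the CRF can still weigh the surrounding words, which is why varied training examples remain important.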
Will the ‘x’ be replaced with the OOV token if it is not seen in the training data? If so, won’t that impact intent prediction when there is a dedicated intent for handling out-of-vocabulary text? In that case the predicted intent would not be the expected one, e.g. ‘QueryHomeCountryIntent’, but the wrong intent, ‘OutOfScopeIntent’.
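For reference, the OOV behaviour I mean is the one configured on the count-vectors featurizer, roughly as in the sketch below; the token and word values are just placeholders, not a recommended setup.

```yaml
# Hypothetical snippet: words not seen during training are replaced with
# OOV_token, and training examples containing OOV_words/OOV_token can pull
# such messages toward a dedicated out-of-scope intent.
pipeline:
- name: "intent_featurizer_count_vectors"
  OOV_token: "oov"
  OOV_words: ["blah", "asdf"]
- name: "intent_classifier_tensorflow_embedding"
```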
Is there a way to work around this?