I have 3 entities, and for each entity I have a list of defined items. I used a lookup table, but I did not get good results even though all of the entities mentioned in the user utterances exist in the tables. Therefore, I want to check how well the synonyms feature performs.
In my bot, I have an entity called cities, which includes the city name and its synonyms. For example:
Value: New York, Synonyms: NYC and newyork
Value: Los Angeles, Synonyms: LA and losangeles
My question is: do I need to include all the entity values or their synonyms in the training data? For example:
book me a ticket to nyc
reserve me a ticket to New York
...
Do I need to include ‘Los Angeles’ in the training set in order to define the entity? Or am I missing something?
I’m not sure I understood correctly what you mean. Synonyms are used to override the value of the extracted entity after NER. To improve performance, it is better to include as many variations as possible.
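To illustrate the point about synonyms being applied after NER, here is a minimal Python sketch. It is not Rasa's actual implementation; the function names, the dict layout, and the entity name CityNames are assumptions made for illustration. It shows that a synonym table only rewrites the value of an entity that the extractor has already found, so it cannot help when the extractor misses the entity in the first place.

```python
from typing import Dict, List


def build_synonym_index(table: Dict[str, List[str]]) -> Dict[str, str]:
    """Map every lower-cased surface form to its canonical value."""
    index: Dict[str, str] = {}
    for value, synonyms in table.items():
        index[value.lower()] = value
        for synonym in synonyms:
            index[synonym.lower()] = value
    return index


def normalize_entities(entities: List[dict], index: Dict[str, str]) -> List[dict]:
    """Replace each extracted value with its canonical form, if one is known.

    Entities that were never extracted are not in `entities`, so nothing
    here can recover them.
    """
    normalized = []
    for entity in entities:
        canonical = index.get(str(entity["value"]).lower(), entity["value"])
        normalized.append({**entity, "value": canonical})
    return normalized


# Hypothetical table based on the examples in this thread.
SYNONYM_INDEX = build_synonym_index({
    "New York": ["NYC", "newyork"],
    "Los Angeles": ["LA", "losangeles"],
})

# Pretend the entity extractor already found "nyc" in the utterance.
extracted = [{"entity": "CityNames", "value": "nyc"}]
result = normalize_entities(extracted, SYNONYM_INDEX)
print(result)
```

If the extractor returns an empty list for an utterance, the mapping step has nothing to work on, which is why the variations still need to appear in the training examples.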
Let me rephrase the problem. I have 2 entities: (1) CityNames and (2) Dishes.
Initially, I used the lookup table feature, since I have a specific list of defined entity values. Unfortunately, the results were not that good.
Therefore, I switched to entity synonyms. For example, in the CityNames entity I have the following:
Value: New York, Synonyms: NYC and newyork
Value: Los Angeles, Synonyms: LA and losangeles
...
My question: in the training set, do I need to include all the surface forms for all the values (e.g. nyc, LA, etc.)? If yes, then the intents that include the CityNames entity will have far more training utterances than the intents without entities. Will the model be biased as a result? I hope this clarifies the issue.
But if I train the model on the synonyms of all the values I have, I will most probably get an overfitted model. Does this sound correct, or am I missing something?
It might be, but if you include all values, there is nothing left to overfit to. It is better to try it first before drawing hypothetical conclusions.
@Ghostvv I trained the model on 13 intents, 9 of which have CityNames as an entity. For the intents that have CityNames as an entity, I included all the values (with their synonyms) in the training set. For example:
“book a ticket to LA” (and I list all the synonyms of LA)
“book a ticket to NYC” (and I list all the synonyms of NYC)
“book a ticket to FL” (and I list all the synonyms of FL)
and so on
When I tested the model, it was heavily biased, because the training sets of the intents that have the CityNames entity are large: their examples are duplicated except for the entity value (in order to include all the CityNames values), as shown in the example above.
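The imbalance described here can be made concrete with a small Python sketch. Everything in it is hypothetical: the intent names book_ticket and greet, the templates, and the helper functions are assumptions for illustration, not part of any library. The first helper reproduces the "every template times every surface form" expansion. The second samples a fixed number of surface forms per template, which is one option that could be tried to keep the intents closer in size; whether it helps in this case is untested.

```python
import random
from itertools import product
from typing import List


def expand(templates: List[str], surface_forms: List[str]) -> List[str]:
    """Full cross product: every template paired with every surface form."""
    return [t.format(city=c) for t, c in product(templates, surface_forms)]


def expand_sampled(templates: List[str], surface_forms: List[str],
                   per_template: int, seed: int = 0) -> List[str]:
    """Pair each template with only `per_template` randomly chosen forms."""
    rng = random.Random(seed)
    examples = []
    for template in templates:
        for city in rng.sample(surface_forms, per_template):
            examples.append(template.format(city=city))
    return examples


# Hypothetical data modeled on the examples in this thread.
city_forms = ["New York", "NYC", "newyork", "Los Angeles", "LA", "losangeles"]
book_templates = ["book a ticket to {city}", "reserve me a ticket to {city}"]
greet_examples = ["hello", "hi there"]  # an intent without entities

examples_per_intent = {
    "book_ticket": expand(book_templates, city_forms),
    "greet": greet_examples,
}
counts = {intent: len(examples) for intent, examples in examples_per_intent.items()}
print(counts)  # 2 templates x 6 surface forms = 12 examples, versus 2 for greet

capped = expand_sampled(book_templates, city_forms, per_template=2)
print(len(capped))  # 2 templates x 2 sampled forms = 4 examples
```

With a real list of cities and synonyms, the cross product grows with every added value, while the intents without entities stay the same size.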