My aim was to add the lookup tables because I don’t have a lot of intents, and to avoid overfitting I thought it would make more sense to use them to improve entity extraction.
Also, as stated in the documentation, if you misspell words that appear in the lookup table they are not extracted; when I remove the lookup tables, they can be detected despite the spelling error.
I’m adding lookup tables with the names of hospitals and the names of cities; each city has a unique name. I could add these values to sentences in the training data, but I would have to insert them along with a sentence, and I thought it was bad to insert the same sentence over and over, as it could lead to “overfitting” for that intent: the bot would only recognise that intent with that particular sentence. I’m not sure if that’s correct.
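For reference, my lookup tables are declared in the training data roughly like this (the entity name and values below are just placeholders for my real lists):

```md
## lookup:city
- Torino
- Milano
- Napoli
```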
This is a paragraph from the article about lookup tables,
Entity extraction with the new lookup feature in Rasa NLU:
“As designed right now, lookup tables only match phrases when an exact match is found. However, this is a potential problem when dealing with typos, different word endings (like pluralization), and other sources of noise in your data. “Fuzzy matching” is a promising alternative to manually adding each of the possible variations of each entity to the lookup table."
I’m not totally familiar with Rasa Core’s implementation, but I believe the following is what’s happening in your case:
If you don’t use lookup tables, ner_crf will be somewhat successful at generalizing some of your city and hospital names if you provide enough samples. As you mentioned, though, if you had 10k cities it would be a bad idea to add 10k sample utterances. To avoid this, you should create your few dozen training samples with well-chosen variation: make sure you e.g. don’t use names twice, and aim for just enough, but not too much, variation in sentence structure as well, so that the algorithm can pick up what’s really important.
If you choose to add a lookup table, essentially a binary feature will be added for every item that indicates whether the token matches that pattern. Therefore, if you have only a few sample utterances in your training file but a large list of names, I guess the algorithm learns that these binary features are way more important than the sentence structure etc.; thus, it will stop working for spelling mistakes and for all items that are not on the table.
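To make that concrete, here’s a toy sketch of such a binary match feature (my own simplification, not Rasa’s actual implementation; the city list is a placeholder):

```python
# Toy illustration of a lookup-table match feature (not Rasa's real code):
# each token gets a binary flag that fires only on an exact match.
lookup = {"torino", "milano", "napoli"}  # placeholder lookup list

def pattern_features(tokens):
    # One feature dict per token; 1 only when the lowercased token
    # exactly matches an entry in the lookup list.
    return [{"token": t, "in_lookup": int(t.lower() in lookup)} for t in tokens]

print(pattern_features(["I", "live", "in", "Torino"]))  # flag fires on "Torino"
print(pattern_features(["I", "live", "in", "Torinu"]))  # the typo gets no signal at all
```

With a huge list and only a few sample sentences, this one feature can dominate everything else, which would explain why typos stop being extracted.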
I’d probably go without a lookup table unless you really have everything you need to extract in a list. If you had all the options in a list, what you could do is:
- Make sure you remove troublesome examples, e.g. if a hospital is named (HEALTH), which might coincide with non-hospital entities
- Create common spelling mistakes yourself, e.g. you could also add XYZ hopsital to your list
If you don’t use the lookup table, you could try to optimize for sentence structure, e.g. by adding POS features to ner_crf, and maybe implement some logic in a custom action that checks whether the extracted entity “fuzzy” matches an item on your list.
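For the fuzzy-matching part, a minimal sketch with Python’s standard library (the city list and the cutoff are placeholders you’d tune; dedicated libraries like fuzzywuzzy give you more control):

```python
import difflib

CITIES = ["Torino", "Milano", "Napoli", "Firenze"]  # placeholder list

def fuzzy_match(extracted, choices=CITIES, cutoff=0.8):
    """Return the closest known city for a possibly misspelled entity, or None."""
    matches = difflib.get_close_matches(extracted, choices, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(fuzzy_match("Torinu"))  # -> "Torino" (close enough to the real name)
print(fuzzy_match("Paris"))   # -> None (nothing on the list is similar)
```

You could call something like this from a custom action after extraction, so a typo like “Torinu” still maps back to an item on your list.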
Thanks, I was just checking; my pipeline no longer includes the lookup tables, after I realised they didn’t work well with the NER-CRF component.
Are you saying that if I add more training sentences, the entity extraction component would be able to identify new values that are not contained in the lookup tables with better accuracy?
I noticed a couple of things from testing the two pipelines. With the lookup tables, if I made a spelling error when writing the name of a city (e.g. Torinu instead of Torino, which was one of the cities I wrote in one of my utterances), the NLU would not recognise it.
Without the lookup tables, it could.
Could you explain further about the POS features? My entity extractor is working quite well now however I am working with a small number of Intents and Entities and I may need to add it later on.
If you choose your training samples wisely, I think it could improve your accuracy. Though I don’t know how well your training samples are chosen already. E.g. one simple thing to look out for is to not include the same name in your samples multiple times, such that the algorithm doesn’t pick up that specific name, but rather the overall intent.
That is what I was trying to say, by adding lookup tables, the algorithm starts “memorizing” the values from the list, so it gets good at recognizing “Torino”, but since “Torinu” is not on the list, it can’t learn it. Whereas without lookup tables, your algorithm probably does a better job at not only paying attention to the token, but also sentence structure.
What I mean is that you can add Part-of-Speech (POS) tags to your ner_crf; they may help it pick up cities from sentence structure rather than memorization. In the ner_crf feature list in your config, you could append “pos” and “pos2” after “pattern” to include POS tags for your tokens. Then ner_crf might learn, for example, to look out for proper nouns (names).
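A config sketch of what that could look like (based on the ner_crf defaults as I remember them; double-check against your Rasa NLU version, and note that the POS features require the spaCy tokenizer):

```yaml
pipeline:
- name: "tokenizer_spacy"      # POS tags come from spaCy
- name: "ner_crf"
  features: [
    ["low", "title", "upper"],
    ["low", "bow", "title", "upper", "digit", "pattern", "pos", "pos2"],
    ["low", "title", "upper"]
  ]
```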
Thanks a lot, that helped clarify things. I did use various cities in my training sentences, but I think there was not a lot of variety between the sentences themselves.
Since we are trying to generate our training data automatically from templates, so that we can easily deploy query chatbots for several different databases, I don’t think it will be easy to generate a wide variety of sentences. We did introduce an algorithm into our sentence generator to filter out repetitions and remove sentences that were very similar; I’m not sure if that will be enough, but for now it seems to work.
I’m just trying to understand what would be the best pipeline to use for an Italian-language NLU.
My data will be imported automatically into a database, which will use templates to create a training set. It’s not ideal, but we are trying our hardest to develop a chatbot that can do basic tasks without needing to be trained manually.
I have a pipeline that I am currently going to stick to; however, if NER-CRF can work efficiently with lookup tables, I will try to include them later on.
Did you also notice this behaviour: usually both context words (before and after the entity) have to match your test sentence for the entity to be recognized too? If just the word before or after matches, it barely works… I just use the n-grams of those context words as features (not the entity itself), hoping for generalization…
I thought of something like that too. What are you looking for? Do you count frequencies of the context words for NER_CRF, or do you do topic modelling for the intent algorithm?
@akelad could you please verify @smn-snkl’s statements about the lookup table? I.e. that with a large list vs. a small number of examples in the training data, the algorithm learns to rely more on the lookup table than on other features?
“Do you count frequencies of the context words for NER_CRF, or do you do topic modelling for the intent algorithm?”
For now my main concern is the NER_CRF. The algorithm is something simple that seems to be quite effective; it is implemented outside the pipeline, before the training data is introduced into the chatbot for training. How do you count the context words for the NER_CRF?
It doesn’t count anything; it simply transforms a template of values (for example, a sentence structure divided into four sections with different values for each section) into a matrix form, removing duplicates when they are found.
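Roughly, the idea looks like this (a toy sketch, not our actual generator; the template values are made up): expand every combination of the sections and drop any sentence already seen.

```python
from itertools import product

# Toy template: four sections, each with alternative values.
sections = [
    ["I want", "I'd like"],
    ["to book", "to reserve"],
    ["a visit at", "an appointment at"],
    ["Ospedale San Giovanni", "Ospedale Molinette"],
]

def expand(sections):
    seen = set()
    sentences = []
    for combo in product(*sections):  # every combination of section values
        sentence = " ".join(combo)
        if sentence not in seen:      # duplicate filter
            seen.add(sentence)
            sentences.append(sentence)
    return sentences

print(len(expand(sections)))  # 16 unique generated sentences
```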