I’ve been looking around a lot but can’t find much information on this topic. Is there an established methodology for handling aliases and spelling mistakes in, for example, names?
At the moment I have all employee names in a lookup table, which works well as long as nothing is misspelled. For aliases I’m thinking I can just add common aliases to the lookup table, but for spelling mistakes I really have no idea how to handle them.
I think this is a case where you could use your own pipeline component right at the start of your NLU pipeline. This component could use the fuzzywuzzy Python library to match against a lookup table of staff names. If there is a clear winning match against one staff name, it could substitute that name into the input before passing it on to the rest of the pipeline.
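To make the idea concrete, here is a minimal sketch of the matching step. I’m using Python’s stdlib `difflib` as a stand-in for fuzzywuzzy so it runs without extra dependencies; in a real component you’d call fuzzywuzzy’s `process.extractOne()` instead of the `best_match()` helper below. The staff names and the threshold are made-up examples:

```python
from difflib import SequenceMatcher

# Made-up lookup table of allowed staff names.
STAFF_NAMES = ["Alice Johnson", "Bob Smith", "Carol Danvers"]

def best_match(text, choices, threshold=0.8):
    """Return the closest choice if its similarity beats the threshold, else None.
    (Stand-in for fuzzywuzzy's process.extractOne with a score_cutoff.)"""
    scored = [(SequenceMatcher(None, text.lower(), c.lower()).ratio(), c)
              for c in choices]
    score, match = max(scored)
    return match if score >= threshold else None

print(best_match("Alice Jonson", STAFF_NAMES))  # close misspelling matches
print(best_match("Zebra", STAFF_NAMES))         # no clear winner
```

If `best_match()` returns a name, the component substitutes it into the input text; if it returns `None`, the input is passed through unchanged so the rest of the pipeline sees exactly what the user typed.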
Something like fuzzywuzzy is fine for spelling errors and ad-hoc variations, but there are times when I know all the common aliases and those aliases aren’t necessarily a close Levenshtein distance from the canonical name. For example, I have a list of drug names and medical procedures that I want recognized/extracted as entities. Many of these drugs and procedures have multiple aliases with a variety of spellings. Dialogflow handles this with the way it lets you define entities and aliases. I guess I could list them all in the entity lookup table and then map them to the canonical entity label myself, but it would be nice if the entity lookup handled all that so it can be maintained in one place.
What @akshay2000 mentioned is exactly my question as well: I would need to recognize entities that are close enough to entries in the lookup table. Ideally this component would also replace the misspelled entity with the correctly spelled one. I looked into the fuzzy tool and it offers the ability to calculate the distance between two words, but how would this work concretely within Rasa NLU?
Thinking aloud: a potential approach would be to make a custom pipeline component that sits at the end of the NLU pipeline. Should the input map through to a specific intent such as /inform_staff_name that pulls out the name entity (even if it is spelled incorrectly), the pipeline component could take the entity value, use FuzzyWuzzy’s process.extract() or process.extractOne() methods (link) to find the best match from a lookup table of allowed names, and then replace the extracted entity with the value from the lookup table.
Here’s a link to a gist that shows how to write a custom pipeline component that you could use as the basis of your own pipeline component.
The misspelled value of the entity will be made available to your pipeline processing method via the message parameter that is passed to it (cf. line 70 of the gist). You can extract the misspelled value in a similar way to line 75 of the gist I posted above. You can then use this as the query you pass to FuzzyWuzzy’s process.extract() method to get the closest correctly spelled value, then overwrite the entity value and adjust the start and end positions.
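As a rough sketch of that entity-correction step: in a real Rasa NLU component this would live inside the component’s `process(message)` method, but here a plain dict stands in for the Message object and stdlib `difflib` stands in for fuzzywuzzy’s `process.extractOne()`. All names below are illustrative assumptions, not Rasa API guarantees:

```python
from difflib import SequenceMatcher

# Made-up lookup table of allowed staff names.
ALLOWED_NAMES = ["Alice Johnson", "Bob Smith"]

def correct_entity(entity, choices, threshold=0.8):
    """Overwrite a misspelled entity value with the closest lookup value.
    Only the end position changes, since the corrected value is written
    back starting at the same offset."""
    scored = [(SequenceMatcher(None, entity["value"].lower(), c.lower()).ratio(), c)
              for c in choices]
    score, match = max(scored)
    if score >= threshold:
        entity["value"] = match
        entity["end"] = entity["start"] + len(match)
    return entity

# A fake extracted entity, shaped like what ner_crf might produce:
entity = {"entity": "staff_name", "value": "Bob Smth", "start": 8, "end": 16}
print(correct_entity(entity, ALLOWED_NAMES))
```

If no lookup value clears the threshold, the entity is left untouched, so downstream actions can still decide how to handle an unrecognized name.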
I haven’t filtered based on the intent before, so I’m guessing that the extracted intent is passed as part of the message; you’ll need to research that yourself if you want to experiment with this idea.
Which component is extracting this value? From what I have observed, ner_crf sometimes fails to extract the entity at all. I’m clear about the rest of the approach: matching extracted values to the exact lookup values is trivial and can even be done on the actions side.