NLU look-up tables: how to handle aliases/spelling mistakes

Hey there folks,

I’ve been looking around a lot but can’t seem to find much information on this topic. Is there a methodology for handling aliases and spelling mistakes in, for example, names?

At the moment I have all employee names in a look-up table, which works amazingly well as long as you don’t misspell a word. For aliases I’m thinking I can just add the common ones to the look-up table so they get recognized, but for spelling mistakes I really have no clue how to handle them.

Thanks in advance for all the help!

Either add spelling mistakes to your training data, or write a component to recognize them. You could also try a char-level count vectorizer.
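A quick sketch of why character-level features can help, using scikit-learn’s CountVectorizer directly rather than Rasa’s built-in featurizer (the equivalent pipeline setting depends on your Rasa version, so check its docs); a misspelling shares most of its character n-grams with the correct spelling, so the feature vectors stay close:

```python
# Sketch: character n-gram features make misspellings look similar.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

vectorizer = CountVectorizer(analyzer="char_wb", ngram_range=(2, 4))
X = vectorizer.fit_transform(["my name is Akshay", "my name is Askhay"])

# The two sentences share most of their character n-grams,
# so their feature vectors are close despite the typo.
print(cosine_similarity(X[0], X[1]))
```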

I think this is a case where you could use your own pipeline component right at the start of your NLU pipeline. This component could use the fuzzywuzzy Python library to match against a lookup table of staff names. If there is a clear winning match against one staff name, it could substitute that into the input before passing it on to the rest of the pipeline.
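A minimal sketch of that idea, assuming the Rasa 1.x custom-component API from the era of this thread (newer Rasa versions use a different interface), and a made-up STAFF_NAMES lookup and score threshold:

```python
# Sketch of a spelling-corrector component placed first in the NLU pipeline.
# Assumes the classic rasa.nlu.components.Component API; STAFF_NAMES and the
# threshold of 85 are illustrative assumptions, not part of the original post.
from fuzzywuzzy import process
from rasa.nlu.components import Component

STAFF_NAMES = ["Akshay", "Sam", "Steve", "Vladimir"]  # your lookup table


class NameSpellingCorrector(Component):
    """Replace close misspellings of staff names in the raw message text."""

    def process(self, message, **kwargs):
        corrected_tokens = []
        for token in message.text.split():
            match, score = process.extractOne(token, STAFF_NAMES)
            # Only substitute when there is a clear winner.
            corrected_tokens.append(match if score >= 85 else token)
        message.text = " ".join(corrected_tokens)
```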


Something like fuzzywuzzy is fine for spelling errors and ad-hoc variations, but there are times when I know all the common aliases, and those aliases are not necessarily a close Levenshtein distance away. For example, I have a list of drug names and medical procedures and I want them to be recognized/extracted as entities. Many of these drugs and procedures have multiple aliases with a variety of spellings. Dialogflow handles this through the way it lets you define entities and their aliases. I guess I could list them all in the entity lookup table and then map them myself to the canonical entity label, but it would be nice if the entity lookup handled all that, so it could be maintained in one place.

For entities you can define synonyms.
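For reference, a minimal sketch of what entity synonyms look like in the legacy Rasa NLU JSON training-data format, written here as a Python dict (the intent and product names are invented). Note that synonym mapping normalizes a value after an entity has been extracted; it does not by itself make the extractor recognize unseen misspellings:

```python
# Sketch of the legacy Rasa NLU JSON training-data structure with synonyms.
# The intent name, product names, and misspellings are made up.
training_data = {
    "rasa_nlu_data": {
        "common_examples": [
            {
                "text": "I want an iPhne",
                "intent": "order_phone",
                # The canonical value is "iPhone" even though the text says "iPhne".
                "entities": [
                    {"start": 10, "end": 15, "value": "iPhone", "entity": "product"}
                ],
            }
        ],
        "entity_synonyms": [
            {"value": "iPhone", "synonyms": ["iPhne", "i-phone"]}
        ],
    }
}
```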

How does this work though? What word would we match against the lookup table? Wouldn’t a misspelled entity not be recognized at all?

A very simplified example could be this:
Lookup: [Akshay, Sam, Steve, Vladimir]
User input: My name is Askhay.

Here, the misspelled entity Askhay won’t be extracted. How do we match it?
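For what it’s worth, the fuzzy comparison itself is straightforward once you have a candidate string; the open question in this thread is where that candidate comes from if the extractor misses the entity entirely. A quick fuzzywuzzy sketch using the example above:

```python
from fuzzywuzzy import process

lookup = ["Akshay", "Sam", "Steve", "Vladimir"]

# extractOne returns the closest choice and its similarity score (0-100).
match, score = process.extractOne("Askhay", lookup)
print(match, score)  # -> "Akshay" with a high score; the exact number depends on the scorer
```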

What @akshay2000 mentioned is exactly my question as well: I would need to recognize entities that are close enough to entries in the look-up table. Ideally this component would also replace the misspelled entity with the correctly spelled one. I looked into the fuzzy tool, and it offers the ability to calculate the distance between two words, but how will this work concretely within Rasa NLU?

@netcarver @grandlogic

@Arno

Thinking aloud: a potential approach would be to make a custom pipeline component that sits at the end of the NLU pipeline. Should the input map to a specific intent such as /inform_staff_name that pulls out the name entity (even if it is spelled incorrectly), then the pipeline component could take the entity value and use FuzzyWuzzy’s process.extract() or process.extractOne() methods (link) to find the best match from a lookup table of allowed names, and would then replace the extracted entity with the value from the lookup table.

Here’s a link to a gist that shows how to write a custom pipeline component, which you could use as the basis of your own.


I think one of us is missing something here. How do we extract the misspelled value in the first place?

This part is not very clear. Entity extraction works independently of intent classification.

Hi Akshay,

The misspelled value of the entity will be made available to your component’s processing method via the message parameter that is passed to it (cf. line 70 of the gist). You can extract the misspelled value in a similar way to line 75 of the gist I posted above. You can then use this as the key you pass to FuzzyWuzzy’s process.extract() method to get the closest correctly spelled value, and then overwrite the entity value and adjust the start and end positions.

I haven’t filtered based on the intent before, so I’m guessing that the extracted intent is passed as part of the message; you’ll need to research that yourself if you want to experiment with this idea.

Hope that helps.
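To make that concrete, here is a hedged sketch assuming the Rasa 1.x Component API, a hypothetical entity name staff_name, a made-up ALLOWED_NAMES lookup, and an arbitrary score threshold. It overwrites the extracted value with the closest lookup entry while leaving the start/end offsets pointing at the original span in the text:

```python
# Sketch: place this component at the end of the NLU pipeline so it runs
# after the entity extractor. The entity name "staff_name", ALLOWED_NAMES,
# and the threshold of 85 are illustrative assumptions.
from fuzzywuzzy import process
from rasa.nlu.components import Component

ALLOWED_NAMES = ["Akshay", "Sam", "Steve", "Vladimir"]


class NameEntityCorrector(Component):
    """Replace misspelled staff_name entity values with the closest lookup entry."""

    def process(self, message, **kwargs):
        entities = message.get("entities", [])
        for entity in entities:
            if entity.get("entity") != "staff_name":
                continue
            match, score = process.extractOne(entity["value"], ALLOWED_NAMES)
            if score >= 85:
                # Overwrite the (possibly misspelled) value with the lookup value.
                # The start/end offsets still refer to the original span in the
                # raw text, since the text itself is not changed here.
                entity["value"] = match
        message.set("entities", entities, add_to_output=True)
```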

Which component is extracting this value? From what I have observed, ner_crf fails to extract the entity. I am clear about the rest of it, but I’m pretty sure ner_crf sometimes doesn’t extract the entities at all. Matching those to the exact values is trivial and can even be done on the actions side.

@Ghostvv This sounds like the correct answer for general spelling mistakes:

or write a component to recognize them.

Thank you @Juste for this:

And then use the synonym tables built into Rasa as the entity synonym feature, e.g. mapping “iPhne” to “iPhone”, etc.

There are also some good references in another thread from @tyd:

Not sure if this simple sample is good (can anyone confirm)?