I’ve been looking around a lot but can’t find much information on this topic. Is there an established methodology for handling aliases and spelling mistakes in, for example, names?
At the moment I have all employee names in a lookup table, which works well as long as nothing is misspelled. For aliases I’m thinking I can just add common aliases to the lookup table, but for spelling mistakes I really have no idea how to handle them.
I think this is a case where you could use your own pipeline component right at the start of your NLU pipeline. This component could use the fuzzywuzzy Python library to match against a lookup table of staff names. If there is a clear winning match against one staff name, it could substitute that name into the input before passing it on to the rest of the pipeline.
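To make the idea concrete, here is a minimal sketch of the matching step. I’m using Python’s stdlib `difflib` as a stand-in for fuzzywuzzy so it runs without extra dependencies; in a real component you’d call fuzzywuzzy’s `process.extractOne()` instead of the `best_match()` helper below. The staff names and the threshold are made-up examples:

```python
from difflib import SequenceMatcher

# Made-up lookup table of allowed staff names.
STAFF_NAMES = ["Alice Johnson", "Bob Smith", "Carol Danvers"]

def best_match(text, choices, threshold=0.8):
    """Return the closest choice if its similarity beats the threshold, else None.
    (Stand-in for fuzzywuzzy's process.extractOne with a score_cutoff.)"""
    scored = [(SequenceMatcher(None, text.lower(), c.lower()).ratio(), c)
              for c in choices]
    score, match = max(scored)
    return match if score >= threshold else None

print(best_match("Alice Jonson", STAFF_NAMES))  # close misspelling matches
print(best_match("Zebra", STAFF_NAMES))         # no clear winner
```

If `best_match()` returns a name, the component substitutes it into the input text; if it returns `None`, the input is passed through unchanged so the rest of the pipeline sees exactly what the user typed.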
Something like fuzzywuzzy is fine for spelling errors and ad-hoc variations, but there are times when I know all the common aliases and those aliases aren’t necessarily a close Levenshtein distance from the canonical name. For example, I have a list of drug names and medical procedures that I want recognized/extracted as entities. Many of these drugs and procedures have multiple aliases with a variety of spellings. Dialogflow handles this with the way it lets you define entities and aliases. I guess I could list them all in the entity lookup table and then map them to the canonical entity label myself, but it would be nice if the entity lookup handled all that so it can be maintained in one place.
What @akshay2000 mentioned is exactly my question as well: I would need to recognize entities that are close enough to entries in the lookup table. Ideally this component would also replace the misspelled entity with the correctly spelled one. I looked into the fuzzy tool and it offers the ability to calculate the distance between two words, but how would this work concretely within Rasa NLU?
Thinking aloud: a potential approach would be to make a custom pipeline component that sits at the end of the NLU pipeline. Should the input map through to a specific intent such as /inform_staff_name that pulls out the name entity (even if it is spelled incorrectly), the pipeline component could take the entity value, use FuzzyWuzzy’s process.extract() or process.extractOne() methods (link) to find the best match from a lookup table of allowed names, and then replace the extracted entity with the value from the lookup table.
Here’s a link to a gist that shows how to write a custom pipeline component that you could use as the basis of your own pipeline component.
The misspelled value of the entity will be made available to your pipeline processing method via the message parameter that is passed to it (cf. line 70 of the gist). You can extract the misspelled value in a similar way to line 75 of the gist I posted above. You can then use this as the query you pass to FuzzyWuzzy’s process.extract() method to get the closest correctly spelled value, then overwrite the entity value and adjust the start and end positions.
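As a rough sketch of that entity-correction step: in a real Rasa NLU component this would live inside the component’s `process(message)` method, but here a plain dict stands in for the Message object and stdlib `difflib` stands in for fuzzywuzzy’s `process.extractOne()`. All names below are illustrative assumptions, not Rasa API guarantees:

```python
from difflib import SequenceMatcher

# Made-up lookup table of allowed staff names.
ALLOWED_NAMES = ["Alice Johnson", "Bob Smith"]

def correct_entity(entity, choices, threshold=0.8):
    """Overwrite a misspelled entity value with the closest lookup value.
    Only the end position changes, since the corrected value is written
    back starting at the same offset."""
    scored = [(SequenceMatcher(None, entity["value"].lower(), c.lower()).ratio(), c)
              for c in choices]
    score, match = max(scored)
    if score >= threshold:
        entity["value"] = match
        entity["end"] = entity["start"] + len(match)
    return entity

# A fake extracted entity, shaped like what ner_crf might produce:
entity = {"entity": "staff_name", "value": "Bob Smth", "start": 8, "end": 16}
print(correct_entity(entity, ALLOWED_NAMES))
```

If no lookup value clears the threshold, the entity is left untouched, so downstream actions can still decide how to handle an unrecognized name.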
I haven’t filtered based on the intent before, so I’m guessing that the extracted intent is passed as part of the message; you’ll need to research that yourself if you want to experiment with this idea.
Which component is extracting this value? From what I have observed, ner_crf sometimes fails to extract the entity at all. I’m clear about the rest of the approach: matching extracted values to the exact lookup values is trivial and can even be done on the actions side.