Using FuzzyWuzzy with lookup tables

ap_rasa · November 29, 2019, 2:06am

Based on this blog post, Entity extraction with the new lookup table feature in Rasa NLU, I am wondering if I am using the FuzzyWuzzy library correctly.

When my custom component processes a message, it iterates through the message tokens and uses the FuzzyWuzzy library (GitHub - seatgeek/fuzzywuzzy: Fuzzy String Matching in Python) to search the lookup table. If it finds a match above a certain threshold, it returns the matched entity.

Is this what the blog post had in mind, or should I be doing something else to allow for typos in lookup tables?

stephens · November 29, 2019, 5:42pm

Anthony, it’s great that you’ve created a FuzzyWuzzy component. It sounds like the right approach. Can you share your component? You can see examples of entity extractors in the Rasa project here. The spaCy entity extractor is a good example.

Have you tried running an evaluation of the results of your FuzzyWuzzy extractor vs. a lookup table?

ap_rasa · November 30, 2019, 4:15pm

In its current form, my fuzzy component catches things that the lookup table does not find, but without some tuning it finds too many extra entities. My main concern, though, is how it is being used. The current implementation is more of a brute-force component that just searches over a list, on top of any CRF entity extraction. I was wondering if the blog post mentioned above had something else in mind.

My implementation is as follows:

import json
import os

from fuzzywuzzy import process
from rasa.nlu.components import Component


class FuzzyExtractor(Component):
    name = "FuzzyExtractor"
    provides = ["entities"]
    requires = ["tokens"]
    defaults = {}
    language_list = ["en"]
    threshold = 90

    def __init__(self, component_config=None, *args):
        super(FuzzyExtractor, self).__init__(component_config)

    def train(self, training_data, cfg, **kwargs):
        pass

    def process(self, message, **kwargs):

        entities = list(message.get('entities'))

        # Get file path of lookup table in json format.
        # os.path.join picks the right separator on every OS.
        cur_path = os.path.dirname(__file__)
        lookup_file_path = os.path.join(
            cur_path, '..', 'data', 'lookup_master.json')

        with open(lookup_file_path, 'r') as file:
            lookup_data = json.load(file)['data']

            tokens = message.get('tokens')

            for token in tokens:

                # STOP_WORDS is just a dictionary of stop words from NLTK
                if token.text not in STOP_WORDS:

                    fuzzy_results = process.extract(
                                             token.text, 
                                             lookup_data, 
                                             processor=lambda a: a['value'] 
                                                 if isinstance(a, dict) else a, 
                                             limit=10)

                    for result, confidence in fuzzy_results:
                        if confidence >= self.threshold:
                            entities.append({
                                "start": token.offset,
                                "end": token.end,
                                "value": token.text,
                                "fuzzy_value": result["value"],
                                "confidence": confidence,
                                "entity": result["entity"]
                            })

        message.set("entities", entities, add_to_output=True)

stephens · December 1, 2019, 5:31pm

This looks correct to me, although you would want to load the lookup table in the __init__ call before putting this in production, so the file is not re-read on every message.
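
A minimal sketch of that idea, kept free of Rasa imports so it runs on its own. The class name PreloadedLookup and its lookup_file_path argument are hypothetical; the JSON layout ({"data": [...]}) is the one used in the component above.

```python
import json


class PreloadedLookup:
    """Hypothetical helper: reads the lookup table once and keeps it in memory."""

    def __init__(self, lookup_file_path):
        # Read the file a single time at construction, not once per message.
        with open(lookup_file_path, "r", encoding="utf-8") as f:
            self.lookup_data = json.load(f)["data"]

    def values(self):
        # Convenience view of the entry values, assuming each entry is a
        # dict with a "value" key as in the component above.
        return [entry["value"] for entry in self.lookup_data]
```

In the component itself, the same loading code would presumably move into FuzzyExtractor.__init__ after the super() call, and process() would then read self.lookup_data instead of opening the file.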

isic5 · January 13, 2020, 4:44pm

@ap_rasa Hi Anthony, I am trying to do pretty much the same thing that you did here, so first and foremost, thanks for providing this code, as it will speed up my development a lot. Did you learn anything else while implementing this? Did you make any changes or run into any issues? Any input is greatly appreciated!

ap_rasa · January 14, 2020, 1:50pm

Hey, the only issues that we had to deal with were latency and the FuzzyWuzzy library finding too many items.

In the implementation above, you can see that it searches on every qualifying token in the message. Depending on the incoming message, this can be too slow, so I had to experiment with deciding which words truly need to be searched, in addition to filtering with the stop words set (in the code it's a dictionary). I am currently experimenting with the idea of only searching on entities that have already been found at this point. This means your models have to be good enough to find roughly the correct entities in the first place.
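
A rough sketch of the "only search on entities that were already found" idea, not the actual implementation. To keep it runnable without extra installs, it uses difflib from the standard library as a stand-in scorer on a 0-100 scale; a FuzzyWuzzy scorer could be passed in through the scorer argument instead. The function names and the flat list of lookup values are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in scorer (0-100) based on difflib, not FuzzyWuzzy."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def correct_entities(entities, lookup_values, threshold=90, scorer=similarity):
    """Fuzzy-match only the values of entities an earlier extractor found."""
    corrected = []
    for entity in entities:
        best_value, best_score = None, -1
        for candidate in lookup_values:
            score = scorer(entity["value"], candidate)
            if score > best_score:
                best_value, best_score = candidate, score
        updated = dict(entity)  # copy so the input list is left untouched
        if best_score >= threshold:
            updated["fuzzy_value"] = best_value
            updated["confidence"] = best_score
        corrected.append(updated)
    return corrected
```

The number of fuzzy searches then scales with the number of extracted entities rather than with the number of tokens in the message.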

The next issue was finding too many candidates once the fuzzy search executes. You can tweak the threshold score that the library's results are compared against, but I also added a brevity penalty on top of the threshold to get the final cutoff. The penalty is pretty much the length of the fuzzy result minus the length of the searched value.
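
A small sketch of how such a penalty could work, following the description above literally. The function names are hypothetical, and clamping the penalty at zero (so a shorter result never lowers the cutoff) is an assumption, not something confirmed here.

```python
def brevity_penalty(searched_value, fuzzy_result):
    # Length of the fuzzy result minus the length of the searched value.
    # Clamping at zero is an assumption made for this sketch.
    return max(0, len(fuzzy_result) - len(searched_value))


def accept_match(searched_value, fuzzy_result, score, threshold=90):
    # The penalty raises the cutoff, so a much longer lookup entry needs
    # a higher fuzzy score before it is accepted.
    return score >= threshold + brevity_penalty(searched_value, fuzzy_result)
```

For example, with a threshold of 90, matching "apple" against "pineapple" gives a penalty of 4, so the match needs a score of at least 94 to be kept.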

Hope this helps.

siva_g · April 17, 2020, 9:43am

Hi @stephens ,

I am new to Rasa and have some questions about NER entity extraction with lookup tables.

For example, today I have some products (Apple, Strawberry, Pineapple) in my MySQL product table, and I have created an intent and entities with those three products.

But day to day I will add new products (Milk, Rice) to the MySQL table. How can I get a new product identified as an entity?

Please give me some suggestions. @ap_rasa

stephens · April 17, 2020, 6:39pm

Hi Siva,

You could have a process that is run at a regular interval or triggered when a SQL update occurs. The process would export the list of products to a lookup table and then kick off model training and updating of the bot.
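
A minimal sketch of the export step. It uses sqlite3 from the standard library as a stand-in for MySQL so that it runs anywhere; the table name product, the column name, and the output path are hypothetical. It writes one product name per line, which is the plain-text layout a lookup table file uses.

```python
import sqlite3


def export_lookup_table(connection, out_path, table="product", column="name"):
    """Write the distinct product names to a newline-separated lookup file."""
    cursor = connection.cursor()
    # table and column are hypothetical and must come from trusted
    # configuration, never from user input, since they are interpolated.
    cursor.execute(f"SELECT DISTINCT {column} FROM {table} ORDER BY {column}")
    names = [row[0] for row in cursor.fetchall() if row[0]]
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(names) + "\n")
    return names
```

A scheduled job or database trigger could call this and then start retraining, for example by running rasa train, before swapping in the new model.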

Greg

siva_g · April 18, 2020, 12:24pm

Thanks for your response @stephens, I’ll follow your steps. Thanks once again.

saimanoj2826 · May 7, 2021, 12:09pm

Can anyone show me a config file that uses FuzzyWuzzy? I don’t know where to place it in the pipeline.
