Using FuzzyWuzzy with lookup tables

Based on this blog post, Entity extraction with the new lookup table feature in Rasa NLU, I am wondering if I am using the FuzzyWuzzy library correctly.

Basically, when my custom component processes a message, it iterates through the message tokens and uses the FuzzyWuzzy library (GitHub: seatgeek/fuzzywuzzy, Fuzzy String Matching in Python) to search through the lookup table. If it finds anything above a certain threshold, it returns the found entity.

Is this what the blog post had in mind or should I generally be doing something else to allow typos in lookup tables?

Anthony, it’s great that you’ve created a FuzzyWuzzy component. It sounds like the right approach. Can you share your component? You can see examples of entity extractors in the Rasa project here. The spaCy entity extractor is a good example.

Have you tried running an evaluation of the results of your FuzzyWuzzy extractor vs. a lookup table?

In its current form, my fuzzy component catches things that the lookup table does not find, but without some tuning it finds too many extra entities. My main concern, though, is how it is being used. In its current implementation, it's more of a brute-force component that just searches over a list on top of any CRF entity extraction. I was wondering if the blog post mentioned above had something else in mind.

My implementation is as follows:

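import json
import os

from fuzzywuzzy import process
# Import path for Rasa NLU 0.x; adjust it for your Rasa version
from rasa_nlu.components import Component


class FuzzyExtractor(Component):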
    name = "FuzzyExtractor"
    provides = ["entities"]
    requires = ["tokens"]
    defaults = {}
    language_list = ["en"]
    threshold = 90

    def __init__(self, component_config=None, *args):
        super(FuzzyExtractor, self).__init__(component_config)

    def train(self, training_data, cfg, **kwargs):
        pass

    def process(self, message, **kwargs):

        entities = list(message.get('entities'))

        # Get file path of lookup table in json format
        # (os.path.join picks the right separator on every OS)
        cur_path = os.path.dirname(__file__)
        lookup_file_path = os.path.join(
            cur_path, '..', 'data', 'lookup_master.json')

        with open(lookup_file_path, 'r') as file:
            lookup_data = json.load(file)['data']

            tokens = message.get('tokens')

            for token in tokens:

                # STOP_WORDS is just a dictionary of stop words from NLTK
                if token.text not in STOP_WORDS:

                    fuzzy_results = process.extract(
                                             token.text, 
                                             lookup_data, 
                                             processor=lambda a: a['value'] 
                                                 if isinstance(a, dict) else a, 
                                             limit=10)

                    for result, confidence in fuzzy_results:
                        if confidence >= self.threshold:
                            entities.append({
                                "start": token.offset,
                                "end": token.end,
                                "value": token.text,
                                "fuzzy_value": result["value"],
                                "confidence": confidence,
                                "entity": result["entity"]
                            })

        message.set("entities", entities, add_to_output=True)

This looks correct to me, although you would want to load the lookup table in the __init__ call before putting this in production.
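
To illustrate the point about loading in __init__, here is a minimal sketch. It uses a plain class as a stand-in for the real component, so the class name and the exact-match lookup in process() are assumptions for illustration only; the JSON layout (a top-level 'data' list of entries with 'value' and 'entity') follows the code above.

```python
import json


class LookupTableHolder:
    """Plain class standing in for the custom component (hypothetical name)."""

    def __init__(self, lookup_file_path):
        # Read the JSON file a single time, when the component is created.
        with open(lookup_file_path, "r") as lookup_file:
            self.lookup_data = json.load(lookup_file)["data"]

    def process(self, token_text):
        # No file access here: each message only reads the in-memory list.
        # The exact match is a placeholder for the fuzzy search.
        return [entry for entry in self.lookup_data
                if entry["value"].lower() == token_text.lower()]
```

With this shape the file is opened once per component instance instead of once per message, which should matter most when the lookup table is large.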

@ap_rasa Hi Anthony, I am trying to do pretty much the same thing that you already did here, so first and foremost, thanks for providing this code, as it will speed up my development a lot. Did you learn anything else while implementing this? Did you make any changes or run into issues? Any input is greatly appreciated!

Hey, the only issues that we had to deal with were latency and the FuzzyWuzzy library finding too many items.

In the implementation above, you can see that it searches through all qualifying tokens that were found. Depending on the incoming message, this might be a little too slow, so I had to experiment with deciding which words truly needed to be searched, in addition to filtering with the stop-word set (in the code it's a dictionary). I am currently experimenting with the idea of only searching through entities that have already been found at this point. This means your models have to be good enough to find the generally correct entities in the first place.
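
A rough sketch of that "only search the entities already found" idea might look like the following. To keep it dependency-free it uses difflib from the standard library as a stand-in scorer rather than FuzzyWuzzy, and the function names, the threshold of 80, and the flat list of lookup values are all assumptions for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in scorer on a 0-100 scale (difflib, not FuzzyWuzzy)."""
    ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return int(round(100 * ratio))


def correct_entities(entities, lookup_values, threshold=80):
    """Fuzzy-match only the values of entities found upstream."""
    corrected = []
    for entity in entities:
        # Score this one entity value against every lookup entry.
        scored = [(similarity(entity["value"], v), v) for v in lookup_values]
        best_score, best_value = max(scored, default=(0, None))
        updated = dict(entity)  # leave the upstream entity untouched
        if best_score >= threshold:
            updated["fuzzy_value"] = best_value
            updated["confidence"] = best_score
        corrected.append(updated)
    return corrected
```

The number of fuzzy searches then scales with the number of extracted entities rather than the number of tokens, at the cost of missing anything the upstream extractor did not tag.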

The next issue was finding too many candidates once the fuzzy search executes. You can tweak the threshold score built into the library, but I also added a brevity penalty on top of the threshold score to arrive at a final value. The penalty is pretty much the length of the fuzzy result minus the length of the searched value.
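
One possible reading of that brevity penalty, as a sketch: the function name is hypothetical, the score is assumed to come from the fuzzy search on a 0-100 scale, and clamping the penalty at zero for shorter results is my own assumption rather than something stated above.

```python
def passes_with_brevity_penalty(query, candidate, score, base_threshold=90):
    """Return True if the fuzzy score clears the length-adjusted threshold."""
    # Length of the fuzzy result minus the length of the searched value.
    # Clamped at 0 (an assumption) so shorter results never lower the bar.
    penalty = max(0, len(candidate) - len(query))
    return score >= base_threshold + penalty


# "apple" against "pineapple": 4 extra characters raise the bar from 90 to 94.
print(passes_with_brevity_penalty("apple", "pineapple", 92))
```

The effect is that a short token needs a higher score before it is accepted as a match for a much longer lookup entry.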

Hope this helps.

Hi @stephens ,

I am new to Rasa and have some doubts about NER entity extraction with lookup tables.

For example, today I have some products (Apple, Strawberry, Pineapple) in my MySQL product table, and I have created an intent and entities for these three products.

But I add new products (Milk, Rice) to the MySQL table on a daily basis. How can I get each new product identified as an entity?

Please give me some suggestions. @ap_rasa

Hi Siva,

You could have a process that is run at a regular interval or triggered when a SQL update occurs. The process would export the list of products to a lookup table and then kick off model training and updating of the bot.
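
As a minimal sketch of the export step, assuming a hypothetical products table with a name column: this uses sqlite3 from the standard library as a stand-in for MySQL, and writes a newline-separated text file, which is the format I would expect for a lookup table file. Scheduling the job and kicking off training are left out.

```python
import sqlite3


def export_lookup_table(connection, out_path):
    """Write every product name to a newline-separated lookup file."""
    # "products" and "name" are hypothetical table and column names.
    query = "SELECT name FROM products ORDER BY name"
    rows = connection.execute(query).fetchall()
    with open(out_path, "w", encoding="utf-8") as lookup_file:
        for (name,) in rows:
            lookup_file.write(name + "\n")
    # Returning the count makes it easy to log or sanity-check the export.
    return len(rows)
```

With a real MySQL driver you would typically go through a cursor rather than calling execute on the connection, but the overall shape should be the same.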

Greg

Thanks for your response @stephens, I’ll follow your steps. Thanks once again.

Can anyone show me a config file that uses the FuzzyWuzzy component? I don’t know where to place it in the pipeline.