How to configure Rasa regex to prefer the longer match?

Hi there, I’m trying out lookup tables to detect cities. My lookup table contains both “batu” (from Indonesia) and “batu caves” (from Malaysia) as examples. I’m using RegexEntityExtractor in my pipeline.

Once the model was trained, I typed in: “What’s the weather in Batu Caves?”

However, the model decided to extract the entity “batu” instead of “batu caves”.

Is there a way to customize the regex behaviour to prefer longer matches?

Update - Here is what my training data looks like:

image

Inside my cities.yml:

image

The ‘batu’ section I was referring to:

Update: I SOLVED IT! :smiley:

So what I did was sort my lookup entries in descending order of length, so that the entries with the most characters are at the top of the list. Now the regex extractor automatically picks the longest matching entry!
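My guess at why this works: if the lookup entries get joined into a single regex alternation (an assumption on my part, I have not checked this against the Rasa source), then Python's `re` tries the alternatives from left to right and stops at the first one that matches, not the longest one. Here is a minimal sketch of that behaviour. `build_lookup_pattern` and `first_match` are hypothetical helpers I wrote for illustration, not Rasa functions:

```python
import re


def build_lookup_pattern(entries):
    """Join lookup entries into one case-insensitive alternation (illustrative only)."""
    alternatives = "|".join(re.escape(entry) for entry in entries)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


def first_match(entries, text):
    """Return the text of the first match, or None if nothing matches."""
    match = build_lookup_pattern(entries).search(text)
    return match.group(0) if match else None


text = "What's the weather in Batu Caves?"
original_order = ["batu", "batu caves"]
# Same entries, longest first.
longest_first = sorted(original_order, key=len, reverse=True)

print(first_match(original_order, text))  # Batu
print(first_match(longest_first, text))   # Batu Caves
```

With "batu" listed first, the alternation succeeds on "Batu" and never gets to try "batu caves". Putting the longer entry first flips that.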

It’s not your traditional workaround, but it works. For anyone curious how I sorted the entries in descending order of character length, here’s what I did:

  1. I converted the lookup.yml file into a .csv file by changing the file extension
  2. I opened the .csv file in Excel
  3. I removed the header lines (version, nlu:, lookup: city, examples: |)
  4. I used this method to sort the entries by character length, then deleted the length column
  5. I saved the .csv file, then renamed it back to .yml
  6. I re-added the headers with Notepad and saved the file as UTF-8 (since Notepad++ gave me issues for some reason)
  7. I popped the file back into the data folder, trained my NLU model, and everything works like a charm!
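If you would rather skip the Excel round trip, the same sort can probably be done with a few lines of Python. This is only a sketch: `sort_lookup_examples` is a name I made up, and it assumes the usual layout where each entry is a "- " line indented under an `examples: |` line. It works on plain text, so it needs no YAML library, and everything outside the examples blocks is left untouched.

```python
def sort_lookup_examples(yaml_text):
    """Sort the entries under each `examples: |` line, longest first."""
    out = []
    block = []
    key_indent = None  # indentation of the current `examples:` line, if any

    def flush():
        # Longest entry first; ties are broken alphabetically.
        block.sort(key=lambda ln: (-len(ln.strip()), ln.strip()))
        out.extend(block)
        block.clear()

    for line in yaml_text.splitlines():
        stripped = line.strip()
        indent = len(line) - len(line.lstrip())
        is_entry = (
            key_indent is not None
            and indent > key_indent
            and stripped.startswith("- ")
        )
        if is_entry:
            block.append(line)
            continue
        # Any other line ends the current block of entries.
        flush()
        key_indent = indent if stripped.startswith("examples:") else None
        out.append(line)
    flush()
    return "\n".join(out) + "\n"


# Small made-up sample in the lookup table layout.
sample = (
    'version: "3.1"\n'
    "nlu:\n"
    "- lookup: city\n"
    "  examples: |\n"
    "    - batu\n"
    "    - batu caves\n"
    "    - ipoh\n"
)
print(sort_lookup_examples(sample))
```

To apply it to a real file, read the text with `pathlib.Path(...).read_text(encoding="utf-8")`, pass it through the function, and write the result back with `write_text(..., encoding="utf-8")`. Note that a blank line inside an examples block ends the sorted run in this sketch, so entries after it stay in their original order.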

Hope this helps someone out!
