[Entity Extractor] Spacy / Lookup table

Hi,

Currently I’m extracting location entities with spaCy’s LOC label.
However, some small towns in France are not recognized.
So I looked into lookup tables. I put all the French cities in a .txt file, but the entity is still not extracted.

Here is my config.yml (734 Bytes) and part of the nlu.md:

## lookup:cities
data/lookup_tables/french_cities.txt

## intent:inform_location
- Il va courir au [Havre](cities)
- Il marche dans la ville de [Saint Sulpice Verdon](cities)
- Il veut faire son jogging à [Vernon](cities)
- Je vais lui proposer de faire un footing à [Rouen](cities)
- pas loin de [beauvais](cities)
- A [Paris](cities) demain a 18h
- [Lille](cities)
- L'activité est à [Arradon](cities)
- [Arradon](cities)

I also noticed that whatever path I specify for “french_cities.txt”, no error message is shown. Moreover, the training time seems relatively short to me given that this file contains 36700 cities.
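Since no error is reported for a wrong path, one way to rule that out is to check the file by hand before training. The following is a minimal sketch, assuming the path is resolved relative to the directory where training is launched (with a docker compose install, that would be inside the container). The helper name `count_lookup_entries` is hypothetical and not part of Rasa.

```python
from pathlib import Path


def count_lookup_entries(path):
    """Return the number of non-empty lines in a lookup file, or None if it is missing."""
    file = Path(path)
    if not file.is_file():
        return None
    with file.open(encoding="utf-8") as handle:
        # Blank lines are ignored so they do not inflate the count.
        return sum(1 for line in handle if line.strip())


# Path taken from the nlu.md excerpt above; adjust to your own layout.
count = count_lookup_entries("data/lookup_tables/french_cities.txt")
if count is None:
    print("lookup file not found; check the path relative to where training runs")
else:
    print(f"{count} entries found")
```

If the count is far below the expected number of cities, the file being read is probably not the one you intended.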

I’m using Rasa X with the docker compose manual install
Rasa X 0.31.0
Rasa SDK 1.10.2
Rasa 1.10.8

Hi Joseph!

I had the same question. Then, reading the NLU Training Data docs, I saw that lookup tables use regex patterns.

“When lookup tables are supplied in training data, the contents are combined into a large, case-insensitive regex pattern that looks for exact matches in the training examples. […] These regexes are processed identically to the regular regex patterns directly specified in the training data.”
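To make the quoted behaviour concrete, here is a minimal sketch of how a list of names can be combined into one case-insensitive pattern. It only illustrates the idea described in the docs and is not Rasa's actual implementation; the function name `build_lookup_regex` and the short city list are assumptions for the example.

```python
import re


def build_lookup_regex(entries):
    """Combine lookup entries into one case-insensitive, word-bounded pattern."""
    # Escape each entry so characters such as '-' or '.' are matched literally.
    escaped = [re.escape(entry.strip()) for entry in entries if entry.strip()]
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)


# A tiny stand-in for the 36700-line french_cities.txt file.
cities = ["Rouen", "Vernon", "Arradon", "Beauvais"]
pattern = build_lookup_regex(cities)

match = pattern.search("pas loin de beauvais")
print(match.group(0) if match else None)  # prints "beauvais"
```

The pattern only flags exact matches, so it works as a feature for a model rather than as an extractor on its own.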

Then, under Regular Expression Features on the same page:

“Regex features for entity extraction are currently only supported by the CRFEntityExtractor component! Hence, other entity extractors, like MitieEntityExtractor or SpacyEntityExtractor won’t use the generated features and their presence will not improve entity recognition for these extractors. Currently, all intent classifiers make use of available regex features.”

So I tried using CRFEntityExtractor to extract these entities, and I got better results. But I would like a clearer explanation of whether this is the right approach.

Tell me if you find something else, please!

Hi Flore,

To be honest I didn’t make the link between regex and lookup tables while reading the doc, but by reading an old article from 2018. Moreover, it mentioned an entity extractor “ner_crf”, so I added the following line:

- name: CRFEntityExtractor

to my config.yml, which I attached to the first post.

Despite this, the results are not good. I will try on a smaller sample and simplify the config.yml, to see if there is something interfering. I’ll keep you posted.

I simply removed everything related to the spaCy extractor, and city detection now works very well, too well…

The entity is extracted twice in a row, so I end up with a list of two elements containing the city name twice, even though “cities” is declared with type: unfeaturized.

Maybe I need to tag all the intents with use_entities/ignore_entities, but that seems quite laborious and I’m not sure it would solve the double extraction problem.

Here is what it looks like:

## New Story
* greet
    - utter_greet_which_intent
* request_activity
    - utter_ask_inform_physical_activity
* inform_location{"cities":"Arradon"}
    - slot{"cities":"Arradon"}
    - where_form
    - form{"name":"where_form"}
    - slot{"cities":"Arradon"}
    - slot{"cities":["Arradon","Arradon"]}
    - form{"name":null}
    - slot{"requested_slot":null}
    - action_environment_advice
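The duplicated slot value in the story above could be collapsed before it is stored, for example in the form's slot validation. This is only a sketch under that assumption; `dedupe_entity_values` is a hypothetical helper, not a Rasa function, and it works around the symptom rather than fixing the double extraction itself.

```python
def dedupe_entity_values(values):
    """Collapse repeated entity values, keeping first-seen order."""
    # A missing value or a single string needs no clean-up.
    if values is None or isinstance(values, str):
        return values
    unique = list(dict.fromkeys(values))
    # Return a plain string when only one distinct value remains.
    return unique[0] if len(unique) == 1 else unique


print(dedupe_entity_values(["Arradon", "Arradon"]))  # prints "Arradon"
```

Genuinely different cities are kept as a list, so a message that mentions two places is not silently reduced to one.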

On the other hand, city names that are very similar to each other are rather poorly detected. It confuses every city beginning with “Saint” with the city “Saint”.

Hi @JosephCHS! CRFEntityExtractor learns from the data. I saw this in the Components docs:

“As the featurizer is moving over the tokens in a user message with a sliding window, you can define features for previous tokens, the current token, and the next tokens in the sliding window. You define the features as [before, token, after] array.”

And then it shows this default configuration:

pipeline:
- name: "CRFEntityExtractor"
  # BILOU_flag determines whether to use BILOU tagging or not.
  "BILOU_flag": True
  # features to extract in the sliding window
  "features": [
    ["low", "title", "upper"],
    [
      "bias",
      "low",
      "prefix5",
      "prefix2",
      "suffix5",
      "suffix3",
      "suffix2",
      "upper",
      "title",
      "digit",
      "pattern",
    ],
    ["low", "title", "upper"],
  ]
  # The maximum number of iterations for optimization algorithms.
  "max_iterations": 50
  # weight of the L1 regularization
  "L1_c": 0.1
  # weight of the L2 regularization
  "L2_c": 0.1

Maybe by overriding this and excluding the prefix or suffix features, it could differentiate cities beginning with “Saint” from the city “Saint”.

Just an idea to try. On the other hand, if “Saint” is the only problem, maybe you could add a validation step for it.
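One possible shape for such a validation step is to look the message text up against the known city list and keep the longest name that matches, so that a full name wins over a bare “Saint”. This is a sketch only: `longest_city_in` is a hypothetical helper, the three sample names are made up for illustration, and a linear scan like this may be slow over the full 36700-entry file.

```python
import re


def longest_city_in(text, known_cities):
    """Return the longest known city name that appears in the text, or None."""
    hits = [
        city
        for city in known_cities
        # Word boundaries avoid matching a name inside a longer word.
        if re.search(r"\b" + re.escape(city) + r"\b", text, re.IGNORECASE)
    ]
    return max(hits, key=len) if hits else None


known = ["Saint", "Saint Sulpice Verdon", "Saint Malo"]
message = "Il marche dans la ville de Saint Sulpice Verdon"
print(longest_city_in(message, known))  # prints "Saint Sulpice Verdon"
```

The returned name could then replace the extracted value when the extractor only picked up the first word.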

Hello Joseph,

Did you find a solution to the double extraction problem?

Thanks.

Hi, I had to clean my data. Sometimes when I use Rasa X to validate the data, the entity extraction is done twice, but the highlighted text does not show this. So I manually clean up the data in my IDE. Doing this does not make the problem worse.