[Entity Extractor] Spacy / Lookup table

JosephCHS · July 20, 2020, 4:02pm

Hi,

I’m currently extracting location entities with spaCy’s LOC label.
However, some small towns in France are not recognized.
So I looked into lookup tables: I put all the French cities in a .txt file, but I still could not extract the entity.

Here is config.yml (734 Bytes) and a part of the nlu.md:

## lookup:cities
data/lookup_tables/french_cities.txt

## intent:inform_location
- Il va courir au [Havre](cities)
- Il marche dans la ville de [Saint Sulpice Verdon](cities)
- Il veut faire son jogging à [Vernon](cities)
- Je vais lui proposer de faire un footing à [Rouen](cities)
- pas loin de [beauvais](cities)
- A [Paris](cities) demain a 18h
- [Lille](cities)
- L'activité est à [Arradon](cities)
- [Arradon](cities)

I also noticed that no error message is raised, whatever path I specify for “french_cities.txt”. Moreover, training seems relatively short given that this file contains 36,700 cities.
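One way to rule out a path problem independently of Rasa is to check the file directly. The sketch below is a standalone check, not part of Rasa; the function name `count_lookup_entries` is hypothetical, and it assumes the lookup file holds one city per line.

```python
from pathlib import Path


def count_lookup_entries(path):
    """Return the number of non-empty lines in a lookup file.

    Raises FileNotFoundError when the path does not point to a file,
    which makes a wrong path visible instead of silently ignored.
    """
    file_path = Path(path)
    if not file_path.is_file():
        raise FileNotFoundError(f"Lookup table not found: {file_path}")
    with file_path.open(encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())
```

Running `count_lookup_entries("data/lookup_tables/french_cities.txt")` from the directory where training is launched (inside the container, for a Docker install) would show whether the file is reachable and roughly how many entries it holds.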

I’m using Rasa X with the docker compose manual install
Rasa X 0.31.0
Rasa SDK 1.10.2
Rasa 1.10.8

flore · July 20, 2020, 7:44pm

Hi Joseph!

I had the same question. Then, reading the NLU Training Data docs, I saw that lookup tables use regex patterns:

“When lookup tables are supplied in training data, the contents are combined into a large, case-insensitive regex pattern that looks for exact matches in the training examples. […] These regexes are processed identically to the regular regex patterns directly specified in the training data.”
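To make the quoted behaviour concrete, here is a minimal sketch of combining lookup entries into one case-insensitive regex. It only illustrates the idea described in the docs; it is not Rasa's actual implementation, and `build_lookup_pattern` is a hypothetical helper name.

```python
import re


def build_lookup_pattern(entries):
    """Combine lookup entries into one case-insensitive regex.

    Each entry is escaped and wrapped in word boundaries so that only
    whole-word, exact matches are found.
    """
    cleaned = [entry.strip() for entry in entries if entry.strip()]
    alternatives = "|".join(r"\b" + re.escape(entry) + r"\b" for entry in cleaned)
    return re.compile("(" + alternatives + ")", re.IGNORECASE)


cities = ["Rouen", "Vernon", "Arradon", "Lille"]
pattern = build_lookup_pattern(cities)

match = pattern.search("pas loin de rouen")
print(match.group(0) if match else None)  # prints "rouen"
```

The match succeeds even though the message is lowercase, which is what "case-insensitive" means here. A city missing from the list simply produces no match.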

Then, under Regular Expression Features on the same page:

“Regex features for entity extraction are currently only supported by the CRFEntityExtractor component! Hence, other entity extractors, like MitieEntityExtractor or SpacyEntityExtractor won’t use the generated features and their presence will not improve entity recognition for these extractors. Currently, all intent classifiers make use of available regex features.”

So I tried using CRFEntityExtractor to extract these entities, and I got better results. But I would like a clearer explanation of whether this is the right approach.

Tell me if you find something else, please!

JosephCHS · July 21, 2020, 7:46am

Hi Flore,

To be honest, I didn’t make the link between regex and lookup tables from reading the docs, but from reading an old article from 2018. It also mentioned an entity extractor called “ner_crf”, so I added the following line:

- name: CRFEntityExtractor

to my config.yml, which I attached to the first post.

Despite this, the results are not good. I will try on a smaller sample and simplify the config.yml, to see if there is something interfering. I’ll keep you posted.

JosephCHS · July 21, 2020, 10:16am

I simply removed everything related to the spaCy extractor, and city detection now works very well, too well…

The entity is extracted twice in a row, so I end up with a two-element list containing the city name twice, even though my “cities” slot is declared with type: unfeaturized.

Maybe I need to tag all the intents with use_entities/ignore_entities, but that seems quite laborious and I’m not sure it will solve the double extraction problem.

Here is what it looks like:

## New Story
* greet
    - utter_greet_which_intent
* request_activity
    - utter_ask_inform_physical_activity
* inform_location{"cities":"Arradon"}
    - slot{"cities":"Arradon"}
    - where_form
    - form{"name":"where_form"}
    - slot{"cities":"Arradon"}
    - slot{"cities":["Arradon","Arradon"]}
    - form{"name":null}
    - slot{"requested_slot":null}
    - action_environment_advice
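
As a stopgap for the duplicated slot value in the story above, the list can be collapsed before it is used. This is a plain-Python sketch, not a Rasa API; both helper names are hypothetical, and where to call them (for example in a custom action or form validation) is left as an assumption.

```python
def dedupe_preserving_order(values):
    """Remove case-insensitive duplicates while keeping the first occurrence."""
    seen = set()
    unique = []
    for value in values:
        key = value.casefold()
        if key not in seen:
            seen.add(key)
            unique.append(value)
    return unique


def normalize_city_slot(value):
    """Collapse a duplicated list such as ["Arradon", "Arradon"] to one string.

    Lists with several distinct cities are returned deduplicated;
    non-list values are returned unchanged.
    """
    if isinstance(value, list):
        unique = dedupe_preserving_order(value)
        return unique[0] if len(unique) == 1 else unique
    return value


print(normalize_city_slot(["Arradon", "Arradon"]))  # prints "Arradon"
```

This hides the symptom only; it does not explain why the entity is extracted twice.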

On the other hand, city names that are very similar to each other are detected rather poorly. The extractor confuses every city beginning with “Saint” with the city “Saint”.
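
One possible contributor to this kind of confusion, in plain regex terms, is alternation order: if a short entry comes first, it wins over a longer entry that starts with the same word. The sketch below shows the effect and a longest-first ordering that avoids it. It is only an illustration with hypothetical helper names; the thread does not establish how Rasa orders lookup entries or whether this is the actual cause.

```python
import re


def build_pattern(entries):
    """Build a case-insensitive alternation in the order given."""
    alternatives = "|".join(re.escape(entry) for entry in entries)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


def build_longest_first_pattern(entries):
    """Same as build_pattern, but longer entries are tried first."""
    ordered = sorted(set(entries), key=lambda entry: (-len(entry), entry))
    return build_pattern(ordered)


cities = ["Saint", "Saint Sulpice Verdon", "Vernon"]
text = "Il marche dans la ville de Saint Sulpice Verdon"

print(build_pattern(cities).search(text).group(0))
# prints "Saint"
print(build_longest_first_pattern(cities).search(text).group(0))
# prints "Saint Sulpice Verdon"
```

The same longest-first idea could be applied as a post-processing check on the extracted value against the city list.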

flore · July 24, 2020, 2:55pm

Hi @JosephCHS! CRFEntityExtractor learns from the data. I saw this on the Components page:

“As the featurizer is moving over the tokens in a user message with a sliding window, you can define features for previous tokens, the current token, and the next tokens in the sliding window. You define the features as [before, token, after] array.”
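
The [before, token, after] layout from the quote can be sketched in plain Python as follows. This is a simplified illustration of a sliding window over tokens, not the CRFEntityExtractor code; the function names are hypothetical and only a handful of the feature names are implemented here.

```python
# Simplified versions of a few feature names used in the layout.
FEATURE_FUNCTIONS = {
    "low": lambda token: token.lower(),
    "title": lambda token: token.istitle(),
    "upper": lambda token: token.isupper(),
    "digit": lambda token: token.isdigit(),
    "prefix2": lambda token: token[:2],
    "suffix3": lambda token: token[-3:],
}


def window_features(tokens, index, layout):
    """Collect features for the previous, current, and next token.

    `layout` is a [before, token, after] list of feature-name lists.
    Positions outside the message are skipped.
    """
    features = {}
    for offset, names in zip((-1, 0, 1), layout):
        position = index + offset
        if 0 <= position < len(tokens):
            for name in names:
                features[f"{offset}:{name}"] = FEATURE_FUNCTIONS[name](tokens[position])
    return features


tokens = ["pas", "loin", "de", "Rouen"]
layout = [["low", "title"], ["low", "title", "suffix3"], ["low", "title"]]
print(window_features(tokens, 3, layout))
```

For the last token, "Rouen", there is no next token, so only the "-1:" and "0:" features appear in the result.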

The page then shows this default configuration:

pipeline:
- name: "CRFEntityExtractor"
  # BILOU_flag determines whether to use BILOU tagging or not.
  "BILOU_flag": True
  # features to extract in the sliding window
  "features": [
    ["low", "title", "upper"],
    [
      "bias",
      "low",
      "prefix5",
      "prefix2",
      "suffix5",
      "suffix3",
      "suffix2",
      "upper",
      "title",
      "digit",
      "pattern",
    ],
    ["low", "title", "upper"],
  ]
  # The maximum number of iterations for optimization algorithms.
  "max_iterations": 50
  # weight of the L1 regularization
  "L1_c": 0.1
  # weight of the L2 regularization
  "L2_c": 0.1

Maybe by overriding this and excluding the prefix or suffix features, it can differentiate cities beginning with “Saint” from the city “Saint”.
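
For anyone trying this, the feature list without the prefix and suffix entries can be derived from the default shown above. The snippet below is only a convenience sketch with a hypothetical helper name; the resulting lists would still need to be written into the "features" key of config.yml by hand, and whether this improves the “Saint” case is untested.

```python
# Default [before, token, after] layout copied from the configuration above.
DEFAULT_FEATURES = [
    ["low", "title", "upper"],
    [
        "bias", "low", "prefix5", "prefix2", "suffix5", "suffix3",
        "suffix2", "upper", "title", "digit", "pattern",
    ],
    ["low", "title", "upper"],
]


def without_affix_features(layout):
    """Return a copy of the layout with all prefix*/suffix* features removed."""
    return [
        [name for name in group if not name.startswith(("prefix", "suffix"))]
        for group in layout
    ]


print(without_affix_features(DEFAULT_FEATURES)[1])
# prints ['bias', 'low', 'upper', 'title', 'digit', 'pattern']
```

Note that "pattern" is kept, since that is the feature through which the lookup table reaches the extractor.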

Just an idea to try. On the other hand, if “Saint” is the only problem, maybe you can add a validation step for it.

forwitai · September 16, 2020, 1:47pm

Hello Joseph,

Did you find a solution to the double extraction problem?

Thanks.

JosephCHS · October 3, 2020, 6:17pm

Hi, I had to clean my data. Sometimes when I use Rasa X to validate the data, the entity is annotated twice, but the highlighted text does not show this. So I clean up the data manually in my IDE. Doing this does not make the problem any worse.
