Lookup table not working for entities with multiple words

I am using a lookup table to extract an entity “country”, which has about 196 values. The lookup table looks something like this:

Sierra Leone
Puerto Rico
Belgium
Palau
Belize
Indonesia
Brunei
Macao
Hong Kong
Nicaragua
South Africa
Montserrat
Syria
Australia
Jordan
Guinea
Libya
Paraguay
St. Lucia
Israel 
Nigeria
Barbados
Kazakstan
Aland Islands
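
For reference, in the Rasa 1.x markdown training data format a lookup table is wired in roughly like this (the file path here is hypothetical; the values above would sit in that file, one per line):

```md
## lookup:country
data/lookup_tables/countries.txt
```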

Results :

get me downloads count in hong kong
    {
      "intent": {
        "name": "getAppDownloadsCount",
        "confidence": 0.9998836517333984
      },
      "entities": [
        {
          "start": 26,
          "end": 30,
          "value": "hong",
          "entity": "country",
          "confidence": 0.5102069739663814,
          "extractor": "CRFEntityExtractor"
        },
        {
          "start": 31,
          "end": 35,
          "value": "kong",
          "entity": "country",
          "confidence": 0.7573467555695853,
          "extractor": "CRFEntityExtractor"
        }
      ],
      "intent_ranking": [
        {
          "name": "getAppDownloadsCount",
          "confidence": 0.9998836517333984

Ideally I should get "hong kong" as one single value. Can somebody help, or at least explain why this happens with lookup tables? CRF works fine with multi-word entities that are present in the training data, but when the values are only in the lookup table we get results like this.

What version of Rasa are you using? And what does your config file look like?

Rasa version: 1.7.0

# Configuration for Rasa NLU.
# https://rasa.com/docs/rasa/nlu/components/
language: en
pipeline: pretrained_embeddings_spacy
# Configuration for Rasa Core.
# https://rasa.com/docs/rasa/core/policies/
policies:
  - name: MemoizationPolicy
  - name: KerasPolicy
  - name: MappingPolicy

I just tried to reproduce the issue. For me everything seems to be working. Just to clarify, if you add the lookup table to your NLU data, Rasa predicts

"entities": [
        {
          "start": 26,
          "end": 30,
          "value": "hong",
          "entity": "country",
          "confidence": 0.5102069739663814,
          "extractor": "CRFEntityExtractor"
        },
        {
          "start": 31,
          "end": 35,
          "value": "kong",
          "entity": "country",
          "confidence": 0.7573467555695853,
          "extractor": "CRFEntityExtractor"
        }

But if you train without the lookup table, Rasa will combine the two entities into one?

Yes, but “hong kong” is not two different entities; it should come back as one single value.

I am assuming "hong kong" is not present in the training data, because if [hong kong] is in the training data it comes out as a single entity value. Only multi-word entities that exist solely in the lookup table give me this issue.

I was able to reproduce the error and created an issue for it (CRFEntityExtractor splits one entity into two · Issue #5377 · RasaHQ/rasa · GitHub).

It should not be related to the lookup tables. It seems to be related to the BILOU_flag. Can you try to train your bot with the following pipeline and check if it works? Thanks.

language: "en"

pipeline:
- name: "SpacyNLP"
- name: "SpacyTokenizer"
- name: "SpacyFeaturizer"
- name: "RegexFeaturizer"
- name: "CRFEntityExtractor"
  BILOU_flag: True
- name: "EntitySynonymMapper"
- name: "SklearnIntentClassifier"
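
For context on what this flag does: with BILOU_flag: True the CRF tags every token with a Beginning/Inside/Last/Unit/Outside prefix, so a correctly trained model should tag a two-word country as one contiguous span. A sketch of the intended tagging for the example sentence:

```
get  me  downloads  count  in  hong       kong
O    O   O          O      O   B-country  L-country
```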

Yes, I trained the bot with the mentioned pipeline and I get the same results.

get me downloads count in hong kong
{
  "intent": {
    "name": "getAppDownloadsCount",
    "confidence": 0.9592514339348922
  },
  "entities": [
    {
      "start": 26,
      "end": 30,
      "value": "hong",
      "entity": "country",
      "confidence": 0.7003851556965186,
      "extractor": "CRFEntityExtractor"
    },
    {
      "start": 31,
      "end": 35,
      "value": "kong",
      "entity": "country",
      "confidence": 0.7234106247445744,
      "extractor": "CRFEntityExtractor"
    }
  ],

Hello,

I had the same problem, but I solved it by adding nlu data with multiple words:

[…] [Costa Rica](country) […]
[…] [Dominican Republic](country) […]
[…] [Central African Republic](country) […]
[…] […] [Papua New Guinea](country) […]
[…] [Papua New Guinea](country) […]
...

Yes, I did the same thing. Basically, the training data should cover a few values of each type of variation present in the lookup table; that way it learns better. @Tanja

Hmm. Similar problem here. I did add several examples but not quite there.

Here’s what I have in nlu.md (listing only the relevant examples)

- list all [experts](expert_name) in the country of [South Africa]{"entity":"country", "value":"South Africa"} 
- list all [experts](expert_name) in the country of [New Zealand]{"entity":"country", "value":"New Zealand"}
- [United Kingdom](country)
- [United States](country)

My config.yml is as follows

    language: en
    pipeline:
      - name: SpacyNLP
      - name: DeSymbolizer
      - name: SpacyTokenizer
      - name: SpacyFeaturizer
      - name: RegexFeaturizer
      - name: LexicalSyntacticFeaturizer
      - name: CountVectorsFeaturizer
      - name: CountVectorsFeaturizer
        analyzer: "char_wb"
        min_ngram: 1
        max_ngram: 4
      - name: DIETClassifier
        epochs: 100
      - name: EntitySynonymMapper
      - name: ResponseSelector
        epochs: 100

    # Configuration for Rasa Core.
    # https://rasa.com/docs/rasa/core/policies/
    policies:
      - name: MemoizationPolicy
      - name: TEDPolicy
        max_history: 5
        epochs: 100
      - name: MappingPolicy

It still didn’t catch “El Salvador”; it extracted “Salvador” as the country name and dropped “El”. Same with “Marshall Islands”: it picked up only “Islands”. Maybe it recognizes “Marshall” as a person’s name and drops it?

Do I need to add more samples? I also have countries with “&”; it looks like I will need to add a few samples of those as well for it to work.

Yes, you need to cover all entities in the training data.
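
For example (the sentences here are made up for illustration), the failing multi-word values can be annotated directly in nlu.md so the CRF sees examples of each word-count pattern, including values containing “&”:

```md
## intent:getAppDownloadsCount
- get me downloads count in [El Salvador](country)
- list all [experts](expert_name) in the country of [Marshall Islands](country)
- downloads count for [Trinidad & Tobago](country)
```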

Hi, can I get some guidance please? I’m using the CRFEntityExtractor with BILOU_flag set to True, and my model seems to be splitting entities. I was under the impression that for compound entities to be treated as a single entity I need BILOU tagging, which seemed to be working fine for me until now. I recently switched from SklearnIntentClassifier to DIETClassifier (configured for intent classification only).

I know one solution is to add training phrases, and I’ve done that, but since I can’t possibly add all entity values in training, I need a better solution, ideally one at the config level.

@iszainab Can you please share some more details? What Rasa version are you using? What does your pipeline look like? And can you please give an example of an entity that is split? Thanks.

I have the same problem. The Rasa version I am using is 1.10.9, and the NLU pipeline is:

pipeline:
- name: "SpacyNLP"
  case_sensitive: False
- name: "SpacyTokenizer"
- name: "SpacyFeaturizer"
  pooling: "mean"
- name: "RegexFeaturizer"
  case_sensitive: False
- name: "CRFEntityExtractor"
  epochs: 300
- name: "LexicalSyntacticFeaturizer"
- name: "CountVectorsFeaturizer"
- name: "CountVectorsFeaturizer"
  analyzer: "char_wb"
  min_ngram: 3
  max_ngram: 5
- name: "DIETClassifier"
  entity_recognition: False
  epochs: 300
- name: "EntitySynonymMapper"

Just a thought regarding your case: could your issue have something to do with having two different entity extractors in your pipeline, “CRFEntityExtractor” and “DIETClassifier”?

I had an issue with that myself some time ago. (I am a newbie with Rasa) :slight_smile:

I don’t know… but I don’t think this is the problem, since I use DIET only to classify the intent (“entity_recognition” is set to False).

Ok. I thought it was worth a shot at least. :slight_smile:


I currently have the same problem. Were you able to solve it somehow? Thanks

Were you able to solve the problem? I’m also facing the same issue.