Lookup table not working for entities with multiple words

I am using a lookup table to extract an entity “country”, which has about 196 values. The lookup table looks something like this:

Sierra Leone
Puerto Rico
Belgium
Palau
Belize
Indonesia
Brunei
Macao
Hong Kong
Nicaragua
South Africa
Montserrat
Syria
Australia
Jordan
Guinea
Libya
Paraguay
St. Lucia
Israel 
Nigeria
Barbados
Kazakstan
Aland Islands
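
For reference, in the Rasa 1.x markdown training data format a lookup table is wired in roughly like this (the file path here is hypothetical; the values above would sit in that file, one per line):

```md
## lookup:country
data/lookup_tables/countries.txt
```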

Results :

get me downloads count in hong kong
    {
      "intent": {
        "name": "getAppDownloadsCount",
        "confidence": 0.9998836517333984
      },
      "entities": [
        {
          "start": 26,
          "end": 30,
          "value": "hong",
          "entity": "country",
          "confidence": 0.5102069739663814,
          "extractor": "CRFEntityExtractor"
        },
        {
          "start": 31,
          "end": 35,
          "value": "kong",
          "entity": "country",
          "confidence": 0.7573467555695853,
          "extractor": "CRFEntityExtractor"
        }
      ],
      "intent_ranking": [
        {
          "name": "getAppDownloadsCount",
          "confidence": 0.9998836517333984

Ideally I should get "hong kong" as one single value. Can somebody help, or at least explain why this happens with lookup tables? CRF works fine with multi-word entities that are present in the training data, but when the values are only in the lookup table we get results like this.

What version of Rasa are you using? And what does your config file look like?

Rasa version: 1.7.0

# Configuration for Rasa NLU.
# https://rasa.com/docs/rasa/nlu/components/
language: en
pipeline: pretrained_embeddings_spacy
# Configuration for Rasa Core.
# https://rasa.com/docs/rasa/core/policies/
policies:
  - name: MemoizationPolicy
  - name: KerasPolicy
  - name: MappingPolicy

I just tried to reproduce the issue. For me everything seems to be working. Just to clarify, if you add the lookup table to your NLU data, Rasa predicts

"entities": [
        {
          "start": 26,
          "end": 30,
          "value": "hong",
          "entity": "country",
          "confidence": 0.5102069739663814,
          "extractor": "CRFEntityExtractor"
        },
        {
          "start": 31,
          "end": 35,
          "value": "kong",
          "entity": "country",
          "confidence": 0.7573467555695853,
          "extractor": "CRFEntityExtractor"
        }

But if you train without the lookup table, Rasa will combine the two entities into one?

Yes, but “hong kong” is not two different entities; it should come back as one single value.

I am assuming "hong kong" is not present in the training data, because if [hong kong] is in the training data it comes out as a single entity value. Only multi-word entities that exist solely in the lookup table give me this issue.

I was able to reproduce the error and created an issue for it (CRFEntityExtractor splits one entity into two · Issue #5377 · RasaHQ/rasa · GitHub).

It should not be related to the lookup tables. It seems to be related to the BILOU_flag. Can you try to train your bot with the following pipeline and check if it works? Thanks.

language: "en"

pipeline:
- name: "SpacyNLP"
- name: "SpacyTokenizer"
- name: "SpacyFeaturizer"
- name: "RegexFeaturizer"
- name: "CRFEntityExtractor"
  BILOU_flag: True
- name: "EntitySynonymMapper"
- name: "SklearnIntentClassifier"
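
For context on what this flag does: with BILOU_flag: True the CRF tags every token with a Beginning/Inside/Last/Unit/Outside prefix, so a correctly trained model should tag a two-word country as one contiguous span. A sketch of the intended tagging for the example sentence:

```
get  me  downloads  count  in  hong       kong
O    O   O          O      O   B-country  L-country
```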

Yes, I trained the bot with the mentioned pipeline and I get the same results.

get me downloads count in hong kong
{
  "intent": {
    "name": "getAppDownloadsCount",
    "confidence": 0.9592514339348922
  },
  "entities": [
    {
      "start": 26,
      "end": 30,
      "value": "hong",
      "entity": "country",
      "confidence": 0.7003851556965186,
      "extractor": "CRFEntityExtractor"
    },
    {
      "start": 31,
      "end": 35,
      "value": "kong",
      "entity": "country",
      "confidence": 0.7234106247445744,
      "extractor": "CRFEntityExtractor"
    }
  ],

Hello,

I had the same problem, but I solved it by adding nlu data with multiple words:

[…] [Costa Rica](country) […]
[…] [Dominican Republic](country) […]
[…] [Central African Republic](country) […]
[…] […] [Papua New Guinea](country) […]
[…] [Papua New Guinea](country) […]
...

Yes, I did the same thing. Basically, the training data should cover a few values of each type of variation present in the lookup table; that way it learns better. @Tanja

Hmm. Similar problem here. I did add several examples but not quite there.

Here’s what I have in nlu.md (listing only the relevant examples)

- list all [experts](expert_name) in the country of [South Africa]{"entity":"country", "value":"South Africa"} 
- list all [experts](expert_name) in the country of [New Zealand]{"entity":"country", "value":"New Zealand"}
- [United Kingdom](country)
- [United States](country)

My config.yml is as follows

    language: en
    pipeline:
      - name: SpacyNLP
      - name: DeSymbolizer
      - name: SpacyTokenizer
      - name: SpacyFeaturizer
      - name: RegexFeaturizer
      - name: LexicalSyntacticFeaturizer
      - name: CountVectorsFeaturizer
      - name: CountVectorsFeaturizer
        analyzer: "char_wb"
        min_ngram: 1
        max_ngram: 4
      - name: DIETClassifier
        epochs: 100
      - name: EntitySynonymMapper
      - name: ResponseSelector
        epochs: 100

    # Configuration for Rasa Core.
    # https://rasa.com/docs/rasa/core/policies/
    policies:
      - name: MemoizationPolicy
      - name: TEDPolicy
        max_history: 5
        epochs: 100
      - name: MappingPolicy

It still didn’t catch “El Salvador”; it extracted “Salvador” as the country name and dropped “El”. Same with “Marshall Islands”: it picked up only “Islands”. Maybe it recognizes “Marshall” as a person’s name and drops it?

Do I need to add more samples? I also have countries with “&”; it looks like I will need to add a few samples of those as well for it to work.

Yes, you need to cover all entities in the training data.
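
For example (the sentences here are made up for illustration), the failing multi-word values can be annotated directly in nlu.md so the CRF sees examples of each word-count pattern, including values containing “&”:

```md
## intent:getAppDownloadsCount
- get me downloads count in [El Salvador](country)
- list all [experts](expert_name) in the country of [Marshall Islands](country)
- downloads count for [Trinidad & Tobago](country)
```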

Hi, can I get some guidance please? I’m using the CRFEntityExtractor with BILOU_flag set to True, and my model seems to be splitting entities. I was under the impression that for compound entities to be treated as a single entity I need BILOU tagging, which seemed to be working fine for me until now. I recently switched from SklearnIntentClassifier to DIETClassifier (configured for intent classification only).

I know one solution is to add training phrases, and I’ve done that, but since I can’t possibly add all entity values in training, I need a better solution, ideally one at the config level.

@iszainab Can you please share some more details? What Rasa version are you using? What does your pipeline look like? And can you please give an example of an entity that is split? Thanks.

I have the same problem. The Rasa version I am using is 1.10.9, and the NLU pipeline is:

pipeline:
- name: "SpacyNLP"
  case_sensitive: False
- name: "SpacyTokenizer"
- name: "SpacyFeaturizer"
  pooling: "mean"
- name: "RegexFeaturizer"
  case_sensitive: False
- name: "CRFEntityExtractor"
  epochs: 300
- name: "LexicalSyntacticFeaturizer"
- name: "CountVectorsFeaturizer"
- name: "CountVectorsFeaturizer"
  analyzer: "char_wb"
  min_ngram: 3
  max_ngram: 5
- name: "DIETClassifier"
  entity_recognition: False
  epochs: 300
- name: "EntitySynonymMapper"

Just a thought regarding your case: could your issue have something to do with having two different entity extractors in your pipeline, “CRFEntityExtractor” and “DIETClassifier”?

I had an issue with that myself some time ago. (I am a newbie with Rasa) :slight_smile:

I don’t know… but I don’t think this is the problem, since I use DIET only to classify the intent (“entity_recognition” is set to False).

Ok. I thought it was worth a shot at least. :slight_smile:


I currently have the same problem. Were you able to solve it somehow? Thanks

Were you able to solve the problem? I’m also facing the same issue.