Lookup table doesn't work for RegexEntityExtractor

Hi Rasa Community,

I use a lookup table full of uppercase words with RegexEntityExtractor. The NLU Training Data docs say that lookup tables are case insensitive. However, after cross-validation it turns out that my RegexEntityExtractor cannot identify all of those entities when they appear in lowercase. Is the lookup table case sensitive after all? I’ve defined a couple of training examples for the entities in the lookup table and set case_sensitive of RegexFeaturizer to False.
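To make the expected behaviour concrete, here is a minimal sketch of a case-insensitive lookup match using only Python's standard `re` module. It is not Rasa's actual implementation; the helper `build_lookup_pattern` and the entries `BERLIN` and `HAMBURG` are made up for illustration.

```python
import re

def build_lookup_pattern(entries, case_sensitive=False):
    """Compile one word-bounded alternation from lookup entries (hypothetical helper)."""
    # Longer entries first so a multi-word entry is tried before its prefix.
    ordered = sorted(entries, key=len, reverse=True)
    alternation = "|".join(re.escape(entry) for entry in ordered)
    flags = 0 if case_sensitive else re.IGNORECASE
    return re.compile(rf"\b(?:{alternation})\b", flags)

# Hypothetical uppercase-only lookup entries and a lowercase user message.
entries = ["BERLIN", "HAMBURG"]
text = "ich fliege morgen nach berlin"

insensitive = build_lookup_pattern(entries, case_sensitive=False)
sensitive = build_lookup_pattern(entries, case_sensitive=True)

insensitive_hits = [m.group(0) for m in insensitive.finditer(text)]
sensitive_hits = [m.group(0) for m in sensitive.finditer(text)]

print(insensitive_hits)  # ['berlin']
print(sensitive_hits)    # []
```

If the extractor behaved like the case-insensitive pattern, lowercase mentions would be found even though the table only contains uppercase entries.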

This problem has confused me for a while. Can someone help me solve it?

This is what my pipeline looks like:

language: de

pipeline:
   - name: WhitespaceTokenizer
   - name: RegexFeaturizer
     case_sensitive: False
   - name: RegexEntityExtractor
     case_sensitive: False
   - name: LexicalSyntacticFeaturizer
   - name: CountVectorsFeaturizer
   - name: CountVectorsFeaturizer
     analyzer: char_wb
     min_ngram: 1
     max_ngram: 4
   - name: "DucklingEntityExtractor"
     url: "http://localhost:8000"
     dimensions: ["time", "number", "amount-of-money"]
     locale: "de_DE"
     timezone: "Europe/Berlin"
   - name: "CRFEntityExtractor"
   - name: DIETClassifier
     epochs: 100
   - name: EntitySynonymMapper
   - name: ResponseSelector
     epochs: 100
   - name: FallbackClassifier
     threshold: 0.75
     ambiguity_threshold: 0.1

thanks and regards

Hey, I’m facing the same issue. Did you find out the reason for this? Or how did you fix it?

@riya.shah Heya! So you are not able to fetch the data using the lookup table, right?

Hey @nik202 Yes. I have a lookup table like this:

- lookup: insurance_provider
  examples: |
    - HDFC ERGO
    - HDFC
    - Tata AIG
    - Tata
    - ICICI
    - ICICI Lombard

And it is not able to extract the entity. This is my DIETClassifier errors.json:

  {
    "text": "i think it was icici",
    "entities": [
      {
        "start": 15,
        "end": 20,
        "value": "ICICI",
        "entity": "insurance_provider"
      }
    ],
    "predicted_entities": []
  },
  {
    "text": "umm from tata",
    "entities": [
      {
        "start": 9,
        "end": 13,
        "value": "Tata",
        "entity": "insurance_provider"
      }
    ],
    "predicted_entities": []
  }
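A quick way to sanity-check a report like this is to confirm that the annotated spans point at the expected text and that the values are in the lookup table once case is ignored. The sketch below does this in plain Python; `check_span` and the variable names are illustrative, and it says nothing about how DIETClassifier uses the table internally.

```python
# Lookup entries and gold annotations copied from the post above.
lookup = ["HDFC ERGO", "HDFC", "Tata AIG", "Tata", "ICICI", "ICICI Lombard"]
errors = [
    {"text": "i think it was icici", "start": 15, "end": 20, "value": "ICICI"},
    {"text": "umm from tata", "start": 9, "end": 13, "value": "Tata"},
]

# Case-folded view of the lookup table for case-insensitive membership tests.
folded_lookup = {entry.casefold() for entry in lookup}

def check_span(example):
    """Return the annotated surface text and whether it is a lookup entry (ignoring case)."""
    surface = example["text"][example["start"]:example["end"]]
    return surface, surface.casefold() in folded_lookup

results = [check_span(example) for example in errors]
for surface, in_lookup in results:
    print(surface, in_lookup)
```

Both spans slice out the expected words and both are in the table when case is ignored, so the annotations themselves do not appear to be the problem.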

@riya.shah What issue are you getting? Please share some other supporting files.

Hey, I am trying to extract the name of insurance companies from the user’s message.

This is the config file…

# Configuration for Rasa NLU.
# https://rasa.com/docs/rasa/nlu/components/
language: en

pipeline:
  - name: custom_nlu_components.CustomTranslator.CustomTranslator
    #    Required: translate_url
    translate_url: <translate_url>
    "source_language": auto
    "target_language": en
  - name: ConveRTTokenizer
  - name: ConveRTFeaturizer
    model_url: <model_url>
  - name: RegexFeaturizer
    case_sensitive: False
  - name: LexicalSyntacticFeaturizer
  - name: DucklingEntityExtractor
    url: "http://localhost:8000"
    dimensions: [ "time", "duration", "number" ]
    timeout: 5
    timezone: "Asia/Kolkata"
    locale: "en_IN"
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: char_wb
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 100
    random_seed: 73
    model_confidence: linear_norm
    constrain_similarities: True
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 100
    random_seed: 73
    model_confidence: linear_norm
    constrain_similarities: True
  - name: FallbackClassifier
    threshold: 0.45

# Configuration for Rasa Core.
# https://rasa.com/docs/rasa/core/policies/
policies:
  - name: MemoizationPolicy
  - name: TEDPolicy
    max_history: 5
    epochs: 100
    random_seed: 73
    model_confidence: linear_norm
    constrain_similarities: True
  - name: RulePolicy

These are some of the training examples:

  - intent: in_share_vendor_details
    examples: |
      - Insurance agent from [ICICI](insurance_provider) had called me
      - its from [LIC](insurance_provider)
      - i can't recall , probably through [icici](insurance_provider)
      - I bought a policy from [Tata](insurance_provider)
      - [HDFC Ergo](insurance_provider)

These are test examples:

  - intent: in_share_vendor_details
    examples: |
      - from [Tata AIG](insurance_provider)
      - I have renewed from [hdfc](insurance_provider) life insurance
      - I think it was [ICICI](insurance_provider)
      - i bought it from [tata aig](insurance_provider)
      - Umm from [Tata](insurance_provider)

These examples are present in the lookup table as well. When I run rasa test, entity extraction fails for a lot of examples. Let me know if you need any other files.

@riya.shah If your problem still persists, I will try to run your code. In the meantime, please watch this and follow along: https://youtu.be/gvyfQZMnHPY

Hey, I watched the video. I have implemented it the same way. What I observed is that if I keep the sentence structure of the test example the same as in training, then entity extraction works. For example, with this training example:

i renewed it from royal sundaram

if I use the example below for testing, it works:

i renewed it from tata aig

But if the test example has a different sentence structure that was not seen in training, it fails.
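One possible explanation, offered as an assumption rather than a confirmed diagnosis: in the config shared above, the lookup table is only consumed by RegexFeaturizer, which turns matches into features for DIETClassifier, so extraction still depends on contexts the model has learned. RegexEntityExtractor, by contrast, extracts lookup matches directly. The sketch below mirrors part of that pipeline as a plain Python list and reports which of the two roles is present; `lookup_roles` is a hypothetical helper, not a Rasa API.

```python
# A subset of the component names from the config posted earlier (options omitted).
pipeline = [
    {"name": "ConveRTTokenizer"},
    {"name": "ConveRTFeaturizer"},
    {"name": "RegexFeaturizer"},
    {"name": "LexicalSyntacticFeaturizer"},
    {"name": "DucklingEntityExtractor"},
    {"name": "CountVectorsFeaturizer"},
    {"name": "DIETClassifier"},
    {"name": "EntitySynonymMapper"},
]

def lookup_roles(components):
    """Report how lookup tables can be used by this pipeline (hypothetical helper)."""
    names = {component["name"] for component in components}
    return {
        # Lookup matches become features that a learned extractor may or may not rely on.
        "features_for_model": "RegexFeaturizer" in names,
        # Lookup matches are emitted as entities directly, whatever the surrounding words.
        "direct_extraction": "RegexEntityExtractor" in names,
    }

roles = lookup_roles(pipeline)
print(roles)  # {'features_for_model': True, 'direct_extraction': False}
```

If direct extraction is what is wanted, adding RegexEntityExtractor to the pipeline may be worth trying; whether it resolves the failures here is untested.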

@riya.shah Well, lookup tables are basically meant for short, single-word entries such as locations, cities, products, etc. The model basically trains only on whatever you list there. When I used a lookup table, values outside of it did not work for me either, so I no longer use one. I hope this helps. Further reading: NLU Training Data

Hey, it is getting classified to the right intent though. Also, the test examples do have single word entities, and it is failing for single words as well.

@riya.shah Any screenshot?

This is the DIETClassifier error report and the lookup table in use.

As you can see, predicted_entities is [], even though I have given such examples in the lookup table. My question is: even if I don’t have any training example of the form “from xyz”, if the model is able to classify the message to the right intent, shouldn’t it be able to extract the entity as well?

@riya.shah Because it will only recognise the intent as you mention it in the lookup table.