New to RASA 2 - entity extraction for large lists

Hello,

I am new to Rasa, working with 2.x and I have a question in relation to entities.

I have a simple list of specific cities (around 200) I want to use for entity lookup and validate within a form. Synonyms seem to be a reasonable approach as these normalize to my values, but do I need to create intent training examples for all my values? For example, if I have 10 intent variations, do I multiply these by the 200 cities, which in essence means 2000 training examples?

Is there a better approach? I am using the default config i.e. DIET classifier.

Any advice very welcome!

Thanks

Hey @pomegran, welcome to the forum :slight_smile:

Before we go down the rabbit hole that is creating 2000 training examples: Have you considered using RegexEntityExtractor for your city entity? I think it’s meant exactly for cases like yours.

Hi Sam,

Agreed, creating thousands of examples is a terrible idea!

I went down the “lookup” route which is a case-insensitive RegEx route (according to the docs). I would like to ask though if this is expected behaviour …

  1. Created a simple intent marked up with a city entity
  2. Created a list of cities in a lookup

- intent: what_is_weather_like
  examples: |
- lookup: city
  examples: |
    - berlin
    - london
    - stockholm
    - barcelona
    - rome
    - madrid
    - worcester
    - gloucester
    - birmingham
    - cheltenham
    - chelters
    - chelts
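
A lookup list like the one above behaves roughly like a single case-insensitive alternation regex. The following is a minimal sketch of that idea in plain Python, not Rasa's actual implementation; the function names `build_lookup_pattern` and `find_cities` are made up for illustration, and the city list is shortened.

```python
import re

# A few lookup entries from the post; a real list would hold ~200 cities.
CITIES = ["berlin", "london", "cheltenham", "chelters", "chelts"]


def build_lookup_pattern(entries):
    """Compile one case-insensitive regex that matches any whole-word entry."""
    # Longest entries first, so a longer entry is tried before a shorter one.
    ordered = sorted(entries, key=len, reverse=True)
    alternation = "|".join(re.escape(entry) for entry in ordered)
    return re.compile(r"\b(" + alternation + r")\b", re.IGNORECASE)


def find_cities(text, pattern):
    """Return every match as a dict shaped like the entities in the output."""
    return [
        {"entity": "city", "start": m.start(), "end": m.end(), "value": m.group(0)}
        for m in pattern.finditer(text)
    ]


PATTERN = build_lookup_pattern(CITIES)
matches = find_cities("Berlin whats the weather like in cheltenham", PATTERN)
for match in matches:
    print(match)
```

Note that a pattern like this reports every city it finds, regardless of where it sits in the sentence, which is consistent with both cities being returned by the regex extractor.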

This works ok but I can’t use synonyms to normalize, so I do that in my validation_city method instead for the slot. This works fine (with a little bit of work but it works great!).
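
The normalization step described above could look something like the sketch below. The mapping `CANONICAL_CITY`, the set `KNOWN_CITIES` and the helper `normalize_city` are hypothetical names, and the data is a shortened version of the lookup list. In a Rasa SDK form validation class, a slot validation method could call such a helper and return the normalized value for the slot, or `None` to have the form ask again; that wiring is assumed here and not shown.

```python
# Hypothetical mapping from informal variants to the canonical city name.
CANONICAL_CITY = {
    "chelters": "cheltenham",
    "chelts": "cheltenham",
}

# Canonical values the form should accept (shortened for the example).
KNOWN_CITIES = {"berlin", "london", "stockholm", "cheltenham"}


def normalize_city(raw_value):
    """Return the canonical city name, or None if the value is not recognised."""
    if not isinstance(raw_value, str):
        return None
    key = raw_value.strip().lower()
    # Map a variant to its canonical form; canonical names map to themselves.
    key = CANONICAL_CITY.get(key, key)
    return key if key in KNOWN_CITIES else None


print(normalize_city("Chelts"))
print(normalize_city("paris"))
```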

However, even though I have trained the entity city to appear late on in my intent examples, I get this response for the input “berlin whats the weather like in cheltenham”

{
  "text": "berlin whats the weather like in cheltenham",
  "intent": {
    "id": 8382829466045491448,
    "name": "what_is_weather_like",
    "confidence": 0.983062207698822
  },
  "entities": [
    {
      "entity": "city",
      "start": 0,
      "end": 6,
      "value": "berlin",
      "extractor": "RegexEntityExtractor"
    },
    {
      "entity": "city",
      "start": 33,
      "end": 43,
      "value": "cheltenham",
      "extractor": "RegexEntityExtractor"
    },
    {
      "entity": "city",
      "start": 0,
      "end": 6,
      "confidence_entity": 0.9866355061531067,
      "value": "berlin",
      "extractor": "DIETClassifier"
    }
  ]
}

Note how the DIETClassifier picks up “berlin” rather than “cheltenham”. When using lookups, is there any point in annotating entities in the intent data? I would have expected the classifier to pick up the second city, not the first (based on my training examples).

Can you advise? Apologies if these questions seem rudimentary, but intent classification and accurate entity extraction are key to what I am working on.

Thanks Sam!

Just so I understand: Is it important for you to actually extract the city entity with DIETClassifier? If it’s enough to just extract it with RegexEntityExtractor, then I wouldn’t bother annotating it in many intent examples (the docs say 2 examples are enough for the regex extractor), and you should be able to safely ignore any mistakes that DIET makes regarding this entity. Note: If this is your only entity, you could even turn off entity extraction in DIET altogether.

Hi Sam,

I was hoping the DIET classifier would use the examples for position of entity and apply this to the lookup allowing me to use confidence as a way of predicting the best entity to use. I am coming around to your thinking - I don’t think the DIETClassifier is reliable enough to do that. In this example I will just go with the lookup which is predictable.

Interestingly, I tried entity roles (a simple “from”/“to” example for destinations and origins) and even found those to be unreliable. I’ll post something separate on that, Sam.

Thanks for your guidance!

Hi Sam,

Just so you know I switched to the CRFEntityExtractor which was more reliable. This now picks out the correct entity for ambiguous inputs as per below (for the input “berlin whats the weather like in cheltenham”):

{
  "text": "berlin whats the weather like in cheltenham",
  "intent": {
    "id": -7706887359159972805,
    "name": "what_is_weather_like",
    "confidence": 0.9988484382629395
  },
  "entities": [
    {
      "entity": "city",
      "start": 0,
      "end": 6,
      "value": "berlin",
      "extractor": "RegexEntityExtractor"
    },
    {
      "entity": "city",
      "start": 33,
      "end": 43,
      "value": "cheltenham",
      "extractor": "RegexEntityExtractor"
    },
    {
      "entity": "city",
      "start": 33,
      "end": 43,
      "confidence_entity": 0.9007704465867381,
      "value": "cheltenham",
      "extractor": "CRFEntityExtractor"
    }
  ]
}
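
When several extractors report the same entity, as in the output above, the calling code has to decide which one to trust. Here is a minimal sketch of one possible policy, assuming the entity dicts have the shape shown in the output; the helper `pick_city` and its preference order are illustrative choices, not part of Rasa.

```python
def pick_city(entities, preferred=("CRFEntityExtractor", "RegexEntityExtractor")):
    """Pick one city entity, trying extractors in order of preference."""
    cities = [e for e in entities if e.get("entity") == "city"]
    for extractor in preferred:
        candidates = [e for e in cities if e.get("extractor") == extractor]
        if candidates:
            # Highest confidence wins; entities without a confidence count as 0.0,
            # in which case the first candidate found is returned.
            return max(candidates, key=lambda e: e.get("confidence_entity", 0.0))
    return None


entities = [
    {"entity": "city", "start": 0, "end": 6, "value": "berlin",
     "extractor": "RegexEntityExtractor"},
    {"entity": "city", "start": 33, "end": 43, "value": "cheltenham",
     "extractor": "RegexEntityExtractor"},
    {"entity": "city", "start": 33, "end": 43, "value": "cheltenham",
     "confidence_entity": 0.9007704465867381, "extractor": "CRFEntityExtractor"},
]

best = pick_city(entities)
print(best["value"], best["extractor"])
```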

As soon as I added entity roles into this it did not work. I know this is experimental but is the entity/role/group functionality only available in DIET?

Thanks Mark

> I was hoping the DIET classifier would use the examples for position of entity and apply this to the lookup

Well, DIET doesn’t interact with lookups in any way. It only learns from the actual intent examples…

> Interestingly I tried entity roles (a simple “from”/“to” example for destinations and origins) and even found those to be unreliable

Roles/groups require a lot of training examples to be learned reliably. It’s sad, but that’s how it is (and this applies to many neural network models).

> is the entity/role/group functionality only available in DIET?

No, it’s available both in DIET and in the CRF extractor. Under the hood, DIET also uses a CRF layer for entity extraction. By the way, I think CRF alone could be easier to train with fewer examples (since DIET also trains transformer layers that precede the CRF). However, I wouldn’t rely on a CRF trained on so few examples to perform well, especially if it also predicts roles/groups. Sometimes the model really just needs more data to learn from and there’s no good way around it :wink:

Thanks Sam.

I’m not too bothered about using lookups. I am more concerned with extracting the position of the entity in an input - I can match it up once the value is extracted (as I have done already).

I think what I’m understanding is that DIET will just need a load more examples to be accurate. Am I understanding correctly? Otherwise, in your experience, if accurate position of entity is important, is DIET the best to go with or something else?

Many thanks

> DIET will just need a load more examples to be accurate

Yes, I think so. Of course, you’ll also need a decent number of training epochs, but the default value (300) should often do the job.
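
One way to produce more annotated examples without writing each by hand is to fill a few sentence templates with a sample of cities. The sketch below is only an illustration: the templates, the shortened city list and the helper `annotated_examples` are hypothetical, and the output uses the `[value](entity)` annotation style of Rasa training data.

```python
import random

# Hypothetical sentence templates; {city} marks where the entity goes.
TEMPLATES = [
    "what is the weather like in {city}",
    "whats the weather in {city} today",
    "is it raining in {city}",
]

CITIES = ["berlin", "london", "stockholm", "barcelona", "rome", "madrid"]


def annotated_examples(templates, cities, per_template, seed=0):
    """Fill each template with a random sample of cities, annotated as entities."""
    rng = random.Random(seed)  # fixed seed keeps the generated data reproducible
    lines = []
    for template in templates:
        for city in rng.sample(cities, min(per_template, len(cities))):
            lines.append("- " + template.format(city=f"[{city}](city)"))
    return lines


examples = annotated_examples(TEMPLATES, CITIES, per_template=2)
print("\n".join(examples))
```

Sampling a handful of cities per template, rather than the full cross product, keeps the data set small; heavily templated data can still make a model overfit to the template wording, so it is probably best mixed with hand-written examples.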

As for determining the position of the entity, CRF (whether inside DIET or on its own) extracts entities together with their position, see the output that looks something like:

{
      "entity": "city",
      "start": 33,
      "end": 43,
      ...

The model learns to extract entities at the correct positions and can’t really achieve one without the other :slightly_smiling_face:
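
As a small illustration of how `start` and `end` relate to the input, the offsets are character positions into the text, with `end` exclusive, so slicing the text recovers the value. The sketch below uses the input and offsets from this thread; the helper `last_entity` is a hypothetical example of using the position to choose between entities.

```python
text = "berlin whats the weather like in cheltenham"

entities = [
    {"entity": "city", "start": 0, "end": 6, "value": "berlin"},
    {"entity": "city", "start": 33, "end": 43, "value": "cheltenham"},
]

# Slicing with start/end gives back exactly the extracted value.
spans = [text[e["start"]:e["end"]] for e in entities]
print(spans)


def last_entity(entity_list, name):
    """Return the entity of the given type that occurs latest in the text."""
    matching = [e for e in entity_list if e["entity"] == name]
    return max(matching, key=lambda e: e["start"]) if matching else None


print(last_entity(entities, "city")["value"])
```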

@pomegran I now realised I was wrong when I said:

> DIET doesn’t interact with lookups in any way

Apologies, I’ll try to correct myself: DIET can (indirectly) take advantage of lookup tables, even though you still have to provide many training examples. (More in the docs on regexes and lookup tables for entity extraction.) If I understand it correctly, when you use a lookup table, a RegexFeaturizer will automatically pick up any of the lookup table entries (here city names) when they occur in intent examples. The featurizer’s outputs (features) will be consumed by DIET. However, for DIET to learn to take advantage of these features and use them to predict entities, you still have to annotate the entity in many intent examples.
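
To make the idea of lookup-based features concrete, here is a heavily simplified sketch. It is not how RegexFeaturizer is implemented; it only shows the kind of per-token signal involved, and the function `lookup_features` is a made-up name.

```python
def lookup_features(text, lookup):
    """Mark each whitespace token with 1 if it appears in the lookup, else 0."""
    entries = {entry.lower() for entry in lookup}
    return [(token, 1 if token.lower() in entries else 0) for token in text.split()]


CITY_LOOKUP = ["berlin", "london", "cheltenham"]
features = lookup_features("berlin whats the weather like in cheltenham", CITY_LOOKUP)
for token, flag in features:
    print(token, flag)
```

Both “berlin” and “cheltenham” receive the same flag here, so a feature like this cannot tell the model which of the two is the entity of interest in a given sentence. That distinction has to come from the annotated examples.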

Hi Sam,

I have made some good progress so let me update you (as it may be useful for others too!)

  • I am using the DIET Classifier, disabling its entity recognition and enabling CRFEntityExtractor
  • Increased training examples and also created another intent with around 20 or so examples using “from” and “to” roles for origins and destinations using the same entity (city)
  • Using a lookup list for city
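
For the “from”/“to” setup mentioned in the list above, the extracted entities can be split by role once they come back from the model. This is a minimal sketch assuming each entity dict carries a `role` key when a role was predicted; the sample data and the helper `cities_by_role` are hypothetical.

```python
def cities_by_role(entities):
    """Map each role to the first city value extracted with that role."""
    result = {}
    for entity in entities:
        if entity.get("entity") == "city" and "role" in entity:
            # setdefault keeps the first value seen for a role.
            result.setdefault(entity["role"], entity["value"])
    return result


# Hypothetical extraction result for an origin/destination style input.
sample = [
    {"entity": "city", "role": "from", "value": "berlin"},
    {"entity": "city", "role": "to", "value": "cheltenham"},
    {"entity": "city", "value": "rome"},  # no role predicted, so it is skipped
]

print(cities_by_role(sample))
```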

At the moment, it seems to work really well, i.e. it is far better at predicting the correct entity in an input. I am avoiding synonyms and adding the variants to my lookup list instead. I am coding the normalization piece using information from the “tracker” object, which gives me more flexibility.

Thanks for your help thus far!

@pomegran could you please share the code or a GitHub repo showing what you have done? I am new to the forum and trying something similar. Thanks