Pattern extraction problem with DucklingEntityExtractor

Dear all,

I trained my Rasa NLU model to extract entities using DIETClassifier. In addition, I am extracting dates using DucklingEntityExtractor. An example utterance is:

give me information about ABC_14 from 2021-10-01 to 2021-11-05.

In this utterance, my model predicts that ABC_14 is the entity (which is correct). However, DucklingEntityExtractor uses “14 from 2021-10-01 to 2021-11-05” to extract the dates, which obviously leads to wrong dates. So each time my entity ends with a number, the text span used to extract the dates is wrong (the DIETClassifier span overlaps with the DucklingEntityExtractor span).

Entity and dates are correctly extracted for an utterance like:

give me information about ABC from 2021-10-01 to 2021-11-05.

Any idea how I can solve this problem, please?

Weird… why would DIET overlap with Duckling? For Duckling you don’t need to tag training data.

Duckling can certainly return wrongly formatted results when it sees numbers next to dates, because it relies on regex patterns, but in your example it still shouldn’t get confused.

Can you provide the incorrect response you get in the logs? What does Duckling respond with?

Thank you for your answer. I am not sure it generated any logs; I can’t find any log files. I am testing it with “rasa shell nlu”. Here is part of the output. Is this what you mean?

give me information about A_10 from 2021-10-10 to 2021-11-01
{
  "text": "give me information about A_10 from 2021-10-10 to 2021-11-01",
  "intent": {
    "id": 3693568570979763547,
    "name": "getData",
    "confidence": 1.0
  },
  "entities": [
    {
      "entity": "physicalValue",
      "start": 0,
      "end": 2,
      "confidence_entity": 0.9998657703399658,
      "value": "information",
      "extractor": "DIETClassifier"
    },
    {
      "entity": "objName",
      "start": 10,
      "end": 14,
      "confidence_entity": 0.9999691247940063,
      "value": "A_10",
      "extractor": "DIETClassifier"
    },
    {
      "start": 12,
      "end": 44,
      "text": "10 from 2021-10-10 to 2021-11-01",
      "value": "2021-10-10T10:00:00.000+02:00",
      "confidence": 1.0,
      "additional_info": {
        "values": [
          {
            "value": "2021-10-10T10:00:00.000+02:00",
            "grain": "hour",
            "type": "value"
          },
          {
            "value": "2021-10-10T22:00:00.000+02:00",
            "grain": "hour",
            "type": "value"
          },
          {
            "value": "2021-10-11T10:00:00.000+02:00",
            "grain": "hour",
            "type": "value"
          }
        ],
        "value": "2021-10-10T10:00:00.000+02:00",
        "grain": "hour",
        "type": "value"
      },
      "entity": "time",
      "extractor": "DucklingEntityExtractor"
    }
  ],

I modified my pipeline by replacing SpacyTokenizer with WhitespaceTokenizer so that the token A_10 won’t be split. But nothing changed; the problem is the same.

Yeah, this is coming from Duckling. I tried just the Duckling server: the match starts at “10 from 2021-10-10”, which messes up the extraction.

It is just how Duckling interprets the tokens: its time rules pick up the 10 as a time with hour grain. So I guess you can raise this as an issue in the Duckling repository, and they may suggest some changes you can make to the tokenisation, but I am not sure that is something you can do from Rasa.

Another option is simply to create a custom component that connects to Duckling but masks the entities already extracted by DIET. You can do so by placing your custom component after DIET (similar to the entity synonym mapper). Since DIET has already output the entities and their indices in the sentence, you can mask those before sending the remainder of the input to Duckling. This way you would avoid overlaps on product information.

This is not a guarantee, though. If a user sends a sentence where DIET does not tag the identifier, such as “Give me information about 10 from 2021-10-10 to 2021-11-01”, the extraction would still fail, because DIET did not pick up the 10 the way it picks up A10, so nothing gets masked.

Either way, you still need to do this in a custom component.
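
The masking step itself could look roughly like the sketch below. It is only an illustration, not a complete Rasa component: the function name `mask_diet_entities` is made up, and it assumes each entity dict carries character offsets in `start` and `end` plus an `extractor` key, as in the `rasa shell nlu` output above. Each masked character is replaced by a space so that the string keeps its length and any offsets Duckling returns still line up with the original message.

```python
from typing import Any, Dict, List


def mask_diet_entities(
    text: str,
    entities: List[Dict[str, Any]],
    extractor: str = "DIETClassifier",
    mask_char: str = " ",
) -> str:
    """Blank out spans already claimed by `extractor` (hypothetical helper).

    Assumes `start`/`end` are character offsets into `text` (end exclusive).
    The result has the same length as `text`, so offsets stay comparable.
    """
    chars = list(text)
    for entity in entities:
        if entity.get("extractor") != extractor:
            continue  # leave spans from other extractors untouched
        start = max(int(entity["start"]), 0)
        end = min(int(entity["end"]), len(chars))
        for i in range(start, end):
            chars[i] = mask_char
    return "".join(chars)


message = "give me information about A_10 from 2021-10-10 to 2021-11-01"
diet_entities = [
    {"entity": "objName", "start": 26, "end": 30, "value": "A_10",
     "extractor": "DIETClassifier"},
]
masked = mask_diet_entities(message, diet_entities)
print(masked)
```

In a custom component placed after DIET, you would call something like this on the message text and send `masked` to Duckling instead of the raw text.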

Many thanks for your time. So you don’t think the problem may be caused by bad tokenization? We have the same problem even with the message “give me information about A10 from 2021-10-10 to 2021-11-01” (in A10, the number 10 is not separated from the letter by a space).

Rasa’s tokenisation has no effect on Duckling. The Duckling extractor implemented in Rasa sends the full sentence to Duckling, so the tokenisation is done on Duckling’s side before the regex is applied. I am not sure whether there is a configuration option to adjust tokenisation in Duckling itself.
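
To see this for yourself, you can send the raw sentence to the Duckling server directly, bypassing Rasa entirely. Here is a minimal sketch using only the Python standard library. The helper names, the `en_US` locale and the `http://localhost:8000/parse` URL are assumptions for a locally running Duckling server; adjust them to your setup.

```python
import json
import urllib.parse
import urllib.request
from typing import Any, Dict, List, Sequence


def build_duckling_payload(
    text: str, locale: str = "en_US", dims: Sequence[str] = ("time",)
) -> bytes:
    """Form-encode the whole, untokenised sentence for Duckling's /parse endpoint."""
    fields = {"text": text, "locale": locale, "dims": json.dumps(list(dims))}
    return urllib.parse.urlencode(fields).encode("utf-8")


def query_duckling(
    text: str, url: str = "http://localhost:8000/parse", timeout: float = 3.0
) -> List[Dict[str, Any]]:
    """POST the sentence to a running Duckling server (URL is an assumption)."""
    request = urllib.request.Request(
        url, data=build_duckling_payload(text), method="POST"
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


payload = build_duckling_payload(
    "give me information about A10 from 2021-10-10 to 2021-11-01"
)
print(payload.decode("utf-8"))
```

Calling `query_duckling(...)` with the same sentence (with the server running) should show which span Duckling matched, independently of the tokenizer configured in the Rasa pipeline.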

Hi @souvikg10, can you please let us know how to create a custom component that connects to Duckling but masks entities? Any sample would be a great help.