Pattern extraction problem with DucklingEntityExtractor

ali_ch · November 30, 2021, 9:58am

Dear all,

I trained my Rasa NLU model to extract entities using DIETClassifier. In addition, I am extracting dates using DucklingEntityExtractor. An example utterance is:

give me information about ABC_14 from 2021-10-01 to 2021-11-05.

In this utterance, my model predicts that ABC_14 is the entity (which is correct). However, DucklingEntityExtractor uses “14 from 2021-10-01 to 2021-11-05” to extract the dates, which obviously leads to wrong dates. So whenever my entity ends with a number, the span used to extract the dates is wrong (the DIETClassifier entity overlaps with the DucklingEntityExtractor one).

The entity and dates are extracted correctly for an utterance like:

give me information about ABC from 2021-10-01 to 2021-11-05.

Any idea how I can solve this problem, please?

souvikg10 · November 30, 2021, 11:31am

Weird… why would DIET overlap with Duckling? For Duckling you don’t need to tag training data.

Duckling can certainly return wrong results when it sees numbers next to dates, because it relies on regex patterns, but in your example it still shouldn’t get confused.

Can you provide the incorrect response you get in the logs? What does Duckling respond?

ali_ch · November 30, 2021, 11:57am

Thank you for your answer. I am not sure it generated any logs … I can’t find any log files. I am testing it with “rasa shell nlu”. Here is part of the output; is this what you mean?

give me information about A_10 from 2021-10-10 to 2021-11-01
{
  "text": "give me information about A_10 from 2021-10-10 to 2021-11-01",
  "intent": {
    "id": 3693568570979763547,
    "name": "getData",
    "confidence": 1.0
  },
  "entities": [
    {
      "entity": "physicalValue",
      "start": 0,
      "end": 2,
      "confidence_entity": 0.9998657703399658,
      "value": "information",
      "extractor": "DIETClassifier"
    },
    {
      "entity": "objName",
      "start": 10,
      "end": 14,
      "confidence_entity": 0.9999691247940063,
      "value": "A_10",
      "extractor": "DIETClassifier"
    },
    {
      "start": 12,
      "end": 44,
      "text": "10 from 2021-10-10 to 2021-11-01",
      "value": "2021-10-10T10:00:00.000+02:00",
      "confidence": 1.0,
      "additional_info": {
        "values": [
          {
            "value": "2021-10-10T10:00:00.000+02:00",
            "grain": "hour",
            "type": "value"
          },
          {
            "value": "2021-10-10T22:00:00.000+02:00",
            "grain": "hour",
            "type": "value"
          },
          {
            "value": "2021-10-11T10:00:00.000+02:00",
            "grain": "hour",
            "type": "value"
          }
        ],
        "value": "2021-10-10T10:00:00.000+02:00",
        "grain": "hour",
        "type": "value"
      },
      "entity": "time",
      "extractor": "DucklingEntityExtractor"
    }
  ],

I modified my pipeline by replacing SpacyTokenizer with WhitespaceTokenizer so that the token A_10 won’t be split. But nothing changed; the problem is the same.

souvikg10 · November 30, 2021, 2:08pm

Yeah, this is coming from Duckling. I tried just the Duckling server: the match starts at “10 from 2021-10-10”, which messes up the extraction.

It is just how Duckling interprets the tokens: the time rules pick up the 10 as an hour-grain value. So I guess you can raise this as an issue in the Duckling repository, and they may suggest some changes you can make to the tokenisation, but I am not sure that is something you can do from Rasa.

Another option is simply to create a custom component that connects to Duckling but masks the entities already extracted by DIET. You can do so by placing your custom component after DIET (similar to the entity synonym mapper). Since DIET has already produced the entities and their indices in the sentence, you can mask those before sending the remainder of the input to Duckling. This way you avoid overlaps on product information.

This is no guarantee, though: a sentence such as “Give me information about 10 from 2021-10-10 to 2021-11-01” would still fail if DIET doesn’t pick up 10 (instead of A10) as the entity.

Either way, you still need to do this in a custom component.
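
The masking step could look roughly like the sketch below. It is not a full Rasa custom component: `mask_entities` is a hypothetical helper, and it assumes each DIET entity carries character offsets under `start` and `end`. It overwrites each span with spaces so the text keeps its length, which means any offsets Duckling returns for the masked text still point at the right characters in the original message. Whether Duckling then parses the date range correctly is something to check against your own server.

```python
from typing import Any, Dict, List


def mask_entities(text: str, entities: List[Dict[str, Any]], filler: str = " ") -> str:
    """Overwrite already-extracted entity spans, keeping the text length.

    Hypothetical helper: assumes "start"/"end" are character offsets
    into `text`, with "end" exclusive.
    """
    if len(filler) != 1:
        raise ValueError("filler must be a single character")
    chars = list(text)
    for entity in entities:
        # Clamp the span so malformed offsets cannot run outside the text.
        start = max(0, entity["start"])
        end = min(len(chars), entity["end"])
        for i in range(start, end):
            chars[i] = filler
    return "".join(chars)


message = "give me information about A_10 from 2021-10-10 to 2021-11-01"

# Offsets are computed here for illustration; in a pipeline they would
# come from the entities DIET has already attached to the message.
obj_start = message.index("A_10")
diet_entities = [
    {"entity": "objName", "start": obj_start, "end": obj_start + len("A_10"), "value": "A_10"}
]

masked = mask_entities(message, diet_entities)
```

Inside a custom component you would apply this to the message text using the entities DIET has already set, then send `masked` to Duckling instead of the raw text.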

ali_ch · November 30, 2021, 2:50pm

Many thanks for your time. So you don’t think the problem may be caused by bad tokenization? We have the same problem even with the message “give me information about A10 from 2021-10-10 to 2021-11-01” (in A10, the number 10 is not separated by a space).

souvikg10 · November 30, 2021, 3:03pm

Rasa’s tokenisation has no effect on Duckling. The Duckling extractor implemented in Rasa sends the full sentence to Duckling, so tokenisation happens on Duckling’s side before the regexes are applied. I am not sure whether Duckling itself has a configuration option to adjust tokenisation.
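
To see what Duckling receives, you can reproduce the request outside Rasa. The sketch below assumes a Duckling server on `http://localhost:8000` and uses its `/parse` endpoint with form-encoded `text`, `locale` and `dims` fields; `build_parse_request` and `parse_with_duckling` are names made up for this example. The whole sentence goes in as one string, with no Rasa tokens involved.

```python
import json
import urllib.parse
import urllib.request
from typing import Any, List, Sequence

# Assumption: a Duckling server is running locally on port 8000.
DUCKLING_URL = "http://localhost:8000"


def build_parse_request(
    text: str,
    url: str = DUCKLING_URL,
    locale: str = "en_US",
    dims: Sequence[str] = ("time",),
) -> urllib.request.Request:
    """Build the POST request for Duckling's /parse endpoint."""
    payload = {
        "text": text,  # the raw sentence, untokenised
        "locale": locale,
        "dims": json.dumps(list(dims)),  # dimensions are sent as a JSON array string
    }
    body = urllib.parse.urlencode(payload).encode("utf-8")
    return urllib.request.Request(
        url.rstrip("/") + "/parse",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded; charset=UTF-8"},
        method="POST",
    )


def parse_with_duckling(text: str, **kwargs: Any) -> List[Any]:
    """Send the text to Duckling and return the decoded JSON response."""
    request = build_parse_request(text, **kwargs)
    with urllib.request.urlopen(request, timeout=5) as response:
        return json.loads(response.read().decode("utf-8"))


sentence = "give me information about A_10 from 2021-10-10 to 2021-11-01"
request = build_parse_request(sentence)
# With a server running: results = parse_with_duckling(sentence)
```

Calling `parse_with_duckling` on the original sentence and on a variant without the trailing number is a quick way to confirm that the behaviour comes from Duckling and not from the Rasa pipeline.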

faazeez · February 24, 2023, 5:34am

Hi @souvikg10, can you please let us know how to create a custom component that connects to Duckling but masks entities? Any sample would be a great help.
