Misaligned entity annotation for '01/03' in sentence (...)

bayesianwannabe · March 25, 2020, 11:59am

What are my options if I have defined two entities (DD and MM) and need to capture the following values: DD: 01, MM: 03?

My entity data is annotated as shown above, but it seems that I need separate tokens for ‘01’ and ‘03’ while keeping their positions. What is the best modification I can make if I still want to use the language model from spaCy?

For my current data, I am getting the following warning from rasa train nlu: “Misaligned entity annotation for ‘01/03’ in sentence ‘I made the purchase 01/03 and still didn’t receive it’”
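For readers following along, here is a minimal sketch of why this warning appears. It assumes the check boils down to "every entity must start and end on a token boundary", and it uses a plain whitespace tokenizer as a stand-in for the spaCy one (both keep ‘01/03’ as a single token here). The helper names `token_spans` and `misaligned` are hypothetical and are not part of Rasa.

```python
import re

def token_spans(text):
    """Return (start, end) character offsets of whitespace-separated tokens."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]

def misaligned(text, entities):
    """Return entities whose start or end is not on a token boundary."""
    spans = token_spans(text)
    starts = {s for s, _ in spans}
    ends = {e for _, e in spans}
    return [e for e in entities if e["start"] not in starts or e["end"] not in ends]

text = "I made the purchase 01/03 and still didn't receive it"
entities = [
    {"entity": "DD", "start": 20, "end": 22},
    {"entity": "MM", "start": 23, "end": 25},
]

# '01/03' is one token spanning (20, 25), so neither entity lines up with it.
bad = misaligned(text, entities)
print([e["entity"] for e in bad])  # ['DD', 'MM']
```

The annotation offsets themselves are fine; the problem is only that the tokenizer never produces a boundary at positions 22 and 23.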

lluchini · March 25, 2020, 12:48pm

Perhaps you have to escape the “/” character?

bayesianwannabe · March 25, 2020, 1:07pm

Hey Leonardo,

How would I do this in the Rasa pipeline? Should I include the escapes in the raw text as a previous layer?

I am also considering two alternatives:

Creating a new spaCy language model from the one I am using: copy it and add some special cases/rules, such as a regex that splits ‘01/03’ into three tokens (‘01’, ‘/’ and ‘03’), then load the new model at the beginning of the Rasa pipeline with something like:

- name: "SpacyNLP"
  model: "my_updated_language_model"

Creating a new NLU tokenizer component, using the WhitespaceTokenizer source code as a template and adding a regex for the date pattern. This would lose some of the ‘intelligence’ of the spaCy component, whose special rules keep ‘U.K.’ as a single token.
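As an illustration of the second alternative, the splitting logic could look roughly like the sketch below. This is only the tokenization step in plain Python, not a Rasa component: the function name `tokenize` and the pattern `DATE_SLASH` are made up for the example, and wiring it into the pipeline would still require following Rasa's custom component interface. It splits on whitespace first and then on a ‘/’ that sits between two digits, keeping character offsets so that entity positions stay valid.

```python
import re

# A '/' with a digit on both sides, e.g. the one in '01/03'.
DATE_SLASH = re.compile(r"(?<=\d)/(?=\d)")

def tokenize(text):
    """Whitespace tokenization plus a split on '/' between digits.

    Returns a list of (token_text, start_offset) pairs.
    """
    tokens = []
    for m in re.finditer(r"\S+", text):
        chunk, base = m.group(), m.start()
        pos = 0
        for sep in DATE_SLASH.finditer(chunk):
            if sep.start() > pos:
                tokens.append((chunk[pos:sep.start()], base + pos))
            tokens.append((sep.group(), base + sep.start()))
            pos = sep.end()
        if pos < len(chunk):
            tokens.append((chunk[pos:], base + pos))
    return tokens

tokens = tokenize("I made the purchase 01/03 and still didn't receive it")
print(tokens[4:7])  # [('01', 20), ('/', 22), ('03', 23)]
```

With this split, ‘01’ ends at offset 22 and ‘03’ starts at offset 23, which matches the DD and MM annotations. A token such as ‘U.K.’ is left whole only because the sketch splits on whitespace; other punctuation handling that spaCy provides is not reproduced here.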

I am leaning towards the custom spaCy language model, but I am still researching how to do this. Is there any reading that you would recommend?

lluchini · March 25, 2020, 1:24pm

I don’t know if I’m thinking too simply here, but I would do neither.

I would simply put samples in nlu.md to teach Rasa to recognize them as two distinct items/slots, like:

## intent:date
- Some text to intent [Day](day) /[Month](month)

bayesianwannabe · March 25, 2020, 1:38pm

Thank you for the reply.

I am using a custom annotator for the intents/entities and, because of the scale and the connection to a database, I am using the JSON format for the NLU examples.

Do you think maybe the markdown format deals with this in a different way?

The sentence (notice that there are no whitespaces in the date on the raw text):

‘I made the purchase 01/03 and still didn’t receive it’

Is annotated as:

  {
    "text": "I made the purchase 01/03 and still didn’t receive it",
    "intent": "purchase_not_received",
    "entities": [
      {
        "raw": "01",
        "value": "01",
        "entity": "DD",
        "start": 20,
        "end": 22
      },
      {
        "raw": "03",
        "value": "03",
        "entity": "MM",
        "start": 23,
        "end": 25
      }
    ]
  }
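A quick way to sanity-check an annotation like this one, and to see what the Markdown equivalent would look like, is sketched below. It assumes the Markdown training format marks entities inline as `[text](entity)`; the helpers `check_offsets` and `to_markdown` are hypothetical names written for this example, not Rasa functions.

```python
def check_offsets(example):
    """Return entities whose [start:end] slice of the text differs from 'value'."""
    text = example["text"]
    return [e for e in example["entities"]
            if text[e["start"]:e["end"]] != e["value"]]

def to_markdown(example):
    """Render the example with inline [text](entity) markers."""
    text = example["text"]
    parts, cursor = [], 0
    for e in sorted(example["entities"], key=lambda e: e["start"]):
        parts.append(text[cursor:e["start"]])
        parts.append("[{}]({})".format(text[e["start"]:e["end"]], e["entity"]))
        cursor = e["end"]
    parts.append(text[cursor:])
    return "".join(parts)

example = {
    "text": "I made the purchase 01/03 and still didn't receive it",
    "intent": "purchase_not_received",
    "entities": [
        {"value": "01", "entity": "DD", "start": 20, "end": 22},
        {"value": "03", "entity": "MM", "start": 23, "end": 25},
    ],
}

print(check_offsets(example))  # []
print(to_markdown(example))
# I made the purchase [01](DD)/[03](MM) and still didn't receive it
```

The offsets in the JSON are consistent with the text, and the Markdown rendering describes the same two spans, so switching formats alone would presumably not change how the tokenizer treats ‘01/03’.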

lluchini · March 25, 2020, 2:09pm

Perhaps you’re ahead of me here, since I am still using the stock options of Rasa / Rasa X. Could you create a simple demo bot, give it a few samples using the default annotator, and see if it works as expected?
