Misaligned entity annotation for '01/03' in sentence (...)

What are my options if I have defined two entities (DD and MM) and need to capture the following values: DD: 01 MM: 03

My entity data is annotated as presented above, but it seems that I need separate tokens for ‘01’ and ‘03’ while keeping their positions. What is the best modification I can adopt if I still want to use the language model from spaCy?

For my current data, I am getting the following warning from rasa train nlu: “Misaligned entity annotation for ‘01/03’ in sentence ‘I made the purchase 01/03 and still didn’t receive it’”
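A simplified sketch of why this kind of warning shows up, assuming the tokenizer keeps ‘01/03’ as a single token (as the warning suggests). The helpers `whitespace_tokens` and `is_aligned` are made up for illustration and are not Rasa's actual check; the idea is only that an entity span has to start and end on token boundaries.

```python
import re

def whitespace_tokens(text):
    """Return (token, start, end) triples for whitespace-separated tokens."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]

def is_aligned(tokens, start, end):
    """A span is aligned when it begins and ends on token boundaries."""
    starts = {s for _, s, _ in tokens}
    ends = {e for _, _, e in tokens}
    return start in starts and end in ends

text = "I made the purchase 01/03 and still didn't receive it"
tokens = whitespace_tokens(text)

print(is_aligned(tokens, 20, 25))  # whole date "01/03" -> True
print(is_aligned(tokens, 20, 22))  # "01" only -> False, ends inside a token
print(is_aligned(tokens, 23, 25))  # "03" only -> False, starts inside a token
```

So the DD and MM spans can only line up if the tokenizer produces ‘01’ and ‘03’ as tokens of their own.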

Perhaps you have to escape the “/” character?

Hey Leonardo,

How would I do this in the Rasa pipeline? Should I add the escapes to the raw text in a preprocessing step?

I am also considering two alternatives:

  • Creating a new spaCy language model based on the one I am using, with added special cases/rules (such as a regex) so that ‘01/03’ becomes three tokens: ‘01’, ‘/’ and ‘03’, then loading the new model at the beginning of the Rasa pipeline with something like:

    - name: "SpacyNLP"
      model: "my_updated_language_model"

  • Creating a new NLU tokenizer component, using the WhitespaceTokenizer source code as a template and adding the regex for the date pattern, but losing some of the ‘intelligence’ of the spaCy component, whose special rules keep ‘U.K.’ as a single token.
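For the second alternative, the token-splitting logic could look roughly like the sketch below. It is plain Python and not a Rasa component: `DATE_RE` and `tokenize` are hypothetical names, and the pattern only handles a bare `DD/MM` token with no attached punctuation. Wiring it into a custom tokenizer class is left out.

```python
import re

# Hypothetical pattern: two digit groups separated by a slash, e.g. "01/03".
DATE_RE = re.compile(r"(\d{1,2})(/)(\d{1,2})")

def tokenize(text):
    """Whitespace tokenization that also splits DD/MM tokens, keeping offsets."""
    tokens = []
    for m in re.finditer(r"\S+", text):
        date = DATE_RE.fullmatch(m.group())
        if date is None:
            # Ordinary token: keep it whole with its character offsets.
            tokens.append((m.group(), m.start(), m.end()))
            continue
        # Emit day, slash and month as separate tokens with absolute offsets.
        for i in (1, 2, 3):
            tokens.append(
                (date.group(i), m.start() + date.start(i), m.start() + date.end(i))
            )
    return tokens

text = "I made the purchase 01/03 and still didn't receive it"
date_tokens = [t for t in tokenize(text) if t[1] >= 20 and t[2] <= 25]
print(date_tokens)  # [('01', 20, 22), ('/', 22, 23), ('03', 23, 25)]
```

The offsets of ‘01’ and ‘03’ now match the positions I want to annotate, at the cost of not having spaCy's special cases.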

I am leaning toward the custom SpacyNLP language model, but I am still researching how to do this. Are there any readings that you would recommend?

I don’t know if I’m thinking too simply here, but I would do neither.

I would simply put samples in nlu.md to teach Rasa to recognize them as two distinct items/slots, like:

## intent:date
- Some text to intent [Day](day) /[Month](month)

Thank you for the reply.

As I am using a custom annotator for the intents/entities, and due to the scale and the connection to a database, I am using the JSON format for the NLU instances.

Do you think maybe the markdown format deals with this in a different way?

The sentence (notice that there is no whitespace in the date in the raw text):

‘I made the purchase 01/03 and still didn’t receive it’

Is annotated as:

  {
    "text": "I made the purchase 01/03 and still didn’t receive it",
    "intent": "purchase_not_received",
    "entities": [
      {
        "raw": "01",
        "value": "01",
        "entity": "DD",
        "start": 20,
        "end": 22
      },
      {
        "raw": "03",
        "value": "03",
        "entity": "MM",
        "start": 23,
        "end": 25
      }
    ]
  }
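As a quick sanity check on the annotation itself, the sketch below verifies that each `start`/`end` pair slices out exactly the annotated `value`. The function `bad_offsets` is a made-up helper, and the example is trimmed to the fields it needs (with a plain apostrophe in the text).

```python
import json

example = json.loads("""
{
  "text": "I made the purchase 01/03 and still didn't receive it",
  "entities": [
    {"value": "01", "entity": "DD", "start": 20, "end": 22},
    {"value": "03", "entity": "MM", "start": 23, "end": 25}
  ]
}
""")

def bad_offsets(example):
    """Return entities whose text[start:end] slice differs from their value."""
    text = example["text"]
    return [
        e for e in example["entities"]
        if text[e["start"]:e["end"]] != e["value"]
    ]

print(bad_offsets(example))  # [] -> the character offsets themselves are fine
```

So the offsets are consistent with the raw text, which points to the token boundaries rather than the annotation as the source of the warning.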

Perhaps you’re ahead of me, as I am still using the stock options of Rasa / Rasa X, but can you create a simple demo bot, give it some samples using the default annotator, and see if it works as expected?