What are my options if I have defined two entities (DD and MM) and need to capture the following values:
DD: 01
MM: 03
My entity data is annotated as shown above, but it seems that I need separate tokens for ‘01’ and ‘03’ while keeping their positions. What is the best modification I can adopt if I still want to use the language model from spaCy?
For my current data, I am getting the following warning from rasa train nlu:
“Misaligned entity annotation for ‘01/03’ in sentence ‘I made the purchase 01/03 and still didn’t receive it’”
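To illustrate why this warning shows up, here is a minimal sketch of a token-boundary check. It is not Rasa's actual implementation: whitespace tokenization stands in for the real tokenizer, and the helper names `token_spans` and `misaligned_entities` are made up for the example. The idea is that an entity is only usable if its start and end both coincide with token boundaries, and ‘01/03’ is a single token here.

```python
import re


def token_spans(text):
    """Return (start, end) character offsets of whitespace-delimited tokens."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def misaligned_entities(text, entities):
    """Return the entities whose start or end is not on a token boundary."""
    spans = token_spans(text)
    starts = {start for start, _ in spans}
    ends = {end for _, end in spans}
    return [
        ent for ent in entities
        if ent["start"] not in starts or ent["end"] not in ends
    ]


text = "I made the purchase 01/03 and still didn't receive it"
entities = [
    {"entity": "DD", "start": 20, "end": 22},
    {"entity": "MM", "start": 23, "end": 25},
]

# "01/03" spans characters 20-25 as one token, so neither entity lines up.
bad = misaligned_entities(text, entities)
print([ent["entity"] for ent in bad])  # → ['DD', 'MM']
```

DD ends in the middle of the token and MM starts in the middle of it, so both are reported.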
How would I do this in the Rasa pipeline? Should I add the escapes to the raw text in an earlier layer, before it reaches the pipeline?
I am also considering two alternatives:
Creating a new spaCy language model based on the one I am using: copying it and adding some special cases/rules, such as a regex that turns ‘01/03’ into three tokens (‘01’, ‘/’ and ‘03’), then loading the new model at the beginning of the Rasa pipeline with something like:
name: "SpacyNLP"
model: "my_updated_language_model"
Creating a new NLU tokenizer component using the WhitespaceTokenizer source code as a template and adding a regex for the date pattern, but losing some of the ‘intelligence’ of the spaCy component, whose special rules keep ‘U.K.’ as a single token.
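For the second alternative, the core of the tokenizer could look roughly like the sketch below. It only shows the splitting logic in plain Python; wiring it into a Rasa tokenizer component is not shown, and `DATE_PATTERN` and `tokenize` are hypothetical names. It assumes a date is a whole whitespace-delimited chunk of the form DD/MM, so a date followed directly by punctuation would not be split.

```python
import re

# Hypothetical date pattern: one or two digits, a slash, one or two digits.
DATE_PATTERN = re.compile(r"(\d{1,2})(/)(\d{1,2})")


def tokenize(text):
    """Return (token, start_offset) pairs.

    Text is split on whitespace first; a chunk that is exactly a DD/MM
    date is further split into three tokens with their original offsets.
    """
    tokens = []
    for chunk in re.finditer(r"\S+", text):
        date = DATE_PATTERN.fullmatch(chunk.group())
        if date:
            for group in (1, 2, 3):
                tokens.append((date.group(group), chunk.start() + date.start(group)))
        else:
            tokens.append((chunk.group(), chunk.start()))
    return tokens


tokens = tokenize("I made the purchase 01/03 and still didn't receive it")
print(tokens[4:7])  # → [('01', 20), ('/', 22), ('03', 23)]
```

The offsets 20 and 23 match the `start` values in the annotation, which is what the alignment check needs.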
I am leaning toward the custom spaCy language model, but I am still researching how to do this. Are there any readings you would recommend?
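A minimal sketch of the spaCy route, under some assumptions: `spacy.blank("en")` stands in for the real model (which would be loaded with `spacy.load(...)`), and the extra infix pattern is just one possible rule for a slash between two digits. Only the infix rules are extended, so the tokenizer's existing special cases are left untouched.

```python
import spacy
from spacy.util import compile_infix_regex

# Assumption: a blank English pipeline stands in for the real language model.
nlp = spacy.blank("en")

# Extend the default infix rules with one more: a slash with a digit on each side.
infixes = list(nlp.Defaults.infixes) + [r"(?<=[0-9])/(?=[0-9])"]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

doc = nlp("I made the purchase 01/03 and still didn't receive it")
tokens = [(token.text, token.idx) for token in doc]
print(tokens)

# The modified pipeline can then be saved under the name used in the config.
# nlp.to_disk("my_updated_language_model")
```

Whether `SpacyNLP` accepts a directory saved this way or needs the model installed as a package may depend on the Rasa and spaCy versions, so that part is worth checking against the docs for your setup.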
As I am using a custom annotator for the intents/entities, and because of the scale and the connection to a database, I am using the JSON format for the NLU instances.
Do you think the markdown format might deal with this in a different way?
The sentence (notice that there is no whitespace in the date in the raw text):
‘I made the purchase 01/03 and still didn’t receive it’
Is annotated as:
{
"text": "I made the purchase 01/03 and still didn’t receive it",
"intent": "purchase_not_received",
"entities": [
{
"raw": "01",
"value": "01",
"entity": "DD",
"start": 20,
"end": 22
},
{
"raw": "03",
"value": "03",
"entity": "MM",
"start": 23,
"end": 25
}
]
}
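As a quick sanity check on the annotation itself, the character offsets can be verified by slicing the text. This is only a sketch in plain Python, with the example above pasted in as a string:

```python
import json

example = json.loads('''
{
  "text": "I made the purchase 01/03 and still didn’t receive it",
  "intent": "purchase_not_received",
  "entities": [
    {"raw": "01", "value": "01", "entity": "DD", "start": 20, "end": 22},
    {"raw": "03", "value": "03", "entity": "MM", "start": 23, "end": 25}
  ]
}
''')

# Slice the text with each entity's offsets and compare with its value.
slices = [
    example["text"][ent["start"]:ent["end"]]
    for ent in example["entities"]
]
print(slices)  # → ['01', '03']
```

The offsets point at exactly ‘01’ and ‘03’, so the annotation is consistent with the text; the mismatch is with the token boundaries, not with the offsets.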
Perhaps you’re ahead of me; I am still using the stock options of Rasa / Rasa X. But can you create a simple demo bot, give it some samples using the default annotator, and see if it works as expected?