Hello everyone!
I was wondering if someone could help me with the following issue: in Rasa 3.0.9, I need a model that extracts an entity made up of exactly 7 or 8 digits, with any whitespace removed from the extracted value.
For example, given the input “1234 567”, the entity’s value should be “1234567”. The same applies to inputs such as “ 12345678”, “12345678 ” or “1 2 3 4 5 6 7 8”: in all of these the value should be “12345678”, with no leading, trailing or inner spaces.
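To make the expected behaviour concrete, here is a minimal sketch in plain Python (standard `re` module only, not Rasa code). The function name `normalize_dni` is just something I made up for illustration; it only shows the input/output mapping I'm after.

```python
import re
from typing import Optional


def normalize_dni(raw: str) -> Optional[str]:
    """Return the digits of `raw` with all whitespace removed,
    or None if the result is not exactly 7 or 8 digits."""
    # Drop leading, trailing and inner whitespace in one pass.
    digits = re.sub(r"\s+", "", raw)
    # Accept only strings made of 7 or 8 digits and nothing else.
    return digits if re.fullmatch(r"\d{7,8}", digits) else None


print(normalize_dni("1234 567"))
print(normalize_dni("1 2 3 4 5 6 7 8"))
```

So “1234 567” should map to “1234567”, the three 8-digit variants should all map to “12345678”, and anything with fewer than 7 or more than 8 digits should not be accepted.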
Next message:
30 000 000
{
  "text": "30 000 000",
  "intent": {
    "name": "identificarse",
    "confidence": 1.0
  },
  "entities": [
    {
      "entity": "dni",
      "start": 0,
      "end": 10,
      "value": "30 000 000",
      "extractor": "RegexEntityExtractor"
    }
  ],
  "text_tokens": [
    [0, 2],
    [3, 6],
    [7, 10]
  ],
  "intent_ranking": [
    {
      "name": "identificarse",
      "confidence": 1.0
    },
    {
      "name": "saludo",
      "confidence": 6.369950678042358e-10
    }
  ],
  "response_selector": {
    "all_retrieval_intents": [],
    "default": {
      "response": {
        "responses": null,
        "confidence": 0.0,
        "intent_response_key": null,
        "utter_action": "utter_None"
      },
      "ranking": []
    }
  }
}
I tried the following regex, but the entity’s value is still extracted with the spaces in it:
(?<!\d)(?<!\d )(?:(?:\d *){7}|(?:\d *){8})(?<! )(?! ?\d)
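As far as I understand it, a regex match is always one contiguous slice of the input, so the pattern can decide where the entity starts and ends but cannot drop the spaces inside it. Here is a small check of the pattern above using Python's standard `re` module (outside of Rasa; the variable names are mine):

```python
import re

# The same pattern as above, compiled with Python's `re` module.
DNI_PATTERN = re.compile(
    r"(?<!\d)(?<!\d )(?:(?:\d *){7}|(?:\d *){8})(?<! )(?! ?\d)"
)

match = DNI_PATTERN.search("30 000 000")
matched_text = match.group(0)  # the slice of the input that matched
matched_span = match.span()    # (start, end) offsets of that slice

# Removing the spaces has to happen as a separate step after matching.
compact = matched_text.replace(" ", "")

print(repr(matched_text), matched_span, repr(compact))
```

The match covers the whole input including its spaces, at the same offsets as the `start` and `end` in the output above, and the value only becomes “30000000” after the extra `replace` step.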
I also tried disabling the WhitespaceTokenizer in the pipeline, but then the model throws an error when I try to train it.
Is it possible to solve this type of extraction with a regex or some pipeline component (like a tokenizer or a featurizer)? Or is it something that can only be solved with custom actions?
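If post-processing turns out to be the way to go, this is roughly the step I imagine, written as a plain Python function over the `entities` list from the parse output above. It is only a sketch: `strip_spaces_from_dni` is a hypothetical name, and I don't know yet where it would hook into Rasa (a custom pipeline component, a custom action or slot validation), so that wiring is not shown.

```python
import re
from typing import Any, Dict, List


def strip_spaces_from_dni(entities: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return a copy of `entities` where every `dni` value has its
    whitespace removed. Offsets and other fields are left untouched."""
    cleaned = []
    for entity in entities:
        entity = dict(entity)  # shallow copy so the input is not mutated
        if entity.get("entity") == "dni" and isinstance(entity.get("value"), str):
            entity["value"] = re.sub(r"\s+", "", entity["value"])
        cleaned.append(entity)
    return cleaned


# The entity exactly as it appears in the parse output above.
example = [
    {
        "entity": "dni",
        "start": 0,
        "end": 10,
        "value": "30 000 000",
        "extractor": "RegexEntityExtractor",
    }
]
result = strip_spaces_from_dni(example)
print(result[0]["value"])
```

The `start` and `end` offsets still point at the original text, which is what I would expect, since only the value is normalized.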
Thank you so much for your help!