Remove whitespace from entity

Hello everyone!

I was wondering if someone could help me with the following issue: in Rasa 3.0.9, I need to build a model that extracts an entity consisting of exactly 7 or 8 digits, with no whitespace in the extracted value.

For example, given the input “1234 567”, the entity’s value should be extracted as “1234567”. The same applies to inputs such as “ 12345678”, “12345678 ” or “1 2 3 4 5 6 7 8”: in each of these cases the entity’s value should be “12345678”, with no leading, trailing, or inner spaces.

Next message:
30 000 000
{
  "text": "30 000 000",
  "intent": {
    "name": "identificarse",
    "confidence": 1.0
  },
  "entities": [
    {
      "entity": "dni",
      "start": 0,
      "end": 10,
      "value": "30 000 000",
      "extractor": "RegexEntityExtractor"
    }
  ],
  "text_tokens": [
    [
      0,
      2
    ],
    [
      3,
      6
    ],
    [
      7,
      10
    ]
  ],
  "intent_ranking": [
    {
      "name": "identificarse",
      "confidence": 1.0
    },
    {
      "name": "saludo",
      "confidence": 6.369950678042358e-10
    }
  ],
  "response_selector": {
    "all_retrieval_intents": [],
    "default": {
      "response": {
        "responses": null,
        "confidence": 0.0,
        "intent_response_key": null,
        "utter_action": "utter_None"
      },
      "ranking": []
    }
  }
}

I tried the following regex, but the entity’s value is still extracted with the spaces included:

(?<!\d)(?<!\d )(?:(?:\d *){7}|(?:\d *){8})(?<! )(?! ?\d)

I also tried removing the WhitespaceTokenizer from the pipeline, but then the model throws an error when I try to train it.

Is it possible to solve this type of extraction with a regex or some pipeline component (like a tokenizer or a featurizer)? Or is it something that can only be solved with custom actions?

Thank you so much for your help!

The easiest way is to do it in a custom action: extract the entity with the spaces, as happens now, then remove the spaces from the string before storing it in the slot.
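Here is a minimal sketch of that cleanup step in plain Python, so it can be tried without a running bot. The helper name `clean_digit_entities` is made up for illustration, and the `parsed` dict is just a trimmed copy of the NLU output from the question. Inside a real custom action you would presumably run the same logic on the entities in `tracker.latest_message` and return the result with a `SlotSet` event, but that wiring is an assumption and is not shown here.

```python
import re
from typing import Any, Dict, List


def clean_digit_entities(message: Dict[str, Any], entity_name: str = "dni") -> List[str]:
    """Return the values of `entity_name` with all whitespace removed.

    Only results that are exactly 7 or 8 digits after cleanup are kept.
    (Hypothetical helper, not part of Rasa.)
    """
    cleaned = []
    for entity in message.get("entities", []):
        if entity.get("entity") != entity_name:
            continue
        # Drop leading, trailing, and inner whitespace in one pass.
        value = re.sub(r"\s+", "", str(entity.get("value", "")))
        # Keep the value only if it is a 7- or 8-digit number.
        if re.fullmatch(r"\d{7,8}", value):
            cleaned.append(value)
    return cleaned


# Trimmed copy of the parse result shown in the question.
parsed = {
    "text": "30 000 000",
    "entities": [
        {
            "entity": "dni",
            "start": 0,
            "end": 10,
            "value": "30 000 000",
            "extractor": "RegexEntityExtractor",
        }
    ],
}

print(clean_digit_entities(parsed))  # ['30000000']
```

The regex in the pipeline keeps doing the matching; this step only normalizes the matched text afterwards, since a regex match by itself cannot drop characters from the span it matched.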

Another, harder option is to write a custom pipeline component that does the same cleanup.
