Recognize entities separated by a dash

(ger) #1

In German, entities are often separated by just a dash.

For example:

Zeige mir alle Verbindungen für Zürich-Bern.

I have an intent “recognize_connection” with this training sample:

Zeige mir alle Verbindungen für [Zürich](start)-[Bern](end)

The idea is to recognize the start point and the end point of the connection as separate entities.

This currently does not work. I think it is because I use the spaCy tokenizer (tokenizer_spacy in my nlu_config), which splits the sentence into tokens on whitespace. This means “Zürich-Bern” is a single token, which makes it difficult to find multiple entities in it. At least, this is my current assumption.
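To make the assumption concrete, here is a minimal sketch in plain Python. It does not use spaCy; it assumes a tokenizer that splits on whitespace only, and the helper names `whitespace_tokens` and `aligns_with_tokens` are made up for this illustration. It checks whether an annotated entity starts and ends exactly on token boundaries, which token-based entity extractors generally rely on.

```python
def whitespace_tokens(text):
    """Return (token, start, end) triples, splitting on whitespace only."""
    tokens = []
    offset = 0
    for word in text.split():
        start = text.index(word, offset)
        end = start + len(word)
        tokens.append((word, start, end))
        offset = end
    return tokens


def aligns_with_tokens(text, entity):
    """True if `entity` begins and ends exactly on token boundaries."""
    tokens = whitespace_tokens(text)
    starts = {start for _, start, _ in tokens}
    ends = {end for _, _, end in tokens}
    begin = text.index(entity)
    return begin in starts and (begin + len(entity)) in ends


dashed = "Zeige mir alle Verbindungen für Zürich-Bern"
spaced = "Zeige mir alle Verbindungen für Zürich Bern"

print([tok for tok, _, _ in whitespace_tokens(dashed)][-1])  # Zürich-Bern
print(aligns_with_tokens(dashed, "Zürich"))  # False
print(aligns_with_tokens(dashed, "Bern"))    # False
print(aligns_with_tokens(spaced, "Zürich"))  # True
print(aligns_with_tokens(spaced, "Bern"))    # True
```

With the dash in place, neither city name lines up with a token of its own, which would match the behaviour described above.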

How could that be solved?

(Steve) #2

I wonder if adding a custom component at the start of your NLU pipeline to remove the dashes between the words would work?

So user input such as…

Zeige mir alle Verbindungen für Zürich-Bern.

…would become…

Zeige mir alle Verbindungen für Zürich Bern.

This would allow you to have training examples like the following…

Zeige mir alle Verbindungen für [Zürich](start) [Bern](end)

Something like this might work as a custom component, as long as dashes are not important elsewhere in your input (negative numbers, perhaps?). You may need to add any missing imports and adjust the TODO line if you want to remove any other punctuation…

# -*- coding: utf-8 -*-

import re
from rasa_nlu.components import Component

class Preprocessor(Component):
    name = "remove_punctuation"
    provides = [name]

    def __init__(self, component_config=None):
        super(Preprocessor, self).__init__(component_config)

    def process(self, message, **kwargs):
        phrase = message.text

        # Remove unneeded punctuation.
        # TODO: add any punctuation marks you want removed...
        exclude = "-"
        table = str.maketrans(exclude, ' ' * len(exclude))
        phrase = phrase.translate(table)

        # convert 2+ spaces -> 1 space.
        phrase = re.sub(r'\s{2,}', " ", phrase)

        # Trim leading or trailing spaces...
        phrase = phrase.strip()

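        # Record the original and cleansed text on the message.
        # Assumed key: this component's name, matching `provides` above.
        found = []
        entry = {
            "original": message.text,
            "cleansed": phrase
        }
        found.append(entry)
        message.set(self.name, message.get(self.name, []) + found,
                    add_to_output=True)

        # Later components in the pipeline see the cleansed text.
        message.text = phrase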
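On the negative-number caveat: one possible refinement, sketched here in plain Python outside Rasa, is to replace a dash only when it sits directly between two letters. The function name `split_dashed_words` is hypothetical, and the rule is an assumption about what your inputs look like; it would also split ordinary hyphenated words such as “E-Mail”.

```python
import re

# A dash with a letter (any Unicode letter, but no digit or underscore)
# directly on both sides.
_DASH_BETWEEN_LETTERS = re.compile(r"(?<=[^\W\d_])-(?=[^\W\d_])")


def split_dashed_words(text):
    """Replace dashes between letters with a space; leave other dashes alone."""
    return _DASH_BETWEEN_LETTERS.sub(" ", text)


print(split_dashed_words("Zeige mir alle Verbindungen für Zürich-Bern."))
# Zeige mir alle Verbindungen für Zürich Bern.
print(split_dashed_words("Temperatur -5 Grad"))
# Temperatur -5 Grad
```

Inside `process`, this substitution could stand in for the `str.maketrans` step if that behaviour suits your data.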

You’ll need to add it to your pipeline to try it.
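As a rough sketch of what that could look like, here is the pipeline written as a Python dict mirroring the structure of an nlu_config file. Several things are assumptions: that the class above is saved in a module called `preprocessor.py` (a made-up name) that is importable when training, that your Rasa NLU version accepts a module path as a component name, and that the remaining entries match your existing pipeline. Only `tokenizer_spacy` comes from the original question; the other component names are placeholders to show the ordering.

```python
# Hypothetical pipeline layout; adjust every entry to match your own
# nlu_config. Only the ordering idea is the point here.
nlu_config = {
    "language": "de",
    "pipeline": [
        # Custom component first, so later components see the cleansed text.
        {"name": "preprocessor.Preprocessor"},  # assumed module path
        {"name": "nlp_spacy"},                  # assumed; loads the spaCy model
        {"name": "tokenizer_spacy"},            # from the original question
        {"name": "ner_crf"},                    # assumed entity extractor
    ],
}

# Sanity check: the preprocessor must run before the tokenizer.
names = [component["name"] for component in nlu_config["pipeline"]]
print(names.index("preprocessor.Preprocessor") < names.index("tokenizer_spacy"))
```

Since the component only rewrites incoming messages, the training examples would still need to be written with a space instead of a dash, as shown earlier.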