Ok, I see what's happening here. In these two functions in `rasa.nlu.training_data.message`, notice how the `text` attribute is handled differently from the other attributes: `message.data["text"]` will be what you're expecting, but `message.text` will be left untouched.
```python
def set(self, prop, info, add_to_output=False) -> None:
    self.data[prop] = info
    if add_to_output:
        self.output_properties.add(prop)

def get(self, prop, default=None) -> Any:
    if prop == TEXT_ATTRIBUTE:
        return self.text
    return self.data.get(prop, default)
```
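To see the asymmetry concretely, here is a minimal stand-in for that `Message` class (just the two methods above plus the attributes they touch, not the real Rasa class) showing that `set("text", ...)` updates `data` but `get("text")` keeps returning the original text:

```python
from typing import Any

TEXT_ATTRIBUTE = "text"


class Message:
    """Stripped-down stand-in for rasa.nlu.training_data.message.Message,
    just enough to demonstrate the get/set asymmetry."""

    def __init__(self, text: str) -> None:
        self.text = text
        self.data = {}
        self.output_properties = set()

    def set(self, prop, info, add_to_output=False) -> None:
        self.data[prop] = info
        if add_to_output:
            self.output_properties.add(prop)

    def get(self, prop, default=None) -> Any:
        if prop == TEXT_ATTRIBUTE:
            return self.text  # ignores self.data entirely for "text"
        return self.data.get(prop, default)


msg = Message("Are you a bot?")
msg.set("text", "be -PRON- a bot?")

print(msg.data["text"])  # "be -PRON- a bot?" -- the value you set
print(msg.get("text"))   # "Are you a bot?"  -- still the original text
```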
You can get around this by doing:
```python
def process(self, message: Message, **kwargs: Any) -> None:
    message.text = self.tokenize(message.text)
```
However, since you're creating a tokenizer, it would make more sense for your class to return tokens rather than text; tokens are stored in `message.data` and therefore handled correctly. Running `WhitespaceTokenizer` over pre-tokenized text doesn't really make sense, in my opinion.
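A tokens-first sketch of that idea: instead of writing a lemmatized string back to `message.text`, build token objects that pair each lemma with its character offset and store those. The `Token` class below is a minimal stand-in for Rasa's `Token`, and `lemma_tokens` plus its toy lemma table are hypothetical helpers for illustration:

```python
from typing import Dict, List, NamedTuple


class Token(NamedTuple):
    """Minimal stand-in for Rasa's Token: the token text plus its
    character offset into the original message text."""
    text: str
    offset: int


def lemma_tokens(text: str, lemmas: Dict[str, str]) -> List[Token]:
    """Hypothetical helper: split on whitespace, replace each word with
    its lemma (falling back to the word itself), keep original offsets."""
    tokens, cursor = [], 0
    for word in text.split():
        cursor = text.index(word, cursor)       # offset in the raw text
        tokens.append(Token(lemmas.get(word.lower(), word), cursor))
        cursor += len(word)
    return tokens


# "Are you a bot?" with a toy lemma table (spaCy maps pronouns to -PRON-)
toy_lemmas = {"are": "be", "you": "-PRON-"}
for tok in lemma_tokens("Are you a bot?", toy_lemmas):
    print(tok.text, tok.offset)
```

Because the lemmas live in the tokens rather than in `message.text`, nothing downstream needs to re-tokenize the sentence, which sidesteps the `WhitespaceTokenizer` problem entirely.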
For example, I passed the sentence "Are you a bot?" to the model, and `Lemmatizationtokenizer` returns

```
be -PRON- a bot?
```

but after `WhitespaceTokenizer` is done with it, you get

```
be pron a bot?
```