Ok, I see what’s happening here. In these two functions in rasa.nlu.training_data.message, you can see that the text attribute is handled differently from the other attributes: set() writes into the data dict, while get() short-circuits for the text attribute and returns self.text. So message.data["text"] will be what you’re expecting, but message.text will be left untouched.

    def set(self, prop, info, add_to_output=False) -> None:
        # set() only ever writes into the data dict; self.text is not updated.
        self.data[prop] = info
        if add_to_output:
            self.output_properties.add(prop)

    def get(self, prop, default=None) -> Any:
        # get() short-circuits for the text attribute and returns self.text,
        # not whatever set() put into data["text"].
        if prop == TEXT_ATTRIBUTE:
            return self.text
        return self.data.get(prop, default)
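
To make the asymmetry concrete, here is what you would observe (a quick sketch, assuming the Rasa 1.x Message class quoted above):

    from rasa.nlu.training_data import Message

    message = Message("Are you a bot?")
    message.set("text", "be -PRON- a bot?")

    print(message.data["text"])  # "be -PRON- a bot?"  (set() wrote into the data dict)
    print(message.text)          # "Are you a bot?"    (the attribute itself is untouched)
    print(message.get("text"))   # "Are you a bot?"    (get() returns self.text)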

You can get around this by doing:

    def process(self, message: Message, **kwargs: Any) -> None:
        # Assign to message.text directly so the change is also visible via
        # message.text / message.get("text") later in the pipeline.
        message.text = self.tokenize(message.text)

However, since you’re creating a tokenizer, it would make sense for your class to return tokens rather than text; tokens are handled correctly downstream. Running WhitespaceTokenizer over pre-tokenized text doesn’t really make sense, in my opinion.

For example, when I passed the sentence “Are you a bot?” through the pipeline, Lemmatizationtokenizer returned

be -PRON- a bot?

But after WhitespaceTokenizer is done with it, you get

be   pron   a bot?
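
If you do go the token route, a rough sketch of what that could look like is below. To be clear, this is not your actual component, just an illustration: it assumes Rasa 1.x’s Tokenizer base class in rasa.nlu.tokenizers.tokenizer, a Token(text, start) constructor, and a spaCy model loaded inside the component, so adjust it to whatever your Lemmatizationtokenizer already does.

    from typing import Any, Dict, List, Optional, Text

    import spacy

    from rasa.nlu.tokenizers.tokenizer import Token, Tokenizer
    from rasa.nlu.training_data import Message


    class LemmaTokenizer(Tokenizer):
        """Hypothetical sketch: one Token per word, with the lemma as the token text."""

        def __init__(self, component_config: Optional[Dict[Text, Any]] = None) -> None:
            super().__init__(component_config)
            # Assumes a spaCy English model is installed; swap in your own model.
            self.nlp = spacy.load("en_core_web_sm")

        def tokenize(self, message: Message, attribute: Text) -> List[Token]:
            doc = self.nlp(message.get(attribute))
            # One Token per spaCy word: the lemma as the token text, the original
            # character offset as the start position. Downstream featurizers then
            # work on these tokens directly.
            return [Token(word.lemma_, word.idx) for word in doc]

With something like this at the front of the pipeline you would drop WhitespaceTokenizer entirely, since the tokens already exist and the lemmas never get mangled by a second round of tokenization.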