Ok, I see what's happening here. In these two functions in `rasa.nlu.training_data.message`, notice how the `text` attribute is handled differently from the other attributes: `message.data["text"]` will be what you're expecting, but `message.text` will be left untouched.
```python
def set(self, prop, info, add_to_output=False) -> None:
    self.data[prop] = info
    if add_to_output:
        self.output_properties.add(prop)

def get(self, prop, default=None) -> Any:
    if prop == TEXT_ATTRIBUTE:
        return self.text
    return self.data.get(prop, default)
```
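To see the asymmetry concretely, here is a minimal stand-in for that `Message` class (just the two methods above plus the attributes they touch, not the real Rasa class) showing that `set("text", ...)` updates `data` but `get("text")` keeps returning the original text:

```python
from typing import Any

TEXT_ATTRIBUTE = "text"


class Message:
    """Stripped-down stand-in for rasa.nlu.training_data.message.Message,
    just enough to demonstrate the get/set asymmetry."""

    def __init__(self, text: str) -> None:
        self.text = text
        self.data = {}
        self.output_properties = set()

    def set(self, prop, info, add_to_output=False) -> None:
        self.data[prop] = info
        if add_to_output:
            self.output_properties.add(prop)

    def get(self, prop, default=None) -> Any:
        if prop == TEXT_ATTRIBUTE:
            return self.text  # ignores self.data entirely for "text"
        return self.data.get(prop, default)


msg = Message("Are you a bot?")
msg.set("text", "be -PRON- a bot?")

print(msg.data["text"])  # "be -PRON- a bot?" -- the value you set
print(msg.get("text"))   # "Are you a bot?"  -- still the original text
```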
You can get around this by doing:
```python
def process(self, message: Message, **kwargs: Any) -> None:
    message.text = self.tokenize(message.text)
```
However, since you're creating a tokenizer, it would make more sense for your class to return tokens rather than text; tokens are stored in `message.data` and therefore handled correctly. Running `WhitespaceTokenizer` over pre-tokenized text doesn't really make sense, in my opinion.
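A tokens-first sketch of that idea: instead of writing a lemmatized string back to `message.text`, build token objects that pair each lemma with its character offset and store those. The `Token` class below is a minimal stand-in for Rasa's `Token`, and `lemma_tokens` plus its toy lemma table are hypothetical helpers for illustration:

```python
from typing import Dict, List, NamedTuple


class Token(NamedTuple):
    """Minimal stand-in for Rasa's Token: the token text plus its
    character offset into the original message text."""
    text: str
    offset: int


def lemma_tokens(text: str, lemmas: Dict[str, str]) -> List[Token]:
    """Hypothetical helper: split on whitespace, replace each word with
    its lemma (falling back to the word itself), keep original offsets."""
    tokens, cursor = [], 0
    for word in text.split():
        cursor = text.index(word, cursor)       # offset in the raw text
        tokens.append(Token(lemmas.get(word.lower(), word), cursor))
        cursor += len(word)
    return tokens


# "Are you a bot?" with a toy lemma table (spaCy maps pronouns to -PRON-)
toy_lemmas = {"are": "be", "you": "-PRON-"}
for tok in lemma_tokens("Are you a bot?", toy_lemmas):
    print(tok.text, tok.offset)
```

Because the lemmas live in the tokens rather than in `message.text`, nothing downstream needs to re-tokenize the sentence, which sidesteps the `WhitespaceTokenizer` problem entirely.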
For example, I passed the sentence "Are you a bot?" to the model, and `Lemmatizationtokenizer` returns

```
be -PRON- a bot?
```

but after `WhitespaceTokenizer` is done with it, you get

```
be pron a bot?
```