Purpose of Tokenizer when using LanguageModelFeaturizer

Hello Rasa Team,

With the LanguageModelTokenizer being deprecated and the LanguageModelFeaturizer taking over its behavior, I am wondering what effect using any particular tokenizer in the pipeline has on the outcome.

To my understanding, the LanguageModelFeaturizer does the tokenization itself, so it should get the complete examples as input. Is that right? If so, are the tokens from the arbitrary tokenizer component used in any step?

Thank you

The LanguageModelFeaturizer doesn’t use the text directly; it uses the texts of the tokens returned by the Tokenizer.
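For context, a minimal pipeline illustrating this dependency might look like the following sketch (the `model_weights` value is only an example; use weights compatible with your chosen `model_name` and language):

```yaml
pipeline:
  - name: WhitespaceTokenizer        # produces the tokens the featurizer consumes
  - name: LanguageModelFeaturizer
    model_name: bert                 # which transformer architecture to load
    model_weights: bert-base-uncased # example weights, not a recommendation
  - name: DIETClassifier             # downstream component using the dense features
```

Without a tokenizer component ahead of it, the featurizer has no tokens to featurize, which is why the pipeline refuses to start.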

Then I don’t understand how it implements the behavior of the LanguageModelTokenizer, and how, when using for example a BERT model in the LMFeaturizer, the BERT model gets the tokens in the format it was trained on.

It takes the tokens, then feeds their text into the LM’s own tokenizer, which may or may not tokenize that text further; whatever the LM tokenizer produces is then fed into the LM featurizer.
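To make the two stages concrete, here is a toy, self-contained sketch of that flow: stage 1 is a whitespace split (what a pipeline Tokenizer would produce), and stage 2 is a greedy WordPiece-style sub-word split, as a BERT tokenizer would apply to each token. The vocabulary here is a tiny made-up one for illustration, not BERT's real vocabulary, and the real featurizer of course uses the LM's own tokenizer rather than this toy:

```python
# Toy vocabulary standing in for an LM tokenizer's sub-word vocabulary.
VOCAB = {"play", "##ing", "##ed", "the", "game"}

def whitespace_tokenize(text):
    """Stage 1: what a pipeline Tokenizer (e.g. WhitespaceTokenizer) produces."""
    return text.split()

def wordpiece(token, vocab=VOCAB):
    """Stage 2: greedy longest-match sub-word split, WordPiece style."""
    pieces, start = [], 0
    while start < len(token):
        end = len(token)
        while end > start:
            # Continuation pieces are prefixed with "##", as in WordPiece.
            piece = token[start:end] if start == 0 else "##" + token[start:end]
            if piece in vocab:
                pieces.append(piece)
                start = end
                break
            end -= 1
        else:
            return ["[UNK]"]  # no vocabulary match for the remainder
    return pieces

tokens = whitespace_tokenize("playing the game")
sub_tokens = [piece for tok in tokens for piece in wordpiece(tok)]
print(sub_tokens)  # ['play', '##ing', 'the', 'game']
```

So the LM tokenizer may leave a pipeline token intact ("the", "game") or split it further ("playing" becomes "play" + "##ing"); either way it operates on the tokens' texts, not on the raw message.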

Why does it require a tokenizer before the LanguageModelFeaturizer rather than just using the tokenizer built into the language model? And if it does, what criteria should be used to select that tokenizer?


I’m also interested to know: the LanguageModelFeaturizer requires a Tokenizer component. If there is already an LM tokenizer inside the LanguageModelFeaturizer, why do we need to specify a separate tokenizer?