With the LanguageModelTokenizer being deprecated and the LanguageModelFeaturizer taking over its behavior, I am wondering what effect using an arbitrary tokenizer in the pipeline has on the outcome.
To my understanding, the LanguageModelFeaturizer does the tokenization itself, so it should receive the complete examples as input. Is that right? If so, are the tokens from the arbitrary tokenizer component used in any step at all?
In that case I don’t understand how it implements the behavior of the LanguageModelTokenizer, or how, when using for example a BERT model in the LanguageModelFeaturizer, the BERT model gets the tokens in the format it was trained on.
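To illustrate what I mean: my understanding is that the whitespace-level tokens get re-split into the model's own sub-token vocabulary. Here is a toy WordPiece-style sketch of that idea (the vocabulary and helper below are made up for illustration, not BERT's actual vocabulary or Rasa's internals):

```python
# Toy WordPiece-style tokenizer to illustrate how a BERT-like model
# splits each whitespace token into sub-tokens from a fixed vocabulary.
# (Illustrative only -- this tiny vocabulary is invented, not BERT's.)
VOCAB = {"play", "##ing", "foot", "##ball", "the"}

def wordpiece(token, vocab=VOCAB):
    """Greedy longest-match-first sub-token split, as WordPiece does."""
    subtokens, start = [], 0
    while start < len(token):
        end = len(token)
        while end > start:
            piece = token[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces are marked
            if piece in vocab:
                subtokens.append(piece)
                break
            end -= 1
        else:
            return ["[UNK]"]  # no vocabulary piece matched at all
        start = end
    return subtokens

# A whitespace tokenizer hands over whole words; the featurizer's
# LM tokenizer would then re-split them into model sub-tokens.
print([wordpiece(t) for t in "playing football".split()])
# [['play', '##ing'], ['foot', '##ball']]
```

So is this roughly what happens inside the LanguageModelFeaturizer, with the sub-token features then aligned back to the original tokens?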
Why does the pipeline require a tokenizer before the LanguageModelFeaturizer rather than just using the tokenizer built into the language model? And if it does, what criteria should be used to select that tokenizer?
I’m also interested to know: the LanguageModelFeaturizer requires a Tokenizer component. If there is already an LM tokenizer inside the LanguageModelFeaturizer, why do we need to specify a separate tokenizer?
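For concreteness, this is the kind of pipeline I mean (a sketch in the Rasa 2.x config style; the `model_weights` value is just an example, adjust for your own setup):

```yaml
pipeline:
  - name: WhitespaceTokenizer        # required tokenizer component
  - name: LanguageModelFeaturizer    # has its own LM tokenizer internally
    model_name: "bert"
    model_weights: "bert-base-uncased"
```

Here the WhitespaceTokenizer seems redundant with the featurizer's internal tokenization, which is what prompts the question.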