With the LanguageModelTokenizer being deprecated and the LanguageModelFeaturizer taking over its behavior, I am wondering what effect using an arbitrary tokenizer in the pipeline has on the outcome.
To my understanding, the LanguageModelFeaturizer does the tokenization itself, so it should receive the complete examples as input. Is that right? If so, are the tokens from the arbitrary tokenizer component used in any step at all?
In that case I don’t understand how it implements the behavior of the LanguageModelTokenizer, or how, when using for example a BERT model in the LanguageModelFeaturizer, the BERT model gets the tokens in the format it was trained on.
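To illustrate what I mean: my understanding is that the whitespace-level tokens get re-split into the model's own sub-token vocabulary. Here is a toy WordPiece-style sketch of that idea (the vocabulary and helper below are made up for illustration, not BERT's actual vocabulary or Rasa's internals):

```python
# Toy WordPiece-style tokenizer to illustrate how a BERT-like model
# splits each whitespace token into sub-tokens from a fixed vocabulary.
# (Illustrative only -- this tiny vocabulary is invented, not BERT's.)
VOCAB = {"play", "##ing", "foot", "##ball", "the"}

def wordpiece(token, vocab=VOCAB):
    """Greedy longest-match-first sub-token split, as WordPiece does."""
    subtokens, start = [], 0
    while start < len(token):
        end = len(token)
        while end > start:
            piece = token[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces are marked
            if piece in vocab:
                subtokens.append(piece)
                break
            end -= 1
        else:
            return ["[UNK]"]  # no vocabulary piece matched at all
        start = end
    return subtokens

# A whitespace tokenizer hands over whole words; the featurizer's
# LM tokenizer would then re-split them into model sub-tokens.
print([wordpiece(t) for t in "playing football".split()])
# [['play', '##ing'], ['foot', '##ball']]
```

So is this roughly what happens inside the LanguageModelFeaturizer, with the sub-token features then aligned back to the original tokens?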
Why does the pipeline require a tokenizer before the LanguageModelFeaturizer rather than just using the tokenizer built into the language model? And if it does, what criteria should be used to select that tokenizer?
I’m also interested to know: the LanguageModelFeaturizer requires a Tokenizer component. If there is already an LM tokenizer inside the LanguageModelFeaturizer, why do we need to specify a separate tokenizer?
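For concreteness, this is the kind of pipeline I mean (a sketch in the Rasa 2.x config style; the `model_weights` value is just an example, adjust for your own setup):

```yaml
pipeline:
  - name: WhitespaceTokenizer        # required tokenizer component
  - name: LanguageModelFeaturizer    # has its own LM tokenizer internally
    model_name: "bert"
    model_weights: "bert-base-uncased"
```

Here the WhitespaceTokenizer seems redundant with the featurizer's internal tokenization, which is what prompts the question.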