Correct tokenizer for BERT (rasa/LaBSE)

I understand that when using pretrained models like BERT you have to use the exact same tokenizer the model was trained with. For BERT this is the WordPiece tokenizer; presumably the same is true for rasa/LaBSE (?)

In version 2.x of Rasa Open Source this tokenizer seems to have been fetched automatically by the HFTransformersNLP component. Since version 3.x, the replacement component, LanguageModelFeaturizer, doesn't seem to do this anymore. The docs say: "Include a Tokenizer component before this component."

On the other hand, the docs go on to say that models can only be used if "The model uses the default tokenizer", which seems to indicate that this default tokenizer is used by the Rasa pipeline irrespective of the selected Tokenizer component.

Could anyone please clarify

  1. Whether the Tokenizer component in the pipeline is used when BERT (rasa/LaBSE) is configured via LanguageModelFeaturizer,
  2. What tokenizer needs to be used and
  3. How that can be achieved?

Thanks so much!

Hi! Looking at the source code for lm_featurizer in NLU's dense featurizer module, we can see that when the LanguageModelFeaturizer component is loaded in the pipeline, Rasa internally downloads the language model's own tokenizer automatically, using the Hugging Face SDK.
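
For illustration, here's a simplified sketch of what that internal loading amounts to, using the transformers library directly (this is not Rasa's actual code, just the same mechanism):

```python
# Simplified sketch (not Rasa's actual code) of what
# LanguageModelFeaturizer does internally: it pulls the tokenizer
# matching the configured model weights from the Hugging Face hub.
from transformers import AutoTokenizer

# "rasa/LaBSE" is the weights identifier used for LaBSE on the hub.
tokenizer = AutoTokenizer.from_pretrained("rasa/LaBSE")

# The featurizer splits each Rasa token into the model's own sub-tokens,
# so the model always sees its native WordPiece vocabulary:
print(tokenizer.tokenize("tokenizer"))  # WordPiece sub-tokens
```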

So there's no need to add any specific tokenizer to use LanguageModelFeaturizer; just include at least one tokenizer so the config file is valid and you're fine. A minimal config along those lines is sketched below.
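
For reference, this is a sketch of what such a 3.x pipeline could look like, based on the documented LanguageModelFeaturizer options (DIETClassifier and the epochs value are just placeholders for whatever downstream components you use):

```yaml
# config.yml (sketch): WhitespaceTokenizer only satisfies the
# "Include a Tokenizer component" requirement; the featurizer
# fetches the matching WordPiece tokenizer for rasa/LaBSE itself.
language: en
pipeline:
  - name: WhitespaceTokenizer
  - name: LanguageModelFeaturizer
    model_name: "bert"           # model architecture
    model_weights: "rasa/LaBSE"  # pretrained weights from the HF hub
  - name: DIETClassifier         # downstream classifier, shown for context
    epochs: 100
```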


Thanks so much for looking into it!
