Hugging Face custom Tokenizer

I was unable to find a working solution on the Rasa Forum for implementing the tokenizer associated with Hugging Face model weights. For example, my config:

recipe: default.v1
language: ru
pipeline:
- name: WhitespaceTokenizer
- name: LanguageModelFeaturizer
  model_weights: "ai-forever/sbert_large_mt_nlu_ru"
  model_name: "bert"
- name: RegexFeaturizer
- name: LexicalSyntacticFeaturizer
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 2
  max_ngram: 5
- name: CountVectorsFeaturizer
  analyzer: char
  min_ngram: 3
  max_ngram: 10
- name: DIETClassifier
  epochs: 75
  constrain_similarities: true
  number_of_transformer_layers: 4
  embedding_dimension: 256
  random_seed: 42
- name: ResponseSelector
  epochs: 100
  constrain_similarities: true
- name: FallbackClassifier
  threshold: 0.7
policies:
- name: MemoizationPolicy
- name: RulePolicy
- name: UnexpecTEDIntentPolicy
  max_history: 5
  epochs: 100
- name: TEDPolicy
  max_history: 5
  epochs: 100
  random_seed: 42
  constrain_similarities: true

It should be LanguageModelTokenizer, but that component has been deprecated in Rasa 3.x. On top of that, LanguageModelFeaturizer requires a Tokenizer to be placed before it in the pipeline (see the Rasa Tokenizer Documentation). Why is it necessary to specify an additional tokenizer when the LanguageModelFeaturizer already includes an LM tokenizer?

Think of it this way: Rasa needs two steps to understand text:

  1. A tokenizer like WhitespaceTokenizer breaks the text down into smaller pieces (word-level tokens) that the rest of the pipeline works with.

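  2. The LanguageModelFeaturizer then applies the language model's own sub-word tokenizer to each of these pieces and turns them into dense feature vectors, so that every feature still lines up with a token from step 1.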

It’s like having a helper who first cuts up your food into bite-sized pieces before you eat it. Both steps are important for Rasa to learn from the text properly.
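
To make the two steps more concrete, here is a minimal toy sketch in Python. It is not Rasa's actual code: `toy_subword_split` and `toy_embed` are hypothetical stand-ins for the Hugging Face sub-word tokenizer and the language model, and averaging the sub-word vectors back into one vector per token is my assumption about how the alignment works. The point is only to show why the word-level tokens from step 1 are still needed.

```python
from typing import List, Tuple


def whitespace_tokenize(text: str) -> List[str]:
    """Step 1: word-level tokens (the role WhitespaceTokenizer plays)."""
    return text.split()


def toy_subword_split(token: str, piece_len: int = 4) -> List[str]:
    """Hypothetical stand-in for the model's sub-word tokenizer.

    Cuts a token into fixed-size pieces and marks continuation
    pieces with '##', similar in spirit to WordPiece output.
    """
    pieces = [token[i:i + piece_len] for i in range(0, len(token), piece_len)]
    return [pieces[0]] + ["##" + p for p in pieces[1:]]


def toy_embed(piece: str) -> List[float]:
    """Hypothetical stand-in for the language model: a 2-d vector per piece."""
    return [float(len(piece)), float(sum(map(ord, piece)) % 7)]


def featurize(text: str) -> List[Tuple[str, List[str], List[float]]]:
    """Step 2: one feature vector per word-level token.

    Each token is split into sub-word pieces, every piece is embedded,
    and the piece vectors are averaged (assumed alignment strategy) so
    the result has exactly one vector per token from step 1.
    """
    features = []
    for token in whitespace_tokenize(text):
        pieces = toy_subword_split(token)
        vectors = [toy_embed(p) for p in pieces]
        mean = [sum(column) / len(vectors) for column in zip(*vectors)]
        features.append((token, pieces, mean))
    return features
```

Calling `featurize("привет мир")` returns two entries, one per whitespace token, even though the first token is split into more than one sub-word piece internally. Components that work per token, such as entity extraction, rely on that one-to-one alignment.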

I hope my suggestion will be helpful for everyone.