I was unable to find a working solution on the Rasa Forum for using the tokenizer that belongs to a set of Hugging Face model weights. For example, here is my config:
recipe: default.v1
language: ru
pipeline:
  - name: WhitespaceTokenizer
  - name: LanguageModelFeaturizer
    model_weights: "ai-forever/sbert_large_mt_nlu_ru"
    model_name: "bert"
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: char_wb
    min_ngram: 2
    max_ngram: 5
  - name: CountVectorsFeaturizer
    analyzer: char
    min_ngram: 3
    max_ngram: 10
  - name: DIETClassifier
    epochs: 75
    constrain_similarities: true
    number_of_transformer_layers: 4
    embedding_dimension: 256
    random_seed: 42
  - name: ResponseSelector
    epochs: 100
    constrain_similarities: true
  - name: FallbackClassifier
    threshold: 0.7
policies:
  - name: MemoizationPolicy
  - name: RulePolicy
  - name: UnexpecTEDIntentPolicy
    max_history: 5
    epochs: 100
  - name: TEDPolicy
    max_history: 5
    epochs: 100
    random_seed: 42
    constrain_similarities: true
The tokenizer should be LanguageModelTokenizer, but it has been deprecated in Rasa 3.x. On top of that, LanguageModelFeaturizer requires a tokenizer earlier in the pipeline (see the Rasa Tokenizer Documentation). Why is it necessary to specify an additional tokenizer when the LanguageModelFeaturizer already includes an LM tokenizer?
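To make the question concrete, here is a rough sketch of how I imagine the two tokenization levels could fit together. It is only my working assumption, not something I have confirmed in the Rasa source: the pipeline tokenizer produces word-level tokens, the LM's own tokenizer splits each word into sub-word pieces, and the piece vectors are averaged back to one vector per word. Everything in it is hypothetical: the toy vocabulary, the 2-d vectors, and the function names are made up for illustration and are not Rasa or Hugging Face APIs.

```python
from typing import Dict, List, Tuple

# Hypothetical toy sub-word vocabulary with made-up 2-d "embeddings".
# A real language model ships its own vocabulary and much larger vectors.
TOY_EMBEDDINGS: Dict[str, List[float]] = {
    "при": [1.0, 0.0],
    "##вет": [3.0, 2.0],
    "мир": [0.0, 4.0],
    "[UNK]": [0.0, 0.0],
}


def whitespace_tokenize(text: str) -> List[str]:
    """Word-level step: what a whitespace tokenizer in the pipeline would give."""
    return text.split()


def toy_subword_tokenize(word: str, vocab: Dict[str, List[float]]) -> List[str]:
    """Sub-word step: greedy longest-match split in the style of WordPiece."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no known piece covers this position
        pieces.append(piece)
        start = end
    return pieces


def word_level_features(text: str) -> List[Tuple[str, List[str], List[float]]]:
    """Pool sub-word vectors back to one mean vector per word-level token."""
    features = []
    for word in whitespace_tokenize(text):
        pieces = toy_subword_tokenize(word, TOY_EMBEDDINGS)
        vectors = [TOY_EMBEDDINGS[p] for p in pieces]
        mean = [sum(dim) / len(vectors) for dim in zip(*vectors)]
        features.append((word, pieces, mean))
    return features


if __name__ == "__main__":
    for word, pieces, vector in word_level_features("привет мир"):
        print(word, pieces, vector)
```

If this picture is right, the extra tokenizer would only define the word boundaries that the rest of the pipeline (for example entity extraction in DIETClassifier) works with, while the LM tokenizer stays internal to the featurizer. I would appreciate confirmation or correction.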