Hugging Face custom Tokenizer

sokoloveav · February 22, 2024, 7:22pm

I was unable to find a working solution on the Rasa Forum for implementing the tokenizer associated with Hugging Face model weights. For example, my config:

recipe: default.v1
language: ru
pipeline:
- name: WhitespaceTokenizer
- name: LanguageModelFeaturizer
  model_weights: "ai-forever/sbert_large_mt_nlu_ru"
  model_name: "bert"
- name: RegexFeaturizer
- name: LexicalSyntacticFeaturizer
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 2
  max_ngram: 5
- name: CountVectorsFeaturizer
  analyzer: char
  min_ngram: 3
  max_ngram: 10
- name: DIETClassifier
  epochs: 75
  constrain_similarities: true
  number_of_transformer_layers: 4
  embedding_dimension: 256
  random_seed: 42
- name: ResponseSelector
  epochs: 100
  constrain_similarities: true
- name: FallbackClassifier
  threshold: 0.7
policies:
- name: MemoizationPolicy
- name: RulePolicy
- name: UnexpecTEDIntentPolicy
  max_history: 5
  epochs: 100
- name: TEDPolicy
  max_history: 5
  epochs: 100
  random_seed: 42
  constrain_similarities: true

The natural choice would be LanguageModelTokenizer, but it has been deprecated in Rasa 3.x. At the same time, LanguageModelFeaturizer still requires a tokenizer earlier in the pipeline (see the Rasa Tokenizer documentation). Why is it necessary to specify an additional tokenizer when the LanguageModelFeaturizer already includes the language model's own tokenizer?

JamieC1 · March 23, 2024, 5:31am

Think of it this way: Rasa needs two steps to understand text:

A tokenizer like WhitespaceTokenizer breaks the text down into smaller pieces (tokens).
The LanguageModelFeaturizer then applies the model's own sub-word tokenization to these pieces and turns them into features.

It’s like having a helper who first cuts up your food into bite-sized pieces before you eat it. Both steps are important for Rasa to learn from the text properly.
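
To make the two steps concrete, here is a rough sketch in plain Python. It is not Rasa's actual code: `toy_subword_split` and `toy_embed` are hypothetical stand-ins for the model's sub-word tokenizer and the BERT embeddings, and averaging the sub-word vectors per token is an assumption about how the pieces are combined. The point it illustrates is that the output stays aligned to the tokens from the first step, which is what later components work with.

```python
from typing import List, Tuple


def whitespace_tokenize(text: str) -> List[str]:
    """Step 1: stand-in for WhitespaceTokenizer (splits on whitespace only)."""
    return text.split()


def toy_subword_split(token: str, piece_len: int = 3) -> List[str]:
    """Hypothetical sub-word splitter; a real model would use its own vocabulary."""
    pieces = [token[i:i + piece_len] for i in range(0, len(token), piece_len)]
    # Mark continuation pieces with "##", in the style of WordPiece output.
    return [pieces[0]] + ["##" + p for p in pieces[1:]]


def toy_embed(piece: str, dim: int = 4) -> List[float]:
    """Hypothetical embedding: deterministic numbers derived from the characters."""
    total = sum(ord(c) for c in piece)
    return [float(total % (7 + k)) for k in range(dim)]


def featurize(text: str) -> List[Tuple[str, List[str], List[float]]]:
    """Step 2: split each token into sub-words, embed them, pool back per token."""
    features = []
    for token in whitespace_tokenize(text):
        pieces = toy_subword_split(token)
        vectors = [toy_embed(p) for p in pieces]
        # Assumption: one vector per original token, taken as the mean of its pieces.
        pooled = [sum(col) / len(vectors) for col in zip(*vectors)]
        features.append((token, pieces, pooled))
    return features


if __name__ == "__main__":
    for token, pieces, vector in featurize("закажи пиццу домой"):
        print(token, pieces, vector)
```

Each input token ends up with exactly one feature vector, no matter how many sub-word pieces it was split into.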

JamieC1 · March 26, 2024, 4:40am

I hope my suggestion will be helpful for everyone.
