Hi @ganbaa_elmer, I think the error you're seeing comes from the way the model name and weights are mapped to the corresponding Hugging Face classes. I tested this with Rasa version 3.0.2 and the config
```yaml
- name: LanguageModelFeaturizer
  model_name: bert
  model_weights: tugstugi/bert-base-mongolian-uncased
```
and am getting an error as well. If you're using a different Rasa version, the concrete reason might be different though.
According to here, if you specify `model_name: bert`, Rasa tries to initialize a `BertTokenizer` from the given weights (in your case `tugstugi/bert-base-mongolian-uncased`). However, you can check which kind of tokenizer is actually used by this model directly in HF transformers
using

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("tugstugi/bert-base-mongolian-uncased")
print(type(tok))
```
and you get

```
<class 'transformers.models.albert.tokenization_albert_fast.AlbertTokenizerFast'>
```
So there is a mismatch between the tokenizer the model actually uses and the one Rasa tries to load. Since the mapping from model name to tokenizer class is hard-coded, I think it is currently only possible to use BERT models that also use the `BertTokenizer`. This is not transparent from the documentation and hard to see on the HF model hub, so I would suggest opening a ticket to improve the documentation on that.
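To illustrate why the hard-coded mapping breaks here, here is a minimal sketch (my own illustration with stand-in classes, not Rasa's actual source code):

```python
# Stand-in classes representing the real HF tokenizer classes.
class BertTokenizer: ...
class AlbertTokenizerFast: ...

# What Rasa's hard-coded mapping selects from the model name alone:
MODEL_TO_TOKENIZER = {"bert": BertTokenizer}

# What the weights on the Hub were actually trained with
# (as reported by AutoTokenizer above):
WEIGHTS_TO_ACTUAL_TOKENIZER = {
    "tugstugi/bert-base-mongolian-uncased": AlbertTokenizerFast,
}

def check_compatible(model_name: str, model_weights: str) -> bool:
    """True only if the hard-coded tokenizer matches the one the weights need."""
    expected = MODEL_TO_TOKENIZER[model_name]
    actual = WEIGHTS_TO_ACTUAL_TOKENIZER[model_weights]
    return expected is actual

print(check_compatible("bert", "tugstugi/bert-base-mongolian-uncased"))  # False
```

The model name and the weights are configured independently, but only the model name decides the tokenizer class, which is why this combination fails.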
As an alternative, if you're looking for dense embeddings in Mongolian, you could also try the BytePairFeaturizer from rasa-nlu-examples, which has a Mongolian model of dense sub-word embeddings. See here for installation and usage instructions.
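If you want to try that, a pipeline entry along these lines should work; note that the exact parameter values below (vocab size `vs` and dimension `dim`) are assumptions on my part, so please check the rasa-nlu-examples docs for the combinations actually available for Mongolian:

```yaml
pipeline:
  - name: WhitespaceTokenizer
  - name: rasa_nlu_examples.featurizers.dense.BytePairFeaturizer
    lang: mn       # Mongolian BPEmb embeddings
    vs: 10000      # assumed vocab size, verify against the docs
    dim: 100       # assumed embedding dimension, verify against the docs
  - name: DIETClassifier
    epochs: 100
```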