How to train the DIET architecture on a foreign language

Hello, I am trying to train a DIET model on Georgian. Unfortunately spaCy doesn’t support Georgian, and BERT isn’t trained on it either. Fortunately, Hugging Face has a model called XLM-RoBERTa which has its own tokenizer. Can you please tell me if it is possible to replace BERT with XLM-RoBERTa (from Hugging Face)? Can I also use XLM-RoBERTa’s tokenizer instead of spaCy’s?

Could you share the config.yml that you tried to run but that didn’t work? Could you also share the Hugging Face model that you tried to run?

One thing about tokenizers in BERT models … they produce sub-tokens. These don’t represent words; rather, they represent parts of words. A word like geology may be split into sub-tokens such as geo and logy internally before BERT creates the embedding. This is why Rasa’s implementation of BERT models only exposes the embeddings, and not the tokens, to the pipeline.
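To make the sub-token idea concrete, here is a toy sketch of the greedy longest-match splitting that WordPiece-style tokenizers use. The vocabulary below is invented purely for illustration; a real BERT tokenizer learns a vocabulary of roughly 30,000 sub-tokens:

```python
# Toy WordPiece-style tokenizer: greedily match the longest
# vocabulary entry, marking word-internal pieces with "##".
# The vocabulary here is made up purely for illustration.
VOCAB = {"geo", "##logy", "##log", "##y", "play", "##ing"}

def wordpiece(word, vocab=VOCAB):
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it appears in the vocabulary.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        else:
            return ["[UNK]"]  # no matching sub-token found
        start = end
    return pieces

print(wordpiece("geology"))  # -> ['geo', '##logy']
print(wordpiece("playing"))  # -> ['play', '##ing']
```

Because the pieces don’t line up with whole words, DIET can’t consume them as tokens directly, which is why only the resulting embeddings are passed along.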

One extra thing: I maintain a library called rasa-nlu-examples which supports many word embeddings for non-English languages. In particular, the bytepair embeddings are a lightweight alternative to BERT models, and they seem to properly support Georgian. The benefit is that these embeddings are much, much lighter to run.
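For completeness, a pipeline entry for the bytepair embeddings could look roughly like this (please check the rasa-nlu-examples documentation for the exact component path and parameters; "ka" is the language code for Georgian, and the vocabulary size and dimension are just example values):

    - name: rasa_nlu_examples.featurizers.dense.BytePairFeaturizer
      lang: "ka"
      vs: 10000
      dim: 100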

A final thing: although I understand the feeling of “I should add embeddings to my pipeline!”, I might recommend against it. Especially when you’re just starting out, odds are that you won’t have a big enough dataset to run proper benchmarks. In that situation I would recommend first building a basic assistant and showing it to users as soon as possible. The feedback you get from users will be more meaningful in the long run, because you’ll learn about missing intents and get other hints.

Thanks for the reply. I am currently a beginner and trying different things. Your answer helped a lot; I will try your suggestions and report back with the results.

Hello, I wanted to try XLM-RoBERTa from Hugging Face (because it is trained on Georgian), but when I looked at the documentation, it seems that Rasa doesn’t support XLM-RoBERTa. Can you please tell me how to use XLM-RoBERTa from Hugging Face instead of BERT?

Could you share what you tried? Does our LanguageModelFeaturizer not support it?

I tried running my training with:

  - name: LanguageModelFeaturizer
    model_name: "xlm-roberta-base"

But I get this error: KeyError: "'xlm-roberta-base' not a valid model name. Choose from ['bert', 'gpt', 'gpt2', 'xlnet', 'distilbert', 'roberta'] or create a new class inheriting from this class to support your model."

The model_name refers to an architecture. I think in your case you’d like to run a roberta model, and in particular … the one with xlm-roberta-base trained weights.

I’m not that familiar with Hugging Face models, but could you try using:

  - name: LanguageModelFeaturizer
    model_name: "roberta"
    model_weights: "xlm-roberta-base"

Also, could you explain why you feel like you really need these BERT style featurizers? Do you have enough train/test data for a proper benchmark? If not, I’d recommend collecting more data first.

Thanks @koaning, I tried that; xlm-roberta-base is not among the model weights either. I suppose I could try downloading the weights and passing them as a local path as well.
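In case it helps someone else reading along, pointing model_weights at a locally downloaded copy would presumably look something like this (the path here is just a placeholder):

    - name: LanguageModelFeaturizer
      model_name: "roberta"
      model_weights: "/path/to/xlm-roberta-base"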

We have a decent amount of data, though it would definitely help to get more. At the same time, we are trying to understand and assess all the options we have, including using BERT.

I’ve added an issue on GitHub to clarify these docs; I agree this should be explained better. I’m personally swamped with Rasa 3.0 release work at the moment, but I hope to be able to address it later in December.