How to train the DIET architecture on a foreign language

Hello, I am trying to train a DIET model on Georgian. Unfortunately spaCy doesn’t support Georgian, and BERT isn’t trained on it either. Fortunately, Hugging Face has a model called XLM-RoBERTa which has its own tokenizer. Can you please tell me if it is possible to replace BERT with XLM-RoBERTa (from Hugging Face)? Can I also use XLM-RoBERTa’s tokenizer instead of spaCy’s?

Could you share the config.yml that you tried to run but that didn’t work? Could you also share the Hugging Face model that you tried to run?

One thing about tokenizers in BERT models … they produce sub-tokens. These don’t represent words; rather, they represent parts of words. A word like geology may be split into sub-tokens such as geo and logy internally before BERT creates the embedding. This is why Rasa’s implementation of BERT models only exposes the embeddings, and not the tokens, to the pipeline.
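To make the sub-token idea concrete, here is a toy sketch of the greedy longest-match splitting that WordPiece-style tokenizers use. The vocabulary below is invented purely for illustration; a real BERT tokenizer learns a vocabulary of roughly 30,000 sub-tokens:

```python
# Toy WordPiece-style tokenizer: greedily match the longest
# vocabulary entry, marking word-internal pieces with "##".
# The vocabulary here is made up purely for illustration.
VOCAB = {"geo", "##logy", "##log", "##y", "play", "##ing"}

def wordpiece(word, vocab=VOCAB):
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it appears in the vocabulary.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        else:
            return ["[UNK]"]  # no matching sub-token found
        start = end
    return pieces

print(wordpiece("geology"))  # -> ['geo', '##logy']
print(wordpiece("playing"))  # -> ['play', '##ing']
```

Because the pieces don’t line up with whole words, DIET can’t consume them as tokens directly, which is why only the resulting embeddings are passed along.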

One extra thing: I maintain a library called rasa-nlu-examples which supports many word embeddings for non-English languages. In particular, the bytepair embeddings are a lightweight alternative to BERT models, and they seem to properly support Georgian. The benefit is that these embeddings are much, much lighter to run.
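For completeness, a pipeline entry for the bytepair embeddings could look roughly like this (please check the rasa-nlu-examples documentation for the exact component path and parameters; "ka" is the language code for Georgian, and the vocabulary size and dimension are just example values):

    - name: rasa_nlu_examples.featurizers.dense.BytePairFeaturizer
      lang: "ka"
      vs: 10000
      dim: 100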

A final thing: although I understand the feeling of “I should add embeddings to my pipeline!”, I might recommend against it. Especially when you’re just starting out, odds are that you won’t have a big enough dataset to run proper benchmarks. In that situation I would recommend first building a basic assistant and showing it to users as soon as possible. The feedback you get from users will be more meaningful in the long run, because you’ll learn about missing intents and get other hints.

Thanks for the reply. I am currently a beginner and trying different things. Your answer helped a lot; I will try your suggestions and report back with the results.

Hello, I wanted to try XLM-RoBERTa from Hugging Face (because it is trained on Georgian), but when I looked at the documentation, it seems that Rasa doesn’t support XLM-RoBERTa. Can you please tell me how to use XLM-RoBERTa from Hugging Face instead of BERT?

Could you share what you tried? Does our LanguageModelFeaturizer not support it?

I tried running my training with:

  - name: LanguageModelFeaturizer
    model_name: "xlm-roberta-base"

But I get this error: KeyError: "'xlm-roberta-base' not a valid model name. Choose from ['bert', 'gpt', 'gpt2', 'xlnet', 'distilbert', 'roberta'] or create a new class inheriting from this class to support your model."

The model_name refers to an architecture. I think in your case you’d like to run a roberta model, and in particular … the one with xlm-roberta-base trained weights.

I’m not that familiar with Hugging Face models, but could you try using:

  - name: LanguageModelFeaturizer
    model_name: "roberta"
    model_weights: "xlm-roberta-base"

Also, could you explain why you feel like you really need these BERT style featurizers? Do you have enough train/test data for a proper benchmark? If not, I’d recommend collecting more data first.

Thanks @koaning, I tried that; xlm-roberta-base is not among the model weights either. I suppose I could try downloading the weights and passing them as a local path as well.
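In case it helps someone else reading along, pointing model_weights at a locally downloaded copy would presumably look something like this (the path here is just a placeholder):

    - name: LanguageModelFeaturizer
      model_name: "roberta"
      model_weights: "/path/to/xlm-roberta-base"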

We have a decent amount of data, though it would definitely help to get more. At the same time, we are trying to understand and assess all the options we have, including using BERT.

I’ve added an issue on GitHub to clarify these docs; I agree this should be explained better. I’m personally swamped with Rasa 3.0 release work at the moment, but I hope to be able to address it later in December.