Usage of pre-trained BytePairEmbeddings and BERT embeddings in Rasa

Hi there, I’m looking into how to use pre-trained embeddings in my NLU pipeline. (Note: I’m using Rasa 1.10.24.)

First I tried BERT (which gives me the error message way below) and then BPE, but BPE relies on rasa-nlu-examples, which is a problem for me. The documentation says:

Cached Usage

If you’re using pre-downloaded embedding files (in docker you might have this on a mounted disk) then you can prevent a download from happening. We’ll be doing that in the example below.

language: en

pipeline:
- name: WhitespaceTokenizer
- name: LexicalSyntacticFeaturizer
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: rasa_nlu_examples.featurizers.dense.BytePairFeaturizer
  lang: en
  vs: 10000
  dim: 100
  cache_dir: "tests/data"
- name: DIETClassifier
  epochs: 100

Note that in this case we expect two files to be present in the tests/data directory:

  • en.wiki.bpe.vs10000.d100.w2v.bin
  • en.wiki.bpe.vs10000.model

I’ve applied this pipeline, but it requires rasa_nlu_examples to be installed, which is not a lightweight dependency. My questions are:

  • Is there an easier way of doing this?
  • We have a separate repository where we do the API work; I just take the trained model there and it’s done. Do I have to add rasa_nlu_examples to requirements.txt in production? (I find it overkill, and there are too many dependency conflicts, which cause too much technical debt imo.)
  • I couldn’t find a compatibility matrix, but when I did a pip install rasa-nlu-examples it pulled in rasa 2.7.1, which is something I don’t want.
  • I couldn’t use BERT embeddings simply because the list of supported models is way too limited. Rasa told me it couldn’t find the model even though I gave it the link to the model on Hugging Face. I’ve seen many people encountering the same problem. Has it been solved? I get this error:

2021-06-29 20:46:54 ERROR transformers.tokenization_utils - Model name 'asafaya/bert-base-arabic at main' was not found in model name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased, bert-large-uncased-whole-word-masking, bert-large-cased-whole-word-masking, bert-large-uncased-whole-word-masking-finetuned-squad, bert-large-cased-whole-word-masking-finetuned-squad, bert-base-cased-finetuned-mrpc). We assumed 'asafaya/bert-base-arabic at main' was a path or url but couldn't find tokenizer files at this path or url.

I also tried with name: "asafaya/bert-base-arabic" and "bert-base-arabic", and it still couldn’t find the model, even though the tokenizer and model files are there on the hub. There’s no spaCy model for Arabic and fastText embeddings are too heavy. I’m trying to find a solution here because my bot is too generic and it desperately needs pre-trained embeddings. Any help is appreciated.
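
For reference, the BERT part of my pipeline follows the Rasa 1.10 docs and looks roughly like this; the model_weights value is the piece that triggers the error above, and the surrounding components are the standard HFTransformersNLP setup:

pipeline:
- name: HFTransformersNLP
  model_name: "bert"
  model_weights: "asafaya/bert-base-arabic"   # the Hugging Face model ID that can't be resolved
- name: LanguageModelTokenizer
- name: LanguageModelFeaturizer
- name: DIETClassifier
  epochs: 100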

Pinging @koaning here as he has experience with various embeddings. 🙂


I’ve already seen this and applied it, and it doesn’t answer my question. Please read it carefully.

If you’re only using the bytepair embeddings you can use the base installation of rasa-nlu-examples, which is relatively lightweight considering you already have Rasa installed. It only adds rich, gensim and bpemb; the other dependencies are all optional.

That said, if you insist on keeping it lightweight and don’t want to force a Rasa version, you can also just copy the bpemb_featurizer.py file locally and refer to it from your config. If you have bpemb_featurizer.py in the root of your project folder, you can use it in your config.yml via:

- name: bpemb_featurizer.BytePairFeaturizer
  lang: en
  vs: 10000
  dim: 100

Note that there were some internal changes to Rasa when we moved to 2.x. If you want to have a 1.x compatible component you’ll need to find it in the GitHub releases.

Just to confirm, you’re trying to use the LanguageModelFeaturizer here? I’m a bit confused since the pipeline you shared earlier is for English while the error suggests Arabic. Note that for multi-lingual situations I might recommend LaBSE. There’s a video on how it works here and it’s documented here.
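
For reference, a LaBSE setup in Rasa 2.x would look roughly like this (in 2.x the LanguageModelFeaturizer loads the weights directly; the pipeline around it is just a sketch):

pipeline:
- name: WhitespaceTokenizer
- name: LanguageModelFeaturizer
  model_name: "bert"
  model_weights: "rasa/LaBSE"   # multilingual LaBSE weights from the Hugging Face hub
- name: DIETClassifier
  epochs: 100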

I shared that “en” pipeline just to quote the documentation. I’m actually trying to use Arabic (and honestly, I don’t know whether sub-word tokenization would work for it). That’s why I’d rather use BERT. I’ll check LaBSE out.

BytePair supports Arabic though. Simply set the language to ar. If you’re interested in spelling robustness I’d keep the vocabulary size small. I’d start with 1000, maybe 3000.
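
For example, the featurizer entry would look something like this; the vocabulary size and dimension are just the small starting values I’d try first:

- name: rasa_nlu_examples.featurizers.dense.BytePairFeaturizer
  lang: ar      # Arabic BPEmb embeddings
  vs: 1000      # small vocabulary for spelling robustness
  dim: 100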

I had the intuition that BytePair mostly works for languages like English or Turkish (which has tons of suffixes) but not for languages like Chinese, where characters act as words and combine to make entirely different words. Arabic works a bit differently: there are stems, for example k-t-b (to write) is a stem that gives “kitab” (book) and “katib” (writer), and you can conjugate it to “yaktaba” and so on. (I learned this fancy stuff thanks to my job.) I wonder how BPE would perform in this kind of language.

Last question: Is there a compatibility matrix for rasa-nlu-examples?

There’s not really a compatibility matrix. We’re mainly aiming to support the latest version of Rasa on that project.

In case you’re interested, there’s an Arabic benchmarks article on the whatlies docs where we compare bytepair embeddings with a huggingface model for a sentiment analysis task.

Yes, I realized that, and we currently use 1.10.24, so I couldn’t use BPE (the component imports modules like rasa.shared that don’t exist in 1.x). Instead, I’m trying to download the BERT models and use cached versions. (Rasa can’t find the models by name, which brings me back to my previous issue here.)

Even the zip file over here doesn’t have a compatible component?

Oh, I didn’t know it was backwards compatible; I’ll check. I did solve my problem by using a cached copy of bert-base-arabic.
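
In case it helps anyone else, the cached setup I ended up with looks roughly like this; the folder name is just my own layout, and it needs to contain the tokenizer and model files downloaded from the hub:

- name: HFTransformersNLP
  model_name: "bert"
  model_weights: "pre_trained_models/bert-base-arabic"   # local folder with config.json, vocab.txt and the weights
- name: LanguageModelTokenizer
- name: LanguageModelFeaturizer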

It’s semi-backwards compatible. We don’t throw away old versions, but we don’t support the old versions either.