Usage of pre-trained BytePairEmbeddings and BERT embeddings in Rasa

Hi there, I’m looking into how to use pre-trained embeddings in my NLU pipeline. (Note: I’m using Rasa 1.10.24.)

First I tried BERT (which gives me the error message way below) and then BPE, but BPE relies on rasa-nlu-examples, which is a problem for me. The documentation says:

Cached Usage

If you’re using pre-downloaded embedding files (in docker you might have this on a mounted disk) then you can prevent a download from happening. We’ll be doing that in the example below.

language: en

pipeline:
- name: WhitespaceTokenizer
- name: LexicalSyntacticFeaturizer
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: rasa_nlu_examples.featurizers.dense.BytePairFeaturizer
  lang: en
  vs: 10000
  dim: 100
  cache_dir: "tests/data"
- name: DIETClassifier
  epochs: 100

Note that in this case we expect two files to be present in the tests/data directory:

  • en.wiki.bpe.vs10000.d100.w2v.bin
  • en.wiki.bpe.vs10000.model

I’ve applied this pipeline, but it requires rasa_nlu_examples to be installed, which is not a lightweight dependency. My questions are:

  • Is there an easier way of doing this?
  • We have a separate repository where we do the API work; I just take the trained model there and it’s done. Do I have to add rasa_nlu_examples to requirements.txt in production? (I find it overkill, and there are too many dependency conflicts, which cause too much technical debt imo.)
  • I couldn’t find a compatibility matrix, but when I did a pip install rasa-nlu-examples it pulled in rasa 2.7.1, which is something I don’t want.
  • I couldn’t use BERT embeddings simply because the list of supported models is way too limited. Rasa told me it couldn’t find the model even though I gave it the link to the model on Hugging Face. I’ve seen many people encountering the same problem. Has it been solved? I get this error:

2021-06-29 20:46:54 ERROR transformers.tokenization_utils - Model name 'asafaya/bert-base-arabic at main' was not found in model name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased, bert-large-uncased-whole-word-masking, bert-large-cased-whole-word-masking, bert-large-uncased-whole-word-masking-finetuned-squad, bert-large-cased-whole-word-masking-finetuned-squad, bert-base-cased-finetuned-mrpc). We assumed 'asafaya/bert-base-arabic at main' was a path or url but couldn't find tokenizer files at this path or url.

I also tried with name: "asafaya/bert-base-arabic" and "bert-base-arabic", and it still couldn’t find the model, even though the tokenizer and model files are there on the hub. There’s no spaCy model for Arabic and fastText embeddings are too heavy. I’m trying to find a solution here because my bot is too generic and it desperately needs pre-trained embeddings. Any help is appreciated.
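
For reference, the BERT part of my pipeline follows the Rasa 1.10 docs and looks roughly like this; the model_weights value is the piece that triggers the error above, and the surrounding components are the standard HFTransformersNLP setup:

pipeline:
- name: HFTransformersNLP
  model_name: "bert"
  model_weights: "asafaya/bert-base-arabic"   # the Hugging Face model ID that can't be resolved
- name: LanguageModelTokenizer
- name: LanguageModelFeaturizer
- name: DIETClassifier
  epochs: 100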

Pinging @koaning here as he has experience with various embeddings. 🙂


I’ve already seen this and applied it, and it doesn’t answer my question. Please read it carefully.

If you’re only using the bytepair embeddings you can use the base installation of rasa-nlu-examples, which is relatively lightweight considering you already have Rasa installed. It only adds rich, gensim and bpemb; the other dependencies are all optional.

That said, if you insist on keeping it lightweight and don’t want to force a Rasa version, you can also just copy the bpemb_featurizer.py file locally and refer to it from your config. If you have bpemb_featurizer.py in the root of your project folder, you can use it in your config.yml via:

- name: bpemb_featurizer.BytePairFeaturizer
  lang: en
  vs: 10000
  dim: 100

Note that there were some internal changes to Rasa when we moved to 2.x. If you want to have a 1.x compatible component you’ll need to find it in the GitHub releases.

Just to confirm, you’re trying to use the LanguageModelFeaturizer here? I’m a bit confused since the pipeline you shared earlier is for English while the error suggests Arabic. Note that for multi-lingual situations I might recommend LaBSE. There’s a video on how it works here and it’s documented here.
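
For reference, a LaBSE setup in Rasa 2.x would look roughly like this (in 2.x the LanguageModelFeaturizer loads the weights directly; the pipeline around it is just a sketch):

pipeline:
- name: WhitespaceTokenizer
- name: LanguageModelFeaturizer
  model_name: "bert"
  model_weights: "rasa/LaBSE"   # multilingual LaBSE weights from the Hugging Face hub
- name: DIETClassifier
  epochs: 100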

I shared that “en” pipeline just to quote the documentation. I’m actually trying to use Arabic (and honestly, I don’t know whether sub-word tokenization would work for it). That’s why I’d rather use BERT. I’ll check LaBSE out.

BytePair supports Arabic though. Simply set the language to ar. If you’re interested in spelling robustness I’d keep the vocabulary size small. I’d start with 1000, maybe 3000.
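
For example, the featurizer entry would look something like this; the vocabulary size and dimension are just the small starting values I’d try first:

- name: rasa_nlu_examples.featurizers.dense.BytePairFeaturizer
  lang: ar      # Arabic BPEmb embeddings
  vs: 1000      # small vocabulary for spelling robustness
  dim: 100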

I had the intuition that BytePair mostly works for languages like English or Turkish (which has tons of suffixes) but not for languages like Chinese, where characters act as words and combine to make entirely different words. Arabic works a bit differently: there are stems, for example k-t-b (to write) is a stem that gives “kitab” (book) and “katib” (writer), and you can conjugate it to “yaktaba” and so on. (I learned this fancy stuff thanks to my job.) I wonder how BPE would perform in this kind of language.

Last question: Is there a compatibility matrix for rasa-nlu-examples?

There’s not really a compatibility matrix. We’re mainly aiming to support the latest version of Rasa on that project.

In case you’re interested, there’s an Arabic benchmarks article on the whatlies docs where we compare bytepair embeddings with a huggingface model for a sentiment analysis task.

Yes, I realized that, and we currently use 1.10.24, so I couldn’t use BPE (the component imports modules like rasa.shared that don’t exist in 1.x). Instead, I’m trying to download the BERT models and use cached versions. (Rasa can’t find the models by name, which brings me back to my previous issue here.)

Even the zip file over here doesn’t have a compatible component?

Oh, I didn’t know it was backwards compatible; I’ll check. I did solve my problem by using a cached copy of bert-base-arabic.
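
In case it helps anyone else, the cached setup I ended up with looks roughly like this; the folder name is just my own layout, and it needs to contain the tokenizer and model files downloaded from the hub:

- name: HFTransformersNLP
  model_name: "bert"
  model_weights: "pre_trained_models/bert-base-arabic"   # local folder with config.json, vocab.txt and the weights
- name: LanguageModelTokenizer
- name: LanguageModelFeaturizer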

It’s semi-backwards compatible. We don’t throw away old versions, but we don’t support the old versions either.