Hi there, I’m trying to figure out how to use pre-trained embeddings in my NLU pipeline (note: I’m using Rasa 1.10.24).
First I tried BERT (I get the error message further below), then BPE, but BPE relies on rasa-nlu-examples, which is a problem for me. The documentation says:
If you’re using pre-downloaded embedding files (in docker you might have this on a mounted disk) then you can prevent a download from happening. We’ll be doing that in the example below.
```yaml
language: en

pipeline:
- name: WhitespaceTokenizer
- name: LexicalSyntacticFeaturizer
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: rasa_nlu_examples.featurizers.dense.BytePairFeaturizer
  lang: en
  vs: 10000
  dim: 100
  cache_dir: "tests/data"
- name: DIETClassifier
  epochs: 100
```
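For context, my understanding is that the `char_wb` analyzer used by the CountVectorsFeaturizer builds character n-grams inside word boundaries, padding each word with spaces. A rough stdlib sketch of the idea (my own simplification, not Rasa’s actual implementation):

```python
def char_wb_ngrams(text, min_n=1, max_n=4):
    """Rough sketch of "char_wb"-style n-grams: pad each word with
    spaces, then slide windows of length min_n..max_n inside it."""
    grams = []
    for word in text.split():
        padded = f" {word} "
        for n in range(min_n, max_n + 1):
            for i in range(len(padded) - n + 1):
                grams.append(padded[i:i + n])
    return grams

print(char_wb_ngrams("hi", max_n=2))
# → [' ', 'h', 'i', ' ', ' h', 'hi', 'i ']
```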
Note that in this case we expect two files to be present in the cache_dir.
I’ve applied this pipeline, but it requires rasa_nlu_examples to be installed, which is not a lightweight dependency. My questions are:
- Is there an easier way of doing this?
- We have a separate repository where we do the API work; I copy the trained model there and it’s done. Do I have to add rasa_nlu_examples to requirements.txt in production? (I find that overkill, and there are already too many dependency conflicts causing too much technical debt, imo.)
- I couldn’t find a compatibility matrix, but when I ran `pip install rasa-nlu-examples` it pulled in rasa 2.7.1 for me, which I don’t want.
- I couldn’t use BERT embeddings, apparently because the supported models are too limited. Rasa told me it couldn’t find the model even though I pointed it at the model on Hugging Face. I’ve seen many people hit the same problem. Has this been solved? This is the error I get:
```
2021-06-29 20:46:54 ERROR transformers.tokenization_utils - Model name 'asafaya/bert-base-arabic at main' was not found in model name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased, bert-large-uncased-whole-word-masking, bert-large-cased-whole-word-masking, bert-large-uncased-whole-word-masking-finetuned-squad, bert-large-cased-whole-word-masking-finetuned-squad, bert-base-cased-finetuned-mrpc). We assumed 'asafaya/bert-base-arabic at main' was a path or url but couldn't find tokenizer files at this path or url.
```
I also tried with name: "asafaya/bert-base-arabic" and "bert-base-arabic", and it still couldn’t find the model, even though the tokenizer and model files are there on the Hub. There’s no spaCy model for Arabic, and fastText embeddings are too heavy. I’m trying to find a solution here because my bot is too generic and it desperately needs pre-trained embeddings. Any help is appreciated.
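For reference, this is the kind of config I was expecting to work with Rasa’s built-in HFTransformersNLP component. The model_weights value is my assumption; whether the transformers version pinned by Rasa 1.10 can resolve Hub names containing a slash is exactly what I’m unsure about:

```yaml
language: ar

pipeline:
- name: HFTransformersNLP
  model_name: "bert"
  model_weights: "asafaya/bert-base-arabic"
- name: LanguageModelTokenizer
- name: LanguageModelFeaturizer
- name: DIETClassifier
  epochs: 100
```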
Pinging @koaning here as he has experience with various embeddings.