Pre-trained BERT embeddings weight shape inconsistency

Hi there,

I’m trying to use AraBERTv2 from Hugging Face. When I give the model name directly, Rasa can’t find it, so I downloaded the model files and I’m currently loading the model from cache. The model name “bert” comes with a pre-defined shape for its weights, which is inconsistent with AraBERT’s shape and causes the following error when I try to train NLU:

ValueError: Layer #0 (named "bert"), weight <tf.Variable 'tf_bert_model/bert/embeddings/word_embeddings/weight:0' shape=(64000, 768) dtype=float32, numpy=
array([[ 0.01264049,  0.01955802, -0.00695472, ...,  0.02972408,
         0.02346177,  0.02438015],
       [-0.01304033,  0.02966203, -0.00747581, ..., -0.01203315,
        -0.00375808,  0.01051331],
       [-0.00813525, -0.00043884,  0.01635659, ...,  0.00520323,
         0.03812745, -0.02500577],
       ...,
       [ 0.00136639, -0.01157611, -0.00390291, ..., -0.00182731,
        -0.00883772,  0.00762005],
       [-0.00144503,  0.00418852, -0.00485313, ...,  0.00467847,
         0.00246883,  0.03807047],
       [-0.01269101, -0.00266637,  0.0104845 , ...,  0.01443424,
         0.00744028,  0.0108838 ]], dtype=float32)> has shape (64000, 768), but the saved weight has shape (28996, 768).
I came across this in the Hugging Face forum, but since Rasa doesn’t load the model the way I would for any other NLP task (with .from_pretrained(), which seems to be the solution there), I don’t know whether there’s a fix for this in Rasa.

Here’s the link to that issue: load tf2 roberta model meet error #2598. In the meantime, my plan is to try bert-base-arabic instead of this model.
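For reference, the saved shape (28996, 768) matches bert-base-cased’s vocabulary size, while AraBERTv2 uses a 64000-token vocabulary, so the checkpoint can’t be loaded into a model built with the default BERT config. A rough sketch of what I mean by the .from_pretrained() route (just an illustration outside of Rasa; I’m assuming the transformers library and AraBERTv2’s hub ID, aubmindlab/bert-base-arabertv2):

    # Sketch only, not a Rasa fix: .from_pretrained() builds the model
    # from the checkpoint's own config, so the embedding matrix gets
    # AraBERTv2's 64000-token vocabulary instead of BERT's default.
    from transformers import AutoTokenizer, TFAutoModel

    model_id = "aubmindlab/bert-base-arabertv2"

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # add from_pt=True if the checkpoint only ships PyTorch weights
    model = TFAutoModel.from_pretrained(model_id)

    print(model.config.vocab_size)  # 64000, matching the error above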

Any help is appreciated.

Hi @merveenoyan , it looks like the model you are trying to use has a different layer size than the standard BERT models. If that’s the case, it’s definitely not possible to load it in Rasa right now. However, the default model weights that LanguageModelFeaturizer loads are from the LaBSE model, which is also trained on Arabic. We’ve found that the sentence embeddings from this model are very useful for tasks like intent classification, so I’d encourage you to try that too. :slight_smile:

Can you please confirm that I should do it like this? (I’m using 1.10.24.)

    language: ar

    pipeline:
      - name: LanguageModelFeaturizer
        # Name of the language model to use
        model_name: "bert"
        # Pre-trained weights to be loaded
        model_weights: "rasa/LaBSE"
        cache_dir: null

Does this get the Arabic LaBSE weights from Hugging Face directly, or should I do something additional like pip install rasa[labse]?

Also, in case anyone wonders: I did use the cached model weights, but it performed really badly. :’) The intent classification accuracy was zero (0). :smiley:

You can install Rasa with pip install rasa[transformers] (append your specific version).
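For example, for the 1.10.24 install you mentioned, that would be pip install "rasa[transformers]==1.10.24" (the exact pin is just an illustration).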

Config:

    language: ar
    pipeline:
      - name: LanguageModelTokenizer
        model_name: "bert"
        model_weights: "rasa/LaBSE"
      - name: LanguageModelFeaturizer
        model_name: "bert"
        model_weights: "rasa/LaBSE"

...

This will get the weights for the LaBSE model. There are no Arabic-specific weights for LaBSE; it’s a multilingual model that is trained on Arabic as well, which is why you can use it here too. Let me know if you run into any problems.
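If you’d like to sanity-check the weights outside of Rasa first, here’s a quick sketch (assuming the transformers library is installed; rasa/LaBSE is the same model ID as in the config above):

    # Verify that the rasa/LaBSE weights download and that the
    # vocabulary covers Arabic before training the full NLU pipeline.
    from transformers import AutoTokenizer, TFAutoModel

    tokenizer = AutoTokenizer.from_pretrained("rasa/LaBSE")
    model = TFAutoModel.from_pretrained("rasa/LaBSE")

    # Arabic input maps onto LaBSE's multilingual vocabulary
    print(tokenizer.tokenize("مرحبا بالعالم"))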