Error when building pipeline with LanguageModelFeaturizer in language zh

p768lwy3 · March 23, 2021, 5:45am

Hi everyone, I am new to Rasa and ran into the following error at “Starting to train component DIETClassifier” while training the pipeline with the configuration below:

tensorflow.python.framework.errors_impl.InvalidArgumentError: All dimensions except 2 must match. Input 1 has shape [64 8 768] and doesn't match input 0 with shape [64 11 128].

The configuration is:

language: zh

pipeline:
  - name: JiebaTokenizer
    dictionary_path: ./jieba_userdict
  - name: LanguageModelFeaturizer
    model_name: bert
    model_weights: bert-base-chinese
    cache_dir: null
  - name: RegexFeaturizer
  - name: DIETClassifier
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 100

I have tried a “language: en” model with LanguageModelFeaturizer, and it runs perfectly. Does this mean LanguageModelFeaturizer doesn’t support “language: zh”?

Thanks a lot!

Jasper

koaning · March 23, 2021, 4:17pm

Hi Jasper,

Strange, that shouldn’t happen. I’m wondering what is going on here.

Just to confirm, if you remove the LanguageModelFeaturizer component, does the error persist? I’m wondering if there’s a mismatch between the tokeniser and the language model.

Could you confirm the Rasa version here? Also the huggingface version?

rasa --version
pip freeze | grep huggingface

On a related note, I just merged a PR for spaCy. Soon, you should also be able to get pre-trained language models for Chinese via the spaCy tooling.

p768lwy3 · March 24, 2021, 10:01am

Yes, no error is raised after removing the LanguageModelFeaturizer component, and training completes smoothly.

For the rasa --version, I have got:

> Rasa SDK Version : 2.4.0
> Rasa X Version   : None
> Python Version   : 3.8.2

For pip freeze | grep huggingface, no matching package is found, but transformers is installed: transformers==4.2.2

Happy to hear about the spaCy update! I tried it before and was quite sad when it told me SpacyNLP did not currently support Chinese.

Thanks a lot!

koaning · March 24, 2021, 11:09am

Strange.

One thing to try: could you use these settings?

pipeline:
  - name: LanguageModelFeaturizer
    # Name of the language model to use
    model_name: "bert"
    # Pre-Trained weights to be loaded
    model_weights: "rasa/LaBSE"

I’m mentioning this model because it is explicitly mentioned in our docs and LaBSE is a multilingual model. One of the languages it was trained on is Chinese, so it might be a workable alternative.

p768lwy3 · April 7, 2021, 6:15am

Sorry for the late reply. I finally tried your settings with LaBSE, but I get the same error as with the BERT model when training the DIETClassifier component:

tensorflow.python.framework.errors_impl.InvalidArgumentError:  All dimensions except 2 must match. Input 1 has shape [64 13 768] and doesn't match input 0 with shape [64 21 128].
         [[node gradient_tape/ConcatOffset_1 (defined at C:\Users\p768l\AppData\Roaming\Python\Python38\site-packages\rasa\utils\tensorflow\models.py:157) ]] [Op:__inference_train_function_54635]

Function call stack:
train_function

For your information, I can train and run the pipeline with both BERT and LaBSE if I replace DIETClassifier with SklearnIntentClassifier and CRFEntityExtractor.

Thanks a lot!

koaning · April 7, 2021, 6:33am

Hi Wai Yin Li,

I’m now wondering if this is perhaps a bug that we should investigate. Is it possible for you to send me a minimum viable example of nlu.yml and config.yml that I might be able to run locally? If I can confirm the error I’ll gladly start a GitHub issue for it.

kearnsw · May 12, 2021, 2:38pm

The error originates from the concatenation of sequence features in DIETClassifier: the two featurizers produce a different number of tokens (the shapes are batch size x tokens x embedding size), so the tensors cannot be joined along the embedding dimension.

I am able to reproduce it with a custom component that uses a tokenizer that is not in the NLU pipeline. A similar issue is likely occurring between JiebaTokenizer and the tokenizer inside LanguageModelFeaturizer.
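The shape rule behind this error can be illustrated without TensorFlow. The sketch below is plain Python, not Rasa or TensorFlow code; `concat_shape` is a hypothetical helper that only mimics the check, and the shapes are the ones from the first error message in this thread.

```python
from typing import Sequence, Tuple


def concat_shape(shapes: Sequence[Tuple[int, ...]], axis: int = -1) -> Tuple[int, ...]:
    """Return the shape that results from concatenating tensors along `axis`.

    Hypothetical helper for illustration only. It raises ValueError when any
    dimension other than `axis` differs, which is the rule the TensorFlow
    message refers to ("All dimensions except 2 must match").
    """
    first = shapes[0]
    axis = axis % len(first)  # turn -1 into the index of the last dimension
    for idx, shape in enumerate(shapes[1:], start=1):
        if len(shape) != len(first):
            raise ValueError(f"Input {idx} has a different rank than input 0")
        for dim, (a, b) in enumerate(zip(first, shape)):
            if dim != axis and a != b:
                raise ValueError(
                    f"Input {idx} has shape {list(shape)} and doesn't match "
                    f"input 0 with shape {list(first)} on dimension {dim}"
                )
    out = list(first)
    out[axis] = sum(shape[axis] for shape in shapes)
    return tuple(out)


# (batch size, tokens, embedding size), as reported in the error message.
input_0 = (64, 11, 128)
input_1 = (64, 8, 768)

try:
    concat_shape([input_0, input_1])
except ValueError as err:
    print("mismatch:", err)

# With the same number of tokens on both sides the concatenation is valid.
print(concat_shape([input_0, (64, 11, 768)]))
```

The token dimension (11 vs. 8) is the one that has to agree; only the embedding sizes (128 and 768) are allowed to differ, because that is the axis being concatenated.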

@p768lwy3, did using the spaCy tokenizer resolve this issue?

Star007 · December 9, 2021, 3:07pm

I also encountered this error. I spent a long time changing things without success, so I had to give up on using DIET.

wbrinki · March 25, 2022, 8:42am

I encountered the same problem when testing with cross-validation. The curious thing was that some models worked and others didn’t. After a long investigation I found the problem in my NLU data. We use automated scripts that read our Excel files with the NLU data, write them in Rasa NLU format, and label our entities in the process. For some reason the scripts sometimes insert a character between words that looks like a space but is not a regular whitespace character. You will only see it when your IDE is set to show whitespace. Maybe this helps when you are investigating.
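A quick way to check for this without relying on the IDE is to scan the training examples for space-like characters other than the usual ones. This is a minimal sketch using only the standard library; `find_unusual_spaces` is a hypothetical helper, and which characters count as "unusual" (Unicode space separators and invisible format characters) is an assumption, since the post does not say which character was involved.

```python
import unicodedata
from typing import List, Tuple


def find_unusual_spaces(text: str) -> List[Tuple[int, str, str]]:
    """Return (index, code point, Unicode name) for suspicious space-like characters.

    Hypothetical helper: plain space, tab, newline and carriage return are
    treated as normal; everything else that behaves like a space is reported.
    """
    allowed = {" ", "\t", "\n", "\r"}
    hits = []
    for i, ch in enumerate(text):
        if ch in allowed:
            continue
        # "Zs" covers space separators such as U+00A0 and U+3000;
        # "Cf" covers invisible format characters such as U+200B.
        if ch.isspace() or unicodedata.category(ch) in {"Zs", "Cf"}:
            hits.append((i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")))
    return hits


# Example: a no-break space hidden between two words.
print(find_unusual_spaces("book a\u00a0flight"))
# → [(6, 'U+00A0', 'NO-BREAK SPACE')]
```

Running this over every example line in the NLU file and printing the line number next to any hit should point to the affected examples.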
