Error when building pipeline with LanguageModelFeaturizer in language zh

Hi everyone, I am new to rasa and I have faced the following issue when “Starting to train component DIETClassifier” in the pipeline building process with the given configuration:

tensorflow.python.framework.errors_impl.InvalidArgumentError: All dimensions except 2 must match. Input 1 has shape [64 8 768] and doesn’t match input 0 with shape [64 11 128].

The configuration is:

language: zh

pipeline:
  - name: JiebaTokenizer
    dictionary_path: ./jieba_userdict
  - name: LanguageModelFeaturizer
    model_name: bert
    model_weights: bert-base-chinese
    cache_dir: null
  - name: RegexFeaturizer
  - name: DIETClassifier
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 100

I have tried a “language: en” model with LanguageModelFeaturizer, and it runs perfectly. Does this mean the LanguageModelFeaturizer doesn’t support “language: zh”?

Thanks a lot!

Jasper


Hi Jasper,

Strange, that shouldn’t happen. I’m wondering what is going on here.

Just to confirm, if you remove the LanguageModelFeaturizer component, does the error persist? I’m wondering if there’s a mismatch between the tokeniser and the language model.

Could you confirm the Rasa version here? Also the huggingface version?

rasa --version
pip freeze | grep huggingface

Related: I just merged a PR for spaCy. Soon you should be able to get pre-trained language models for Chinese via the spaCy tooling as well.


Yes, no error is raised after removing the LanguageModelFeaturizer component, and the training completes smoothly.

For the rasa --version, I have got:

> Rasa SDK Version : 2.4.0
> Rasa X Version   : None
> Python Version   : 3.8.2

and for pip freeze | grep huggingface, I cannot find any such package, but I do have transformers installed: transformers==4.2.2

Happy to hear about the spaCy update! I tried it before and was quite sad when it told me SpacyNLP did not currently support Chinese.

Thanks a lot!

Strange.

One thing to perhaps try out, could you try using these settings?

pipeline:
  - name: LanguageModelFeaturizer
    # Name of the language model to use
    model_name: "bert"
    # Pre-Trained weights to be loaded
    model_weights: "rasa/LaBSE"

I’m mentioning this model because it is explicitly mentioned in our docs: LaBSE is trained on a multilingual base, and one of the languages it was trained on is Chinese, so it might be an alternative worth trying.

Sorry for the late reply. I have finally tried your setting with LaBSE, but it gave me the same issue as the BERT model when training the DIETClassifier component:

tensorflow.python.framework.errors_impl.InvalidArgumentError: All dimensions except 2 must match. Input 1 has shape [64 13 768] and doesn't match input 0 with shape [64 21 128].
	[[node gradient_tape/ConcatOffset_1 (defined at C:\Users\p768l\AppData\Roaming\Python\Python38\site-packages\rasa\utils\tensorflow\models.py:157)]] [Op:__inference_train_function_54635]

Function call stack:
train_function

And for your information, I can train and run the pipeline with both BERT and LaBSE if I replace the DIETClassifier with SklearnIntentClassifier and CRFEntityExtractor.

Thanks a lot!


Hi Wai Yin Li,

I’m now wondering if this is perhaps a bug that we should investigate. Is it possible for you to send me a minimum viable example of nlu.yml and config.yml that I might be able to run locally? If I can confirm the error I’ll gladly start a GitHub issue for it.

The error originates from the DIETClassifier concatenating sequence features from two featurizers that differ in the number of tokens (batch size × tokens × embedding).

I was able to reproduce this with a custom component that uses a tokenizer that is not in the NLU pipeline. A similar mismatch is likely occurring between the JiebaTokenizer and the tokenizer inside the LanguageModelFeaturizer.
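To make the failure mode concrete, here is a minimal sketch of why the concatenation blows up, with NumPy standing in for TensorFlow and the shapes taken from the original error message (the featurizer attributions in the comments are illustrative assumptions):

```python
import numpy as np

# Each featurizer emits sequence features shaped (batch, tokens, embedding).
# DIET concatenates them along the embedding axis (axis 2), so the token
# counts on axis 1 must agree across featurizers.
regex_feats = np.zeros((64, 11, 128))  # e.g. 11 Jieba tokens
bert_feats = np.zeros((64, 8, 768))    # e.g. 8 BERT sub-tokens for the same batch

try:
    np.concatenate([regex_feats, bert_feats], axis=2)
except ValueError as err:
    print("concatenation failed:", err)
```

If both featurizers see the same token count per message, the concatenation succeeds, which would explain why pipelines where the tokenizer and featurizer agree train without error.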

@p768lwy3, did using the spacy tokenizer resolve this issue?


I also encountered this error. After a long time of tweaking I got nowhere, so I had to give up on DIET :pensive:

I encountered the same problem when testing with cross-validation. The curious thing was that some models worked and others didn’t. After a long investigation I found the problem in my NLU data.

We use automated scripts that read our Excel files with the NLU data, write them out in Rasa NLU format, and label our entities in the process. For some reason the scripts sometimes introduce a character between words that looks like a space but is not a regular whitespace character. You will only see it when your IDE is set to show whitespace. Maybe this helps with your investigation.
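If you suspect the same kind of hidden character in your own training data, a small check like this can flag space-like characters that are not a plain U+0020 space (a sketch; the sample string is just for illustration):

```python
import unicodedata

def find_odd_spaces(text):
    """Return (line_no, col, codepoint) for space-like chars that aren't U+0020."""
    hits = []
    for ln, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            # "Zs" is the Unicode space-separator category, which includes
            # the non-breaking space U+00A0 and thin space U+2009.
            if ch != " " and unicodedata.category(ch) == "Zs":
                hits.append((ln, col, f"U+{ord(ch):04X}"))
    return hits

print(find_odd_spaces("hello\u00a0world\nall good"))  # → [(1, 6, 'U+00A0')]
```

Running it over the text of your nlu file before training should reveal whether such characters slipped in.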