Error when building pipeline with LanguageModelFeaturizer in language zh

Hi everyone, I am new to rasa and I have faced the following issue when “Starting to train component DIETClassifier” in the pipeline building process with the given configuration:

tensorflow.python.framework.errors_impl.InvalidArgumentError: All dimensions except 2 must match. Input 1 has shape [64 8 768] and doesn’t match input 0 with shape [64 11 128].

The configuration is:

language: zh

pipeline:
  - name: JiebaTokenizer
    dictionary_path: ./jieba_userdict
  - name: LanguageModelFeaturizer
    model_name: bert
    model_weights: bert-base-chinese
    cache_dir: null
  - name: RegexFeaturizer
  - name: DIETClassifier
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 100

I have tried a “language: en” model with LanguageModelFeaturizer, and it runs perfectly. Does this mean the LanguageModelFeaturizer doesn’t support “language: zh”?

Thanks a lot!

Jasper

Hi Jasper,

Strange, that shouldn’t happen. I’m wondering what is going on here.

Just to confirm, if you remove the LanguageModelFeaturizer component, does the error persist? I’m wondering if there’s a mismatch between the tokeniser and the language model.

Could you confirm the Rasa version here? Also the huggingface version?

rasa --version
pip freeze | grep huggingface
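
If grep is not available (for example on Windows), the same information can be read from Python. This is only a minimal sketch: the helper name installed_version is made up for illustration, and the package names listed are assumptions about what is relevant to this pipeline.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Package names here are assumptions; adjust them to your environment.
for name in ("rasa", "transformers", "tensorflow"):
    print(name, installed_version(name))
```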

On a related note, I just merged a PR for spaCy. Soon you should also be able to get pre-trained language models for Chinese via the spaCy tooling.

Yes, no error is raised after removing the LanguageModelFeaturizer component, and training completes smoothly.

For the rasa --version, I have got:

> Rasa SDK Version : 2.4.0
> Rasa X Version   : None
> Python Version   : 3.8.2

As for pip freeze | grep huggingface, I cannot find such a package, but I do have transformers installed: transformers==4.2.2

Happy to hear about the spaCy update! I tried it before and was quite sad when it told me SpacyNLP did not currently support Chinese.

Thanks a lot!

Strange.

One thing to try: could you use these settings?

pipeline:
  - name: LanguageModelFeaturizer
    # Name of the language model to use
    model_name: "bert"
    # Pre-Trained weights to be loaded
    model_weights: "rasa/LaBSE"

I’m mentioning this model because it is explicitly mentioned in our docs and LaBSE is a multilingual model. One of the languages it was trained on is Chinese, so this might be an alternative to try.

Sorry for the late reply. I have finally tried your setting with LaBSE, but it gave me the same error as the BERT model when training the DIETClassifier component:

tensorflow.python.framework.errors_impl.InvalidArgumentError: All dimensions except 2 must match. Input 1 has shape [64 13 768] and doesn't match input 0 with shape [64 21 128].
         [[node gradient_tape/ConcatOffset_1 (defined at C:\Users\p768l\AppData\Roaming\Python\Python38\site-packages\rasa\utils\tensorflow\models.py:157) ]] [Op:__inference_train_function_54635]

Function call stack:
train_function

And for your information, I can train and run the pipeline with both BERT and LaBSE if I replace the DIETClassifier with SklearnIntentClassifier and CRFEntityExtractor.

Thanks a lot!

Hi Wai Yin Li,

I’m now wondering if this is perhaps a bug that we should investigate. Is it possible for you to send me a minimum viable example of nlu.yml and config.yml that I might be able to run locally? If I can confirm the error I’ll gladly start a GitHub issue for it.

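The error originates from the concatenation of sequence features in DIETClassifier. The features come from two featurizers and have shape (batch size x tokens x embedding); they are concatenated along the embedding dimension, so the number of tokens must be the same for both, and here it is not.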
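
To make the shape rule concrete, here is a minimal pure-Python sketch of the check that a concatenation along axis 2 performs. It is not Rasa or TensorFlow code; the function name concat_shape is made up for illustration, and the shapes are the ones from the first error message in this thread.

```python
def concat_shape(shapes, axis):
    """Return the result shape of concatenating tensors along `axis`.

    Every dimension except `axis` must be identical across inputs,
    which mirrors the rule stated in the error message.
    """
    first = shapes[0]
    for i, shape in enumerate(shapes[1:], start=1):
        for dim, (a, b) in enumerate(zip(first, shape)):
            if dim != axis and a != b:
                raise ValueError(
                    f"All dimensions except {axis} must match. "
                    f"Input {i} has shape {list(shape)} and doesn't match "
                    f"input 0 with shape {list(first)}."
                )
    result = list(first)
    result[axis] = sum(shape[axis] for shape in shapes)
    return tuple(result)


# Same token count (11) from both featurizers: concatenation works.
print(concat_shape([(64, 11, 128), (64, 11, 768)], axis=2))  # (64, 11, 896)

# Different token counts (11 vs 8), as in the reported error: it fails.
try:
    concat_shape([(64, 11, 128), (64, 8, 768)], axis=2)
except ValueError as err:
    print(err)
```

The second call fails because the token dimension differs (11 vs 8), which is what you would expect if the two featurizers were working from different tokenizations of the same message.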

I am able to reproduce with a custom component that uses a tokenizer not in the NLU pipeline. There is likely a similar issue occurring between the Jieba Tokenizer and the tokenizer in the LanguageModelFeaturizer.

@p768lwy3, did using the spacy tokenizer resolve this issue?