One thing that might be worth trying: could you use these settings?

```yaml
- name: LanguageModelFeaturizer
  # Name of the language model to use
  model_name: "bert"
  # Pre-trained weights to be loaded
  model_weights: "rasa/LaBSE"
```
I’m mentioning this model because it is explicitly mentioned in our docs, and the LaBSE model is trained on a multilingual corpus. One of the languages it was trained on is Chinese, so it might be an alternative worth trying.
Sorry for the late reply. I have finally tried your settings with LaBSE, but it gave me the same issue as the BERT model when training the DIETClassifier component:
```
tensorflow.python.framework.errors_impl.InvalidArgumentError: All dimensions except 2 must match. Input 1 has shape [64 13 768] and doesn't match input 0 with shape [64 21 128].
[[node gradient_tape/ConcatOffset_1 (defined at C:\Users\p768l\AppData\Roaming\Python\Python38\site-packages\rasa\utils\tensorflow\models.py:157)
Function call stack:
```
And for your information, I can train and run the pipeline with both BERT and LaBSE if I replace the DIETClassifier with SklearnIntentClassifier and CRFEntityExtractor.
I’m now wondering if this is perhaps a bug that we should investigate. Is it possible for you to send me a minimum viable example of nlu.yml and config.yml that I might be able to run locally? If I can confirm the error I’ll gladly start a GitHub issue for it.
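For reference, a minimal pair of files in that spirit might look like the following sketch. The component combination (JiebaTokenizer plus LaBSE weights) is assumed from the rest of this thread, and the intents and examples are placeholders; adjust everything to your actual data:

```yaml
# config.yml (sketch; components assumed from this thread)
language: zh
pipeline:
  - name: JiebaTokenizer
  - name: LanguageModelFeaturizer
    model_name: "bert"
    model_weights: "rasa/LaBSE"
  - name: DIETClassifier
    epochs: 100

# nlu.yml (placeholder intents and examples)
version: "2.0"
nlu:
  - intent: greet
    examples: |
      - 你好
      - 早安
```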
The error originates from the DIETClassifier concatenating sequence features from two featurizers that disagree on the number of tokens (batch size × tokens × embedding).
I am able to reproduce this with a custom component that uses a tokenizer that is not in the NLU pipeline. A similar issue likely occurs between the JiebaTokenizer and the tokenizer inside the LanguageModelFeaturizer.
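To make the shape mismatch concrete, here is a small NumPy sketch (the real error comes from TensorFlow inside DIET, but the arithmetic is the same; the shapes are taken from the traceback above):

```python
import numpy as np

# Two featurizers produce sequence features of shape (batch, tokens, embedding).
# If their tokenizers split the text differently, the token dimension disagrees.
bert_feats = np.zeros((64, 13, 768))   # tokens as seen by the LanguageModelFeaturizer
other_feats = np.zeros((64, 21, 128))  # tokens as seen by the pipeline's tokenizer

try:
    # DIET concatenates along the feature axis (axis 2); every other
    # dimension must match, but 13 != 21 in the token dimension.
    np.concatenate([other_feats, bert_feats], axis=2)
except ValueError as err:
    print("concatenation failed:", err)
```

With matching token counts, e.g. shapes (64, 13, 768) and (64, 13, 128), the same call succeeds and yields (64, 13, 896).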
@p768lwy3, did using the spacy tokenizer resolve this issue?
I encountered the same problem when testing with cross-validation. The curious thing was that some models worked and others didn’t. After a long time investigating, I found the problem in my NLU data. We use automated scripts that read our Excel files with the NLU data, write them out in Rasa NLU format, and label our entities in the process. For some reason the scripts sometimes insert a character between words that looks like a space but is not a regular whitespace character (e.g. a non-breaking space). You will only see it when your IDE is set to show whitespace characters.
Maybe this helps when you are investigating.
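A quick way to hunt for those invisible characters is a small script like the one below (the function name and the example string are mine, not from the thread; point it at the text of your nlu.yml):

```python
import unicodedata

def find_odd_whitespace(text):
    """Return (line_number, char_name) for every whitespace character
    that is not a plain space or tab, e.g. NO-BREAK SPACE (\u00a0)."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for ch in line:
            if ch.isspace() and ch not in " \t":
                hits.append((lineno, unicodedata.name(ch, hex(ord(ch)))))
    return hits

# Example: a non-breaking space hiding between two words.
print(find_odd_whitespace("- check\u00a0weather"))  # → [(1, 'NO-BREAK SPACE')]
```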