DIETClassifier fails to train when Chinese training data contains English words separated by spaces

Rasa version: 2.2.3

Rasa SDK version (if used & relevant):

Rasa X version (if used & relevant):

Python version: 3.6.12

Operating system (windows, osx, …): Windows and Linux

Issue: When Chinese training data contains English words separated by spaces, the DIETClassifier fails during training.

Error (including full traceback):

2021-02-09 16:39:32.688396: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'cudart64_101.dll'; dlerror: cudart64_101.dll not found
2021-02-09 16:39:32.688662: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-02-09 16:39:40.660431: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library nvcuda.dll
2021-02-09 16:39:41.315725: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce MX150 computeCapability: 6.1
coreClock: 1.5315GHz coreCount: 3 deviceMemorySize: 2.00GiB deviceMemoryBandwidth: 44.76GiB/s
2021-02-09 16:39:41.319505: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'cudart64_101.dll'; dlerror: cudart64_101.dll not found
2021-02-09 16:39:41.323006: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'cublas64_10.dll'; dlerror: cublas64_10.dll not found
2021-02-09 16:39:41.326474: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'cufft64_10.dll'; dlerror: cufft64_10.dll not found
2021-02-09 16:39:41.330811: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'curand64_10.dll'; dlerror: curand64_10.dll not found
2021-02-09 16:39:41.334577: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'cusolver64_10.dll'; dlerror: cusolver64_10.dll not found
2021-02-09 16:39:41.338017: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'cusparse64_10.dll'; dlerror: cusparse64_10.dll not found
2021-02-09 16:39:41.347054: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cudnn64_7.dll
2021-02-09 16:39:41.347251: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1753] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2021-02-09 16:39:41.348086: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations:  AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-02-09 16:39:41.356941: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x242a54dba00 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-02-09 16:39:41.357206: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2021-02-09 16:39:41.357492: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-02-09 16:39:41.357688: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263]      
Some layers from the model checkpoint at bert-base-chinese were not used when initializing TFBertModel: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-chinese.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.
2021-02-09 16:39:43 INFO     rasa.nlu.components  - Added 'LanguageModelFeaturizer' to component cache. Key 'LanguageModelFeaturizer-bert-68d7c530c1c4708f5657e4ae28219570'.
2021-02-09 16:39:43 INFO     rasa.nlu.model  - Starting to train component JiebaTokenizer
Building prefix dict from the default dictionary ...
Loading model from cache D:\TEMP\jieba.cache
Loading model cost 1.013 seconds.
Prefix dict has been built successfully.
2021-02-09 16:39:44 INFO     rasa.nlu.model  - Finished training component.
2021-02-09 16:39:44 INFO     rasa.nlu.model  - Starting to train component RegexFeaturizer
2021-02-09 16:39:44 INFO     rasa.nlu.model  - Finished training component.
2021-02-09 16:39:44 INFO     rasa.nlu.model  - Starting to train component LexicalSyntacticFeaturizer
2021-02-09 16:39:44 INFO     rasa.nlu.model  - Finished training component.
2021-02-09 16:39:44 INFO     rasa.nlu.model  - Starting to train component LanguageModelFeaturizer
2021-02-09 16:39:44 INFO     rasa.nlu.model  - Finished training component.
2021-02-09 16:39:44 INFO     rasa.nlu.model  - Starting to train component DIETClassifier
Epochs:   0%|          | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "D:/NLU/test-project/run.py", line 112, in <module>
    train_nlu()
  File "D:/NLU/test-project/run.py", line 70, in train_nlu
    trainer.train(training_data)
  File "D:\software\anaconda\envs\test-project\lib\site-packages\rasa\nlu\model.py", line 209, in train
    updates = component.train(working_data, self.config, **context)
  File "D:\software\anaconda\envs\test-project\lib\site-packages\rasa\nlu\classifiers\diet_classifier.py", line 818, in train
    self.component_config[BATCH_STRATEGY],
  File "D:\software\anaconda\envs\test-project\lib\site-packages\rasa\utils\tensorflow\models.py", line 242, in fit
    self.train_summary_writer,
  File "D:\software\anaconda\envs\test-project\lib\site-packages\rasa\utils\tensorflow\models.py", line 438, in _batch_loop
    call_model_function(batch_in)
  File "D:\software\anaconda\envs\test-project\lib\site-packages\tensorflow\python\eager\def_function.py", line 780, in __call__
    result = self._call(*args, **kwds)
  File "D:\software\anaconda\envs\test-project\lib\site-packages\tensorflow\python\eager\def_function.py", line 807, in _call
    return self._stateless_fn(*args, **kwds)  # pylint: disable=not-callable
  File "D:\software\anaconda\envs\test-project\lib\site-packages\tensorflow\python\eager\function.py", line 2829, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "D:\software\anaconda\envs\test-project\lib\site-packages\tensorflow\python\eager\function.py", line 1848, in _filtered_call
    cancellation_manager=cancellation_manager)
  File "D:\software\anaconda\envs\test-project\lib\site-packages\tensorflow\python\eager\function.py", line 1924, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "D:\software\anaconda\envs\test-project\lib\site-packages\tensorflow\python\eager\function.py", line 550, in call
    ctx=ctx)
  File "D:\software\anaconda\envs\test-project\lib\site-packages\tensorflow\python\eager\execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.InvalidArgumentError:  ConcatOp : Dimensions of inputs should match: shape[0] = [4,9,128] vs. shape[1] = [4,8,768]
	 [[node concat (defined at \software\anaconda\envs\test-project\lib\site-packages\rasa\utils\tensorflow\models.py:929) ]] [Op:__inference_train_on_batch_12510]

Errors may have originated from an input operation.
Input Source operations connected to node concat:
 dropout_39/dropout/Mul_1 (defined at \software\anaconda\envs\test-project\lib\site-packages\rasa\utils\tensorflow\models.py:918)	
 batch_in_9 (defined at \software\anaconda\envs\test-project\lib\site-packages\rasa\utils\tensorflow\models.py:464)

Function call stack:
train_on_batch


Process finished with exit code 1

Content of configuration file (config.yml) (if relevant):

# Configuration for Rasa NLU.
# https://rasa.com/docs/rasa/nlu/components/
language: zh

pipeline:
# # No configuration for the NLU pipeline was provided. The following default pipeline was used to train your model.
# # If you'd like to customize it, uncomment and adjust the pipeline.
# # See https://rasa.com/docs/rasa/tuning-your-model for more information.
   - name: JiebaTokenizer
   - name: RegexFeaturizer
   - name: LexicalSyntacticFeaturizer
   - name: LanguageModelFeaturizer
     model_name: bert
     model_weights: bert-base-chinese
     cache_dir: null
   - name: DIETClassifier
     epochs: 1
   - name: EntitySynonymMapper
   - name: ResponseSelector
     epochs: 100
   - name: FallbackClassifier
     threshold: 0.3
     ambiguity_threshold: 0.1

# Configuration for Rasa Core.
# https://rasa.com/docs/rasa/core/policies/
policies:
# # No configuration for policies was provided. The following default policies were used to train your model.
# # If you'd like to customize them, uncomment and adjust the policies.
# # See https://rasa.com/docs/rasa/policies for more information.
   - name: AugmentedMemoizationPolicy
   - name: TEDPolicy
     max_history: 8
     epochs: 200
     hidden_layers_sizes:
       dialogue: [256, 128]
   - name: RulePolicy

Content of nlu file (nlu.yml) (if relevant):

version: "2.0"

nlu:
- intent: greet
  examples: |
    - 嘿
    - hello 中国helloword

- intent: download
  examples: |
    - 下载google
    - 如何才能在下载和安装google app

The DIETClassifier training error is caused by the space between "google" and "app" in the sentence “如何才能在下载和安装google app”. If the space is removed, the DIETClassifier trains normally.
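The shape mismatch in the traceback (`[4,9,128]` vs `[4,8,768]`) suggests the sparse-feature path sees 9 sequence positions while the BERT dense-feature path sees 8. A minimal sketch of that hypothesis follows; the token lists are assumptions about what the two tokenization paths emit for this sentence, not output captured from Rasa:

```python
sentence = "如何才能在下载和安装google app"

# Assumed Jieba-style segmentation: the space survives as its own token,
# so the sparse featurizers see 9 sequence positions.
jieba_tokens = ["如何", "才能", "在", "下载", "和", "安装", "google", " ", "app"]

# Assumed BERT-side alignment: whitespace-only tokens are discarded,
# leaving 8 positions for the 768-dimensional dense features.
bert_aligned = [t for t in jieba_tokens if t.strip()]

# The off-by-one sequence length is what makes the ConcatOp in DIET fail:
# shape[0] = [batch, 9, 128] vs shape[1] = [batch, 8, 768].
print(len(jieba_tokens), len(bert_aligned))  # 9 8
```

If this hypothesis holds, any sentence whose Jieba segmentation keeps a whitespace token would trigger the same mismatch.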

How can I solve this problem? Please help me, thanks!

The docs for the JiebaTokenizer state that it only works for Chinese. Maybe you need a different tokenizer for mixed English and Chinese text?
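In the meantime, given the reporter's observation that removing the space makes training succeed, one stopgap would be to normalize whitespace out of the affected examples before training. A hedged sketch follows; note it is lossy (it erases the boundary between English words) and is a workaround, not an official Rasa fix:

```python
import re

def collapse_whitespace(text: str) -> str:
    """Strip all whitespace so both tokenization paths agree on the
    token count. Lossy: 'google app' becomes 'googleapp'."""
    return re.sub(r"\s+", "", text)

print(collapse_whitespace("如何才能在下载和安装google app"))
# 如何才能在下载和安装googleapp
```

This could be applied to the nlu.yml examples offline, or inside a custom pipeline component that runs before the tokenizer.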