In the config.yml I want to config token_pattern: r’(?u)\b\w+\b’ for chinese under CountVectorsFeaturizer Component. But it doesn’t work.
# Configuration for Rasa NLU.
# https://rasa.com/docs/rasa/nlu/components/
language: "zh"
pipeline:
- name: "JiebaTokenizer"
dictionary_path: "jieba_dict"
- name: "RegexFeaturizer"
- name: "CRFEntityExtractor"
- name: "EntitySynonymMapper"
- name: "CountVectorsFeaturizer"
token_pattern: r'(?u)\b\w+\b'
- name: "EmbeddingIntentClassifier"
# Configuration for Rasa Core.
# https://rasa.com/docs/rasa/core/policies/
policies:
- name: MemoizationPolicy
- name: KerasPolicy
- name: MappingPolicy
- name: "FallbackPolicy"
nlu_threshold: 0.3
core_threshold: 0.3
fallback_action_name: 'action_default_fallback'
CountVectorsFeaturizer reads my configuration r’(?u)\b\w+\b’ as a normal string not a regex. and failed in train method and go into exception brach:
try:
# noinspection PyPep8Naming
X = self.vectorizer.fit_transform(lem_exs).toarray()
except ValueError:
self.vectorizer = None (will come here)
return