I trained a DIET model with these regexes in my NLU.md file and the RegexFeaturizer turned on in my pipeline. However, my bot didn't understand a certain year that matches the regex (four digits).
Is the regex information in the nlu.md file only for interpretation by the RegexFeaturizer, or is there a way to turn on exact matching entities according to the regex with higher priority than the DIET / ML model?
Use word boundaries around your regexes to improve them, e.g. \b[0-9]{4}\b. With your current regexes, the year pattern will match twice inside any zip code.
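To make the word-boundary point concrete, here is a small sketch using Python's standard `re` module. The unbounded pattern `[0-9]{4}` is an assumption about what the original regex looked like, and the helper name `overlapping_matches` is made up for this illustration.

```python
import re

# Assumed original pattern (no word boundaries) vs. the suggested one.
UNBOUNDED_YEAR = r"[0-9]{4}"
BOUNDED_YEAR = r"\b[0-9]{4}\b"


def overlapping_matches(pattern, text):
    """Return every match of `pattern` in `text`, including overlapping ones.

    The lookahead matches at each position without consuming characters,
    so two matches that share digits are both reported.
    """
    return [m.group(1) for m in re.finditer(f"(?=({pattern}))", text)]


zip_code = "98101"
print(overlapping_matches(UNBOUNDED_YEAR, zip_code))        # ['9810', '8101']
print(overlapping_matches(BOUNDED_YEAR, zip_code))          # []
print(overlapping_matches(BOUNDED_YEAR, "born in 1980"))    # ['1980']
```

Without `\b`, a five-digit zip code contains two four-digit windows, so the year pattern fires on it. With `\b` on both sides, the year pattern only matches a standalone four-digit token.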
Do you have training examples with years in your training data?
I do have many examples. DIET found 1980 but not 2334.
An update: This is still an issue for me:
Here are my regex patterns:
## regex:yearBorn
- \b[0-9]{4}\b
## regex:zipCode
- \b[0-9]{5}\b
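As a rough illustration of what "exact matching with higher priority than the model" could look like, the sketch below applies the two patterns above directly and lets them override overlapping model predictions. This is plain Python, not a Rasa API: the function names, the entity dictionary layout, and the example model output are all assumptions, and wiring this into a pipeline (for example as a custom component) is not shown.

```python
import re

# Patterns copied from the post; the keys are used as entity names.
PATTERNS = {
    "yearBorn": re.compile(r"\b[0-9]{4}\b"),
    "zipCode": re.compile(r"\b[0-9]{5}\b"),
}


def extract_regex_entities(text):
    """Return one entity dict per regex match, ordered by position."""
    entities = []
    for name, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            entities.append(
                {"entity": name, "value": m.group(), "start": m.start(), "end": m.end()}
            )
    return sorted(entities, key=lambda e: e["start"])


def merge_with_priority(regex_entities, model_entities):
    """Keep all regex entities; keep a model entity only if it overlaps none of them."""

    def overlaps(a, b):
        return a["start"] < b["end"] and b["start"] < a["end"]

    kept = [
        e for e in model_entities
        if not any(overlaps(e, r) for r in regex_entities)
    ]
    return sorted(regex_entities + kept, key=lambda e: e["start"])


text = "I was born in 2334 and live in 98101"
start = text.index("2334")
# Hypothetical model output that mislabels the year as a zip code.
model_entities = [{"entity": "zipCode", "value": "2334", "start": start, "end": start + 4}]

merged = merge_with_priority(extract_regex_entities(text), model_entities)
print([(e["entity"], e["value"]) for e in merged])
# [('yearBorn', '2334'), ('zipCode', '98101')]
```

Because both patterns are bounded by `\b`, a given digit token can match at most one of them, so the regex results never conflict with each other.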
Rasa still makes interpretations that confuse these entities:
language: en
pipeline:
  - name: ConveRTTokenizer
    intent_tokenization_flag: true
    intent_split_symbol: +
  - name: ConveRTFeaturizer
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: char_wb
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 30
    num_transformer_layers: 4
    transformer_size: 256
    use_masked_language_model: false
    drop_rate: 0.25
    weight_sparsity: 0.7
    batch_size:
      - 32
      - 128
    embedding_dimension: 30
    hidden_layer_sized:
      text:
        - 512
        - 128
  - name: DucklingHTTPExtractor
    url: http://localhost:8000
    dimensions:
      - time
      - number
    locale: en_US
    timezone: US/Pacific
    timeout: 3
  - name: EntitySynonymMapper
Akhil, September 4, 2020, 3:08pm
Hi @argideritzalpea. Did you solve it?
If not, could you try removing - name: LexicalSyntacticFeaturizer from the pipeline and see whether that helps?