Hello there!
During my migration to Rasa 2.x, I realized that rare intents are now classified very badly. My data is very imbalanced: some intents have a dozen samples while others (like the ones that use the ResponseSelector) have a few hundred. This didn't cause major problems for most intents in Rasa 1.9.
Even testing on the training data yields terrible results for intents with low support. For example, I have an intent happy, which contains a dozen examples such as "you're great!", "I love you", etc., with a support of only 12 samples. In Rasa 1.9, we had a training recall of ~83%, while in Rasa 2.1 I get a recall of 40%.
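For readers who want to reproduce this kind of number, here is a minimal sketch of how per-intent recall on the training data can be computed from (true, predicted) intent pairs. The function name `recall_per_intent` and the sample data are hypothetical and are not taken from the actual project; the counts are chosen only to illustrate a support of 12.

```python
from collections import Counter


def recall_per_intent(true_intents, predicted_intents):
    """Return {intent: recall}, where recall = correct predictions / support."""
    support = Counter(true_intents)
    correct = Counter(
        t for t, p in zip(true_intents, predicted_intents) if t == p
    )
    return {intent: correct[intent] / n for intent, n in support.items()}


# Hypothetical data: 12 "happy" examples, 10 of them predicted correctly,
# plus 4 "faq" examples that are all predicted correctly.
true_intents = ["happy"] * 12 + ["faq"] * 4
predicted_intents = ["happy"] * 10 + ["faq"] * 2 + ["faq"] * 4

recall = recall_per_intent(true_intents, predicted_intents)
print(round(recall["happy"], 2))  # 0.83
```

With a support of only 12, each misclassified example moves the recall by roughly 8 points, so small absolute changes in predictions show up as large swings in the metric.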
I am using the same pipeline as before, with the spaCy tokenizer and featurizer and the DIETClassifier. From my understanding of the documentation, the DIETClassifier uses balanced batching, which should handle an imbalanced dataset. I have copy-pasted my NLU pipeline at the end of this message.
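To make the idea of balanced batching concrete, here is a simplified sketch in which every batch draws evenly from all intents, so rare intents are repeated more often than frequent ones. This is only an illustration of the general technique and is not Rasa's implementation; the function `balanced_batches` and the sample data are hypothetical. In Rasa itself, as far as I can tell from the documentation, the behavior is controlled by the DIETClassifier `batch_strategy` option.

```python
from collections import defaultdict
from itertools import cycle, islice


def balanced_batches(examples, batch_size, num_batches):
    """Build batches by taking one example per intent in round-robin order."""
    by_intent = defaultdict(list)
    for text, intent in examples:
        by_intent[intent].append((text, intent))
    # One endless iterator per intent: rare intents simply repeat more often.
    streams = [cycle(by_intent[intent]) for intent in sorted(by_intent)]
    round_robin = (next(stream) for stream in cycle(streams))
    return [list(islice(round_robin, batch_size)) for _ in range(num_batches)]


# Hypothetical imbalanced data: 2 "happy" examples versus 6 "faq" examples.
examples = [("you're great!", "happy"), ("I love you", "happy")]
examples += [(f"question {i}", "faq") for i in range(6)]

batches = balanced_batches(examples, batch_size=4, num_batches=2)
for batch in batches:
    print([intent for _, intent in batch])
```

Each batch of four ends up with two "happy" and two "faq" examples even though "faq" has three times as much data, which is the effect balanced batching is meant to have on low-support intents.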
The only difference I can see is in the settings for the RegexFeaturizer and the SpacyFeaturizer, which do not have the return_sequence: true option anymore. A quick glance at the code showed that at least the SpacyFeaturizer returns both sequence and sentence features. The spaCy versions are also identical (2.1.9).
Any idea of where I could look? What has changed between the two versions?
Thanks a lot for the pointers! Cheers, Nicolas
config.yml (Rasa 2.x)
language: "en"
pipeline:
  - name: "DucklingEntityExtractor"
    url: "http://duckling.alpaca.casa"
    dimensions: ["time", "duration", "amount-of-money", "number", "email", "phone-number", "ordinal", "url"]
    timezone: "America/New_York"
  - name: "SpacyNLP"
    case_sensitive: true
  - name: "SpacyTokenizer"
  - name: "SpacyEntityExtractor"
    dimensions: ["PERSON", "MONEY"]
  - name: "RegexFeaturizer"
  - name: "SpacyFeaturizer"
  - name: LexicalSyntacticFeaturizer
  - name: "DIETClassifier"
    epochs: 50
    entity_recognition: true
    use_masked_language_model: false
  - ... # response selectors
config.yml (Rasa 1.9)
language: "en"
pipeline:
  - name: "DucklingHTTPExtractor"
    url: "http://duckling.alpaca.casa"
    dimensions: ["time", "duration", "amount-of-money", "number", "email", "phone-number", "ordinal", "url"]
    timezone: "America/New_York"
  - name: "SpacyNLP"
    case_sensitive: true
  - name: "SpacyTokenizer"
  - name: "SpacyEntityExtractor"
    dimensions: ["PERSON", "MONEY"]
  - name: "RegexFeaturizer"
    return_sequence: True # <-- option not available anymore
  - name: "SpacyFeaturizer"
    return_sequence: True # <-- option not available anymore
  - name: LexicalSyntacticFeaturizer
  - name: "DIETClassifier"
    epochs: 50
    entity_recognition: true
    use_masked_language_model: false
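
To double-check the comparison of the two configurations above, here is a small sketch that diffs two pipelines component by component. The dictionaries are an abridged, hand-copied subset of the configs in this post, and the helper `diff_component` is hypothetical. It assumes both pipelines list their components in the same order.

```python
def diff_component(old, new):
    """Return {option: (old_value, new_value)} for options that differ."""
    keys = sorted(set(old) | set(new))
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}


# Abridged pipelines, copied by hand from the two config.yml files.
rasa_1_9 = [
    {"name": "DucklingHTTPExtractor"},
    {"name": "RegexFeaturizer", "return_sequence": True},
    {"name": "SpacyFeaturizer", "return_sequence": True},
    {"name": "DIETClassifier", "epochs": 50, "entity_recognition": True},
]
rasa_2_x = [
    {"name": "DucklingEntityExtractor"},
    {"name": "RegexFeaturizer"},
    {"name": "SpacyFeaturizer"},
    {"name": "DIETClassifier", "epochs": 50, "entity_recognition": True},
]

changes = {}
for index, (old, new) in enumerate(zip(rasa_1_9, rasa_2_x)):
    delta = diff_component(old, new)
    if delta:
        changes[index] = delta

for index, delta in changes.items():
    print(index, delta)
```

Note that this only compares options written explicitly in the files. It cannot reveal default values that changed between Rasa versions, so an identical explicit configuration does not guarantee identical training behavior.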