Bad intent recognition for skewed dataset

Remy · May 21, 2019, 1:53pm

Hiya all,

My dataset is pretty skewed because I generate examples from entity values. A couple of my intents (inform, among others) use entities, and to train the CRF properly I have to generate the same sentence over and over with a different entity value each time. As a result, the inform intent alone has about 3000 examples.

Coupling this with intents that don’t use entities (e.g. tell_me_a_joke) makes the model misidentify these smaller intents since these only have about 50 examples (and for the life of me I cannot get any more examples than these 50).

Is there any way to de-skew the model, for instance by weighting the smaller intents more heavily?
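To make the question concrete, here is a rough sketch of what inverse-frequency weighting would look like for the counts mentioned above. This is plain Python and not a Rasa setting; the `balanced_weights` helper is hypothetical, and the formula is the same heuristic scikit-learn uses for `class_weight="balanced"`.

```python
from collections import Counter


def balanced_weights(labels):
    """Weight each intent by n_samples / (n_intents * count_of_that_intent)."""
    counts = Counter(labels)
    total = len(labels)
    n_intents = len(counts)
    return {intent: total / (n_intents * count) for intent, count in counts.items()}


# Counts taken from the post: about 3000 inform examples, about 50 joke examples.
labels = ["inform"] * 3000 + ["tell_me_a_joke"] * 50
weights = balanced_weights(labels)

for intent, weight in weights.items():
    print(f"{intent}: {weight:.3f}")
```

With these counts, each tell_me_a_joke example would count 60 times as much as an inform example, which shows how extreme the skew is.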

matthew.cavener · May 21, 2019, 11:00pm

The docs site is 404-ing now, so I can’t link the docs to confirm, but I think in the training data you can specify entities, and the intent training examples just need the one entity. So just move all of your separate entities out of your intent training examples and it should be able to identify other intents better. Remind me when the docs are back up and I’ll see if I can’t confirm that.

Okay, so I found the new docs; they just aren’t at the top of Google searches yet.

They have a section on using lookup tables to avoid the kind of overfitting you seem to be seeing. That should fit your apparent use case.

Remy · May 22, 2019, 8:30am

So I am using lookup tables, and have been for a long time. However, if I don’t include the synonym values in the training data alongside the actual values, the model still doesn’t pick the synonyms up as values for my entities.

matthew.cavener · May 22, 2019, 9:44pm

Ah, I misunderstood how synonyms worked with Rasa. Looks like you do need each entity synonym present in a training example for it to be recognized, then replaced with the synonymous value.

I suppose you could preprocess your messages, doing synonym recognition and replacement before sending them to your NLU model. Then you would only need a few unique utterances for each intent.
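A minimal sketch of that preprocessing idea, assuming a hand-maintained synonym map. The `SYNONYMS` dictionary and the `normalize` helper are hypothetical, and the "horige" → "vazal" pair is borrowed from the example later in this thread. This runs outside Rasa, on the raw message text.

```python
import re

# Hypothetical synonym map: surface form -> canonical entity value.
SYNONYMS = {"horige": "vazal"}


def build_pattern(synonyms):
    """Compile one regex matching any synonym as a whole word, longest first."""
    alternatives = sorted(synonyms, key=len, reverse=True)
    joined = "|".join(re.escape(s) for s in alternatives)
    return re.compile(r"\b(" + joined + r")\b", re.IGNORECASE)


def normalize(text, synonyms=SYNONYMS):
    """Replace every known synonym in `text` with its canonical value."""
    if not synonyms:
        return text
    pattern = build_pattern(synonyms)
    lowered = {key.lower(): value for key, value in synonyms.items()}
    return pattern.sub(lambda match: lowered[match.group(0).lower()], text)


print(normalize("what is horige"))
```

The normalized text would then be what gets sent to the NLU model, so the training data only needs the canonical value.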

Remy · May 27, 2019, 10:42am

I got it working, kind of.

NER_CRF learns from the word positions in a sentence, as well as from the multi-word entities you tag and the presence of synonyms.

So my fix:

1. Make a sentence with the original entity value: “what is vazal”
2. Make the same sentence with a synonym: “what is horige”
3. Repeat that a couple of times with different sentences and entities.
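The steps above can be sketched as a small generator that writes each sentence once with the original value and once per synonym. The entity name `product`, the second template, and the helper names are assumptions for illustration. The output uses the Markdown training-data annotation `[text](entity:value)`, so check it against the format your Rasa version expects.

```python
# Hypothetical inputs: only "what is ...", "vazal" and "horige" come from the post.
TEMPLATES = ["what is {}", "tell me about {}"]
ENTITY = "product"  # assumed entity name
VALUE = "vazal"
SYNONYMS = ["horige"]


def annotate(surface, entity, value):
    """Tag `surface` as `entity`, mapping it to `value` when it is a synonym."""
    if surface == value:
        return f"[{surface}]({entity})"
    return f"[{surface}]({entity}:{value})"


def make_examples(templates, entity, value, synonyms):
    """Emit each template once with the original value and once per synonym."""
    examples = []
    for template in templates:
        for surface in [value, *synonyms]:
            examples.append("- " + template.format(annotate(surface, entity, value)))
    return examples


for line in make_examples(TEMPLATES, ENTITY, VALUE, SYNONYMS):
    print(line)
```

Keeping the template list short is the point: a handful of sentence shapes per entity, each paired with its synonyms, rather than thousands of generated variants.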

It picks the right sentences up pretty quickly this way.
