Bad intent recognition for skewed dataset

Hiya all,

I have a pretty skewed dataset because I generate examples using entities. I have a couple of intents (inform among others) that use entities. To train the CRF properly I need to generate the same sentence many times, each with a different entity value. As a result, the inform intent has about 3000 examples.

Coupling this with intents that don’t use entities (e.g. tell_me_a_joke) makes the model misidentify these smaller intents since these only have about 50 examples (and for the life of me I cannot get any more examples than these 50).

Is there any way to de-skew the model, for instance by weighting the smaller intents more heavily?
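To make concrete what I mean by weighting: the usual "balanced" heuristic gives each intent a weight inversely proportional to its example count. This is just a sketch in plain Python, not a Rasa feature; `balanced_intent_weights` is a name I made up, and whether the intent classifier can actually consume such weights is exactly what I'm asking.

```python
from collections import Counter


def balanced_intent_weights(intent_labels):
    """Return a weight per intent, inversely proportional to its frequency.

    Uses the common heuristic: total / (number_of_intents * count_for_intent).
    """
    counts = Counter(intent_labels)
    total = len(intent_labels)
    n_intents = len(counts)
    return {intent: total / (n_intents * count) for intent, count in counts.items()}


# Roughly the situation described above: one large intent, one small one.
labels = ["inform"] * 3000 + ["tell_me_a_joke"] * 50
weights = balanced_intent_weights(labels)
for intent, weight in weights.items():
    print(intent, round(weight, 3))
```

With these counts, each tell_me_a_joke example would count 60 times as much as an inform example.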

The docs site is 404-ing now, so I can’t link the docs to confirm, but I think in the training data you can specify entities, and the intent training examples just need the one entity. So just move all of your separate entities out of your intent training examples and it should be able to identify other intents better. Remind me when the docs are back up and I’ll see if I can’t confirm that.

Okay, so I found the new docs, they just aren’t at the top of Google searches yet.

They have a section on using Lookup Tables to avoid the kind of overfitting you seem to be seeing. That should fit your apparent use case.

So I am using lookup tables, and have been for a long time. However, if I don’t include the synonym values in the training data alongside the actual values, the model still doesn’t pick the synonyms up as values for my entities.

Ah, I misunderstood how synonyms worked with Rasa. Looks like you do need each entity synonym present in a training example for it to be recognized, then replaced with the synonymous value.

I suppose you could preprocess your messages and do synonym recognition and replacement before sending it to your nlu model, then you only need the few unique utterances for each intent.
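A rough sketch of what that preprocessing step could look like, in plain Python and outside of Rasa entirely. The `SYNONYMS` mapping and the `normalize` helper are hypothetical names for illustration; you would fill the mapping from your own synonym list and call `normalize` on each message before passing it to the NLU model.

```python
import re

# Hypothetical mapping: surface form the user might type -> canonical entity value.
SYNONYMS = {
    "horige": "vazal",
}


def normalize(text, synonyms=SYNONYMS):
    """Replace every known synonym in `text` with its canonical value."""
    if not synonyms:
        return text
    # Longest synonyms first, so multi-word synonyms win over shorter overlaps.
    alternatives = sorted(synonyms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(a) for a in alternatives) + r")\b",
        re.IGNORECASE,
    )
    lowered = {key.lower(): value for key, value in synonyms.items()}
    return pattern.sub(lambda match: lowered[match.group(0).lower()], text)


print(normalize("what is horige"))
```

The word-boundary match is a deliberate choice so that a synonym embedded in a longer word is left alone; it will need adjusting for synonyms that start or end with punctuation.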

I got it working, kind of.

NER_CRF learns from the word positions in a sentence, as well as multi word entities you tag, and the existence of synonyms.

So my fix:

  • make a sentence with the original entity value: “what is vazal”
  • make the same sentence with a synonym: “what is horige”
  • repeat that a couple of times with different sentences and entities.

It picks the right sentences up pretty quickly this way.
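For anyone who wants to script this, here is roughly how the generation step can be sketched. The entity name `term`, the second template, and the `make_examples` helper are made up for illustration, and the output dict shape (text, intent, entities with start/end/value/entity) is an assumption on my part, so check it against the training data format in the docs for your Rasa version.

```python
def make_examples(templates, intent, entity_name, canonical, synonyms):
    """Build one example per template for the canonical value and each synonym.

    Each template must contain a single "{}" placeholder for the entity.
    The entity value is always the canonical one, even when the text
    contains a synonym.
    """
    examples = []
    for template in templates:
        start = template.index("{}")
        for surface in [canonical] + list(synonyms):
            text = template.replace("{}", surface, 1)
            examples.append({
                "text": text,
                "intent": intent,
                "entities": [{
                    "start": start,
                    "end": start + len(surface),
                    "value": canonical,
                    "entity": entity_name,
                }],
            })
    return examples


examples = make_examples(
    templates=["what is {}", "tell me about {}"],
    intent="inform",
    entity_name="term",       # hypothetical entity name
    canonical="vazal",
    synonyms=["horige"],
)
for example in examples:
    print(example["text"], example["entities"][0])
```

This keeps the number of sentences per entity small (one per template and surface form) instead of repeating every sentence for every possible value.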