Data Augmentation

I am trying to figure out what exactly data augmentation is and how it works. I read the documentation and tried using augmentation_factor as 20 (which is the default) and also as 0.

The documentation says: "You actually want to teach your policy to ignore the dialogue history when it isn't relevant and just respond with the same action no matter what happened before."

For any augmentation factor, I get the same response.

Can someone give a better idea of what this is and how it is useful?


@pnandhini Not sure if you’re still having trouble with this, but this thread is the first result on Google when you search "how much data augmentation should I use rasa".

I’ll try to do my best, as I’m just entering the ML field, but from what I understand:

By default, Rasa uses a common tactic from the ML toolbelt called data augmentation. The concept is essentially to generate more training data from the data you already provide, in order to train a better model.

There are a lot of methods to do this, but Rasa uses the shorter, simpler stories you have to generate longer dialogues. So instead of just training on this:

```
## story 1
* intent_1
 - utterance_1

## story 2
* intent_2
 - utterance_2
 - utterance_3

<!-- ... and so on... -->
```

it will glue the stories together to something like this:

```
## generated story 1
* intent_1
 - utterance_1
* intent_2
 - utterance_2
 - utterance_3

<!-- ... and so on... -->
```

By doing this, you are teaching the bot to ignore the dialogue history for these examples: because other stories now appear before and/or after the story in question, the model learns to classify the intent and ‘map’ it to the utterance regardless of what came before.
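To make the idea concrete, here is a toy sketch of story "gluing" in Python. This is my own illustration of the concept, not Rasa's actual implementation; the data layout (a story as a list of `(intent, utterances)` turns) and the `augment` function are invented for the example.

```python
import random

# Each toy story is a list of (intent, utterances) turns, mirroring
# the markdown stories above. These names are illustrative only.
story_1 = [("intent_1", ["utterance_1"])]
story_2 = [("intent_2", ["utterance_2", "utterance_3"])]
stories = [story_1, story_2]

def augment(stories, n_augmented, max_glued=3, seed=0):
    """Generate longer dialogues by concatenating randomly chosen stories.

    This mimics the 'gluing' idea: each augmented story is 2..max_glued
    original stories stitched end to end.
    """
    rng = random.Random(seed)
    augmented = []
    for _ in range(n_augmented):
        glued = []
        for _ in range(rng.randint(2, max_glued)):
            glued.extend(rng.choice(stories))
        augmented.append(glued)
    return augmented

for dialogue in augment(stories, n_augmented=2):
    print(dialogue)
```

Each printed dialogue is a longer sequence of intent/utterance turns than any single hand-written story, which is exactly the training signal the augmentation step adds.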

In Rasa, if you pass --augmentation 0 in your training command, you tell the model to skip data augmentation entirely. Whether that helps or hurts tends to depend on the training data you have, among other factors I don’t know enough about to really explain.

By default, the option is --augmentation 50 (i.e., the default value is 50). @akelad this should be reflected in the docs, unless 20 is actually the default augmentation factor.

The augmentation factor is multiplied by 10 to determine the number of augmented stories to subsample, so by default it subsamples 500 of these glued-together stories.
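The factor-times-ten relationship described above (which comes from this thread, not from the Rasa source) works out like this:

```python
# Upper bound on the number of augmented (glued-together) stories,
# per the factor * 10 rule described in this thread.
def max_augmented_stories(augmentation_factor):
    return augmentation_factor * 10

print(max_augmented_stories(50))  # default factor of 50 -> 500 stories
print(max_augmented_stories(0))   # 0 disables augmentation -> 0 stories
```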

What I’m having trouble figuring out myself is what a good benchmark would be. Training with the default values gives me an average accuracy of about 90% and a loss of about 0.6, and increasing the subsampling only helps marginally. I ran a training session with an augmentation factor of 400 (so 4000 subsampled stories) on around 400 total stories, and was only able to reach an accuracy of 92% and a loss of 0.48.

For a large number of stories, is there a good ratio of stories to augmentation factor, since the default may not perform as well? @akelad @Tobias_Wochinger
