I wanted to see how everyone is handling their train/test stories. I am currently working with an e2e test set of 500 stories and 207 stories in my train set. Should the train stories and e2e stories follow the 80%/20% rule? Is it best practice to follow this for all train/test sets in Rasa?
The `rasa test nlu` command allows you to properly run a gridsearch; `rasa test core` does not. In that sense I’d say it makes sense to make a separate set of stories to test on, but I’d make sure that both the test set and the train set cover the same ground. If there’s an imbalance between the two (say, the test set contains all the easy stories and the train set contains all the hard ones), then it will be hard to assign a lot of value to a summary statistic.
The 80/20 rule can be fine, assuming both sets are large enough to be representative of the use case you’re trying to solve and that both sets are balanced. Do you have specific concerns in your use case?
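To make the 80/20 idea concrete, here is a minimal sketch of a shuffled train/test split over a list of stories. This is a generic illustration, not Rasa’s own splitting logic (Rasa has a `rasa data split` command for NLU data); the function name and story representation are hypothetical.

```python
import random

def split_stories(stories, train_fraction=0.8, seed=42):
    """Shuffle a list of stories and split it into train/test sets.

    With train_fraction=0.8 this yields an 80/20 split. A fixed seed keeps
    the split reproducible between runs.
    """
    rng = random.Random(seed)
    shuffled = stories[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical stand-ins for parsed story names.
stories = [f"story_{i}" for i in range(100)]
train, test = split_stories(stories)
print(len(train), len(test))  # 80 20
```

Note that a purely random split does not guarantee the balance discussed above: you should still check that easy and hard stories end up on both sides of the cut.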
In general, when it comes to judging models, this Venn diagram is the best advice that I can give on the topic:

(Venn diagram: the stories you optimise towards vs. the stories your users generate)
The stories you optimise towards may be different from the stories that your users generate. Your main concern, therefore, is to make sure that the stories that occur in real life are also the stories that you optimise towards. Your chatbot may be really good at FAQ because FAQ flows are properly represented in your stories, but if your users ask for chitchat instead, then you’re at risk of overfitting.
Koaning, thank you for your response. My concern is that I am using the 80/20 rule for the NLU but was not doing the same for the core.
Pardon my ignorance, I am fairly new to using Rasa, but it sounds like you are saying that we should create stories for any and all interactions our bot may come across? Without trained stories, the bot may not function properly?
> Pardon my ignorance
Absolutely no worries. I’m here to understand what our users find confusing and you’re asking important questions.
> it sounds like you are saying is that we should create stories for any and all interactions our bot may come across
Yes. The stories that you create represent flows of dialogue that the digital assistant needs to be able to handle. Also note that the entities that are predicted by the NLU part of the library are used as input for the “dialogue policy” models that predict the next best action to take. If you’re interested in more details about this, you may appreciate this video on TED (it’s but one policy model, but it might help you understand some underlying details).
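As a loose illustration of that flow of information, here is a toy sketch in which NLU output (an intent plus a set of entities) is mapped to a next action. This is emphatically not Rasa’s actual API: real policies such as TED condition on the whole conversation history, and the rule table, function name, and action names below are all made up for the example.

```python
def toy_policy(intent, entities, rules):
    """Pick the next action from NLU output.

    Looks up (intent, entities) first, then falls back to an
    entity-agnostic rule for the intent, then to a default fallback.
    A drastically simplified stand-in for a dialogue policy.
    """
    return rules.get(
        (intent, frozenset(entities)),
        rules.get((intent, None), "action_default_fallback"),
    )

# Hypothetical rule table mapping NLU output to the next action.
rules = {
    ("greet", None): "utter_greet",
    ("book_flight", frozenset({"destination"})): "action_search_flights",
    ("book_flight", None): "utter_ask_destination",
}

print(toy_policy("book_flight", {"destination"}, rules))  # action_search_flights
print(toy_policy("book_flight", set(), rules))            # utter_ask_destination
```

The point of the sketch is only this: the entities the NLU extracts change which action the policy picks, which is why story quality on the NLU side and the core side are intertwined.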
> Without trained stories the bot may not function properly?
This is also true.
> Yes. The stories that you create represent flows of dialogue that the digital assistant needs to be able to handle
Normally I don’t mix train data and test data, but in the case of story paths, is it wise to mix?
I think it is impossible to have stories in train/test that are 100% different. There’s bound to be overlap. Most chatbots need to be able to greet and say goodbye. It’d be weird to have no stories with `goodbye` in the train set. It’d also be weird to have no stories with `hello` in the test set.
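One cheap sanity check for the “cover the same ground” advice is to compare which intents appear in the train stories versus the test stories. The sketch below is a simplified illustration, assuming each story has already been reduced to a list of user intent names (a hypothetical representation, not a parsed Rasa story file):

```python
def intent_coverage(train_stories, test_stories):
    """Report which intents are shared between, or exclusive to,
    the train and test story sets. Each story is a list of intent names."""
    train_intents = {intent for story in train_stories for intent in story}
    test_intents = {intent for story in test_stories for intent in story}
    return {
        "shared": train_intents & test_intents,
        "train_only": train_intents - test_intents,
        "test_only": test_intents - train_intents,
    }

train = [["greet", "ask_faq", "goodbye"], ["greet", "chitchat"]]
test = [["greet", "ask_faq", "goodbye"]]
report = intent_coverage(train, test)
print(report["train_only"])  # {'chitchat'}: trained on, but never tested
```

A large `train_only` or `test_only` set is a hint that your summary statistics won’t mean much, per the imbalance caveat above.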
That said, I’d keep the test set as a unit test of sorts. These are examples that should be representative of your use-case and you’ll use this set as a proxy to determine if you like the results that you see.