I am trying to understand best practices for including stories as end-to-end test cases vs. additional training data. If these cases are included in the training data, the model would overfit to them and therefore easily pass them as end-to-end test cases. It would seem best, then, to reserve the test cases for edge cases, while mission-critical cases are included in both the training data and the test cases for good measure (although not as a test of generalizability). I wanted to hear what others think, so I figured I'd start a discussion.
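For concreteness, here is a minimal sketch of the kind of end-to-end test story I mean (Rasa 2.x/3.x YAML; the intent and action names like `greet`, `check_balance`, `utter_greet`, and `action_check_balance` are made-up placeholders). As far as I understand, test stories carry the literal user message alongside the intent, so `rasa test` evaluates both the NLU prediction and the dialogue policy:

```yaml
# tests/test_stories.yml: end-to-end test story (sketch; names are placeholders)
stories:
- story: check balance happy path (test)
  steps:
  # each user step includes the raw message plus the expected intent,
  # so both intent classification and action prediction are checked
  - user: |
      hello there
    intent: greet
  - action: utter_greet
  - user: |
      what is my balance?
    intent: check_balance
  - action: action_check_balance
```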
Hi Will: good question! I also ran into this issue.
Although I don't have a definitive answer, I'd like to add some considerations to this discussion.
When designing stories, it can be best practice to split up long stories; see [story breaks](Writing Conversation Data).
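As a rough sketch of such a split, Rasa lets you break one long story into shorter ones with checkpoints (the story and checkpoint names below are made up, and the docs advise using checkpoints sparingly):

```yaml
# data/stories.yml: one long story split in two via a checkpoint (sketch)
stories:
- story: greeting part
  steps:
  - intent: greet
  - action: utter_greet
  - checkpoint: greeted        # end of the first fragment

- story: balance check after greeting
  steps:
  - checkpoint: greeted        # continues any story that reached this point
  - intent: check_balance
  - action: action_check_balance
```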
When annotating and testing conversations, I think a good conversation can be worth adding to the training data, even when Rasa already predicted the whole conversation correctly. Pro: although it is predicted correctly at the moment, it could be predicted wrongly later on (for example, after retraining with new data), and you don't want a (production) user to end up in a broken dialogue.
Contra: overfitting, especially when the conversation as a whole is already part of the training data.
I agree with your suggestion: in cases where we add such a correct conversation to the training data, we should also add it to the test set.
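Assuming the default Rasa project layout, that would mean the same conversation appears twice: once as a training story in `data/stories.yml` (intents only, no user text), and once in the end-to-end format in `tests/test_stories.yml`, as in your sketch above. The duplicated test then acts as a regression check rather than a test of generalization. A sketch of the training-story counterpart:

```yaml
# data/stories.yml: training-story counterpart of the test story (sketch)
stories:
- story: check balance happy path
  steps:
  - intent: greet
  - action: utter_greet
  - intent: check_balance
  - action: action_check_balance
```

Running `rasa test` would then evaluate the stories under `tests/` against the trained model.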
And your thoughts on mission-critical cases as well as edge cases seem like a good rule of thumb!
Hope some others join this discussion; looking forward to more considerations and implications!