How to avoid duplicate stories when creating more samples through interactive learning

Hi,

I have noticed that many duplicate stories are created during interactive learning when I add more sample user inputs for an intent. So my questions are:

  1. Do we really need to create a story for each sample user input added to an intent?
  2. If we have duplicate stories for each intent's data, will it help in any way with machine learning or predictions?
  3. If they are not needed, is it possible to avoid adding those duplicate stories (if we know the story is already available) when exporting the data at the end of the interactive learning process?
  4. If we have more duplicate stories in stories.md, will it affect the performance of training or of the bot conversation? If yes, how can we avoid it?

Please advise.

Thank you, Anoop Mohan

Anyone, please share your thoughts on this.

Hi, does anyone have any suggestions on this?

Hey Anoop, I agree that we shouldn’t be creating duplicate stories through interactive learning. With regard to your 4th point, I believe it will increase training time while only (maybe) minimally improving the performance of a machine learning policy.

With regard to 3: why are you going through the process of doing IL for this story path if you know it’s already available?

Can you clarify what you mean by question 1?

Thank you @erohmensing for the response.

In my case, I don’t really want to update the nlu.md, stories.md, and domain.yml files manually when training the bot with new data. Hence, the only way to add a new intent and define entities for it is interactive learning: we don’t need to touch any of those files (nlu.md, stories.md, domain.yml), and IL itself updates them with the new data and stories.

Now, if we want to add more data samples to the same intent in the future, I again don’t want to update nlu.md manually, since editing the file by hand to define entities and synonyms is not a good idea. So I still need to run the interactive learning process to add the new samples under an existing intent.

So, whenever we run IL to add a new sample to an existing intent, I believe stories.md also gets updated with a duplicate story, and I don’t see any way to avoid this.

To summarize, if I add 10 data samples for an intent by running IL, then 10 duplicate stories will be added to stories.md.

Regarding my 1st question: Do we really need these 10 stories (as in my example above)? I believe we need only 1 story in this case, since all the remaining stories are duplicates. I know that we might need separate stories for different paths (e.g. happy or sad), but we don’t need duplicate stories for the same path (10 stories for the same happy path).

Regarding my 3rd question: Is there any way to avoid creating duplicate stories like this during IL?

Please confirm. Let me know if anything in my question is still unclear.

Thank you.

I reckon it would actually be much faster to add data by editing the files themselves than by going through IL this time, especially if you’re just going through the same path each time just to update NLU data. That being said, there isn’t currently a way to avoid creating duplicate stories during IL, but I’ve created an issue for it. As we’ll probably be working hard to fix bugs on the new Rasa X product, I can’t say when we’ll get around to it. If you’d be interested in contributing to solving this problem, we’d happily take a contribution.
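If typing into nlu.md by hand is the main worry, the file edit can be scripted. Below is a rough sketch, not an official Rasa tool: `add_examples` is a hypothetical helper name, and it assumes the Markdown NLU format where each intent starts with a `## intent:<name>` header followed by `- example` lines. Entity annotations such as `[Berlin](city)` are passed through as plain text, so you still have to write them correctly yourself.

```python
def add_examples(nlu_md, intent, examples):
    """Return nlu_md with new examples added under the given intent.

    Hypothetical helper: assumes '## intent:<name>' headers followed by
    '- example' lines. Examples already present in the section are skipped.
    """
    lines = nlu_md.splitlines()
    header = f"## intent:{intent}"

    # Find the intent's header line, if the intent already exists.
    start = next(
        (i for i, line in enumerate(lines) if line.strip() == header), None
    )
    if start is None:
        # Intent not present yet: append a new section at the end.
        if lines and lines[-1].strip():
            lines.append("")
        lines.append(header)
        lines.extend(f"- {example}" for example in examples)
        return "\n".join(lines) + "\n"

    # The section runs until the next '## ' header or the end of the file.
    end = next(
        (i for i in range(start + 1, len(lines)) if lines[i].startswith("## ")),
        len(lines),
    )

    # Collect existing examples and find the line after the last one.
    existing = set()
    insert_at = start + 1
    for i in range(start + 1, end):
        stripped = lines[i].strip()
        if stripped.startswith("- "):
            existing.add(stripped[2:].strip())
            insert_at = i + 1

    # Add only the examples that are not already in the section.
    new_lines = []
    for example in examples:
        if example not in existing:
            new_lines.append(f"- {example}")
            existing.add(example)
    lines[insert_at:insert_at] = new_lines
    return "\n".join(lines) + "\n"
```

You would read nlu.md into a string, call `add_examples(text, "greet", ["good morning"])`, and write the result back. Since this never touches stories.md, no duplicate stories are produced.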

Rasa Core performs deduplication of stories before passing them to the policies.
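So duplicates in stories.md are mostly a tidiness problem. If you still want to clean up the exported file, here is a minimal sketch of doing it offline. It is not what Rasa Core does internally; `dedupe_stories` is a hypothetical helper, and it assumes the file contains only stories in the Markdown format, each starting with a `## <story name>` header. Two stories count as duplicates when their steps match exactly, whatever their names are.

```python
def dedupe_stories(stories_md):
    """Return stories_md with exact duplicate stories removed.

    Hypothetical helper: assumes every story starts with a '## ' header.
    Blank lines and any lines before the first header are dropped.
    The first occurrence of each story is kept.
    """
    # Split the file into stories: [header, step, step, ...]
    stories = []
    for line in stories_md.splitlines():
        if line.startswith("## "):
            stories.append([line])
        elif stories and line.strip():
            stories[-1].append(line.rstrip())

    seen = set()
    unique = []
    for story in stories:
        # Compare on the steps only, ignoring the story name and indentation.
        key = tuple(step.strip() for step in story[1:])
        if key not in seen:
            seen.add(key)
            unique.append("\n".join(story))

    return "\n\n".join(unique) + "\n"
```

Note that steps carrying different entity values (for example a different slot value on the same intent) are treated as different stories here, since the comparison is an exact text match.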