I imagine there is a balance to be struck between too much and not enough training data.
Say I have an intent that includes one slot, and I have come up with a dozen ways this intent (question) could be asked, not counting the variation within the slot. If I write out all 12 questions using the same slot value in every one, Rasa seems to have a hard time generalizing to other, unseen slot values. That is probably expected.
To overcome that issue, say I multiply those 12 questions by 2 or 3 by listing them all again with a second and then a third slot value. That should help Rasa generalize on the slot value a little better. Is this enough?
Say I have a database of 300 possible values for an entity, all of which are valid values for this slot. Should I generate 300 x 12 questions as training data? Is that optimal? It’s certainly doable, but I’m not sure why it would be necessary. Doesn’t Rasa already do some kind of generalization around the slot? Is there a way for me to tell Rasa all the valid values of an entity, and then tell it that any of those values can fill the slot?
This is a question about what appropriate training data looks like, not about how to generate it. I have seen similar questions in the forum, but the only replies are pointers to tools that generate training data or to UIs that help with training. I can generate a lot of training data if I need to, using a Python script and my database.
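To make the two extremes concrete, here is a minimal sketch of the kind of script I mean. It is only an illustration: the function name `generate_examples`, the entity name `item`, and the placeholder templates and values are all made up, and I am assuming the usual `[value](entity)` annotation style for the training examples. It crosses each of the 12 phrasings with either a small random sample of values or with all 300.

```python
import random

def generate_examples(templates, values, values_per_template, entity="item", seed=0):
    """Cross each template with a random sample of entity values.

    Returns annotated lines such as "... about [value5](item)".
    The entity name "item" is a hypothetical placeholder.
    """
    rng = random.Random(seed)
    examples = []
    for template in templates:
        # Sample without replacement so each template gets distinct values.
        for value in rng.sample(values, values_per_template):
            examples.append(template.format(slot=f"[{value}]({entity})"))
    return examples

# Placeholder data standing in for my 12 phrasings and 300 database values.
templates = [f"question phrasing {i} about {{slot}}" for i in range(12)]
values = [f"value{i}" for i in range(300)]

subset = generate_examples(templates, values, values_per_template=3)
full = generate_examples(templates, values, values_per_template=len(values))
print(len(subset), len(full))
```

The first call gives 12 x 3 = 36 examples and the second gives 12 x 300 = 3,600. My question is which end of that range Rasa actually needs, or whether something in between is the better choice.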