Advice on generating the right amount of NLU training data

I imagine there is a balance to be struck between too much and not enough training data.

Say I have an intent that includes one slot, and say I have come up with a dozen ways this intent (question) could be asked, not counting the variation within the slot. If I write out all 12 questions but use one and the same value for the slot in every one of them, Rasa seems to have a hard time generalizing to other, unseen slot values. That is probably expected.

To overcome that issue, say I multiply those 12 questions by two or three, listing them all again with a second and then a third slot value. That should help Rasa generalize on the slot value a little better. Is this enough?

Say I have a database of 300 possible values for an entity, all of which are valid values for this slot. Should I generate 300 x 12 questions as training data? Is that optimal? It’s certainly doable, but I’m not sure why it would be necessary. Doesn’t Rasa already do some kind of generalization around the slot? Is there a way for me to tell Rasa all valid values of an entity and then tell it that any of those entity values can fill the slot?

This is a question about what appropriate training data looks like, not about how to generate it. I have seen similar questions in the forum, but the only replies are pointers to tools that generate training data or UIs that help with training. I can generate a lot of training data if I need to, using a Python script and my database.


Hi @tomp. Not 100% sure I understand your problem. When you test your NLU model, do you get the entities (which should be set as slots) extracted correctly?

Sometimes but not always.

I wouldn’t call it a problem, per se. I’m hoping the Rasa community will come up with a set of rules-of-thumb for systematically designing and building training data sets.

I expect that my NLU model will get better with more training data. And it might get as good as possible if I provide it with training data that covers every possible combination of question sentence structure and named-entity content. I just wanted to hear other people’s impressions of my idea to generate training data as the Cartesian product (every possible combination) of question format and entity content. Certainly, I don’t expect there can be a better or more complete training set than this. But I wonder whether it is necessary.

If no one has experience with this, then I will run some experiments myself.

To explain the Cartesian product idea, say I have a “complete” set of question sentence templates like:

“I want to buy ___.” “Could you help me select ___?” “I’m in the market for ___?” …

And I have a complete set of entity names for my domain such as:

“car insurance” “auto insurance” “home owners insurance” …

I can then write a script that generates training data as all combinations of sentence template + slot fillers. Make sense?
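As a sketch, the generating script boils down to something like this (the `insurance_type` entity name and both lists are placeholders for my real data; the output uses Rasa’s markdown-style `[value](entity)` annotation):

```python
from itertools import product

# Placeholder templates and slot fillers -- in practice these would come
# from my hand-written list and my database respectively.
templates = [
    "I want to buy {}.",
    "Could you help me select {}?",
    "I'm in the market for {}.",
]
entity_values = [
    "car insurance",
    "auto insurance",
    "home owners insurance",
]

def generate_examples(templates, values, entity="insurance_type"):
    """Yield every template filled with every value (the Cartesian product),
    annotated in Rasa's [value](entity) markdown style."""
    for template, value in product(templates, values):
        yield template.format(f"[{value}]({entity})")

examples = list(generate_examples(templates, entity_values))
print(len(examples))  # 3 templates x 3 values = 9 examples
```

With 12 templates and 300 database values, this same product produces 3,600 examples, which is exactly the blow-up I’m asking about.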

I haven’t dug into the guts of the Rasa code yet, but I’m guessing that the NLU model is (or, if not yet, should eventually be) modular enough that I don’t need to give it every possible combination; it should learn to recognize every combination in a more modular way. I even came across something in the Rasa documentation a while back suggesting that it generates additional combinations of training data, but that may have been talking only about the dialogue model, not the NLU model.

Thoughts and pointers to documentation are both welcome.


I think that’s stories, so the Core part. If I remember right, the default augmentation factor is 20, and it basically puzzles your stories together in different ways.

I really like your question in general here. I’m struggling to find advice on how to build training data and stories that goes beyond “add stuff until it works”. Surely adding every single possibility should not be required; it would be some pretty shitty deep learning if it couldn’t generalise at all. I remember reading something about using Rasa lookup tables rather than a gazillion examples of entities, but I’m not sure where, so treat that as unreliable info.
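For what it’s worth, lookup tables are a real Rasa feature. In the 2.x YAML training-data format a lookup table looks roughly like this (the entity name and values here are placeholders). One caveat I do remember from the docs: the entity still has to appear annotated in some regular training examples for the lookup table to take effect.

```yaml
nlu:
- lookup: insurance_product
  examples: |
    - car insurance
    - auto insurance
    - home owners insurance
```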

Certainly entity recognition improves markedly with larger training sets.

The risk is overfitting: introducing an excessive preference for some paths in the machine-learning model, so it is less able to interpolate the dodgy in-between ones.

I’m no machine-learning expert, but I gather it’s an art, providing balanced rather than comprehensive training data. Suck it and see. Apply generation only as needed. Be sure to generate test data as well, and run it against your model so you can be quantitative about the effects of changes.
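To make the quantitative part concrete, here is a minimal pure-Python sketch of a per-intent F1 score over held-out predictions (the labels are invented for illustration; `rasa test nlu` produces similar reports for real models):

```python
def per_intent_f1(y_true, y_pred):
    """Compute per-intent F1 from gold and predicted intent labels."""
    scores = {}
    for intent in set(y_true) | set(y_pred):
        tp = sum(t == p == intent for t, p in zip(y_true, y_pred))
        fp = sum(p == intent and t != intent for t, p in zip(y_true, y_pred))
        fn = sum(t == intent and p != intent for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores[intent] = 2 * precision * recall / denom if denom else 0.0
    return scores

# Invented gold labels and predictions for a held-out test set.
y_true = ["buy", "buy", "greet", "goodbye", "buy", "greet"]
y_pred = ["buy", "greet", "greet", "goodbye", "buy", "greet"]

for intent, f1 in sorted(per_intent_f1(y_true, y_pred).items()):
    print(f"{intent}: {f1:.2f}")
```

Tracking these per-intent numbers before and after a data change shows which intents the new data actually helped, rather than one global accuracy figure.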

I’m using chatito to generate variations of slot values for large sets like countries. It offers options for sampling subsets and sample distribution.
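For anyone who hasn’t seen it, a chatito definition looks roughly like this (from memory, so double-check against the chatito docs; the intent, slot, and counts are made up). The `('training': …)` argument is what drives the subset sampling:

```
%[ask_insurance]('training': '50', 'testing': '10')
    ~[i want] to buy @[insurance_type]

@[insurance_type]
    car insurance
    auto insurance
    home owners insurance

~[i want]
    I want
    I'd like
    I need
```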

I figure a combination of hand-carved and generated data to help with slots is a good place to start, improved by capturing and integrating real user interactions.


It has been a while since this post, but it is still a relevant question, I believe :slight_smile: Any thoughts or experiences from the Rasa team on this?

When it comes to quantity of training data, is it like:

  • “the more (data), the better”?
  • Can you add too many utterances?
  • Is there a way to measure the quality of utterances (other than the trial-and-error way: adding them to the training set, retraining, and checking performance)? (i.e. is it useful to add an utterance to the training set or not)

Or do you have any suggestions for reading about research done on this matter? Maybe a comparison between training data generated by humans, by tools like chatito, or by language models? Is the latter useful?

@koaning, maybe you have a point-of-view on this based on experience? :slight_smile:

For me, a common tactic for clarity is to approach a statement in extremes.

1. The more data the better?

Well, no. If you add data that isn’t relevant then it won’t contribute much. For example: I can add lots of questions on products to my training data … but if these are products that my company doesn’t even sell … then I might be contributing to a problem rather than a solution.

I wouldn’t want to add data that doesn’t resemble my users because I might overfit on it.

2. Can you add too many utterances?

Sure, see the same argument as above.

3. Quality of Utterances

This is hard, and if I am frank, it’s easily the hardest part about machine learning. Really, I wish I had a better answer. It’s incredibly hard to quantify quality.

I’ve written lots of blog posts that keep hitting this point, if you’re interested (exhibit a, exhibit b, exhibit c, exhibit d and exhibit e).

Now internally, we’re certainly exploring ways to help generate meaningful training data. There are some things I can share too!

My colleague @dakshvar22 has been doing a lot of work in this space. He’s investigating whether we might be able to use models like GPT-2 to generate paraphrases. There are some promising experiments, but compute time remains a challenge. It certainly seems to be more reliable than chatito, though. The downside of chatito is that you generate very artificial sentences that repeat the same structure, which doesn’t resemble actual users; the risk is that your ML model will overfit on this. The paraphrasing model is still a work in progress.

I’m personally doing some work in the realm of bulk labeling. You might be able to use social media data as a “starting point” for your virtual assistant. It should be similar. But! This won’t be 100% perfect either. The way that your users will interact with a chatbot is different than how people talk on Twitter.

All of this effort is nice, but generated data via these routes should always be considered secondary. It is far less relevant than data that is generated by actual users. If you want your chatbot to talk well, you as a designer need to listen to your users.

Aaaand that’s the unfortunate bit. Users change over time! Your company changes over time! So yes, your validation set should change over time too! The world is a moving target.

Data quality remains hard. This is also why we’re doing a lot of work on Rasa X.


Many thanks for your answers! Will definitely go through your blogs asap. Very interesting to get your point of view on this part.

Fully agree that in the end, Conversation Driven Development is the best way to go!

@koaning Hi Vincent,

If you would need to compare NLU models (or Conv AI models in general), are there any metrics you suggest?

Let’s say you have a base NLU & Core model (trained on dataset A). You then train two extra NLU models: Model 1 (= dataset A + extra dataset 1) and Model 2 (= dataset A + extra dataset 2), with extra NLU data only. What are some best-practice metrics, according to your experience, to compare these?
As in: how do you go beyond accuracy (or F1 score) when it comes to quantifying their performance?

To be completely honest, my favorite method is “eye-balling”. Summary statistics are all very grand, but I try to understand what kinds of mistakes are often made. I’m interested in knowing how often a model will fail, but I’m even more interested in understanding the types of scenarios in which it fails. This is much more a qualitative endeavor than a quantitative one.
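A small script helps with that kind of eye-balling: group the misclassified utterances by (gold, predicted) pair so recurring confusion patterns stand out. The rows below are invented for illustration; in practice they would come from your evaluation output:

```python
from collections import defaultdict

# Invented evaluation rows: (utterance, gold intent, predicted intent).
results = [
    ("I want to buy car insurance", "buy", "buy"),
    ("what do you sell", "browse", "buy"),
    ("do you have home insurance", "browse", "buy"),
    ("bye for now", "goodbye", "goodbye"),
]

# Bucket only the mistakes, keyed by (gold, predicted) pair.
confusions = defaultdict(list)
for text, gold, pred in results:
    if gold != pred:
        confusions[(gold, pred)].append(text)

# Most frequent confusion first: these are the scenarios worth studying.
for (gold, pred), texts in sorted(confusions.items(), key=lambda kv: -len(kv[1])):
    print(f"{gold} -> {pred} ({len(texts)} mistakes)")
    for text in texts:
        print(f"  {text!r}")
```

Reading the actual utterances in each bucket tells you the *type* of failure, which a single summary score never will.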

Practically though, be careful with judging a system on something other than actual production data. In the assistant use-case, don’t distract yourself too much with datasets generated by non-users. In the end, the users matter way more than a benchmark in a CSV file.

Fair point! I fully understand what you are saying, and I believe that looking at the actual data/results, and ideally taking end-user experience into account, is the best way to go. However, in this case I was just looking for inspiration, in case there is a way to get a first view of the performance already (but a bit more than a general F1 score :smiley:). I will think about a sort of “A/B test” setup then, maybe!

FYI: I found this one when doing some research: [2005.04118] Beyond Accuracy: Behavioral Testing of NLP models with CheckList. It comes with a GitHub repo.
It is not fully suitable for (end-to-end) conversational AI, but interesting to read.

Ah yeah! That’s a pretty cool paper! There’s some good stuff in there.

One thing I am currently exploring (and building!) is a tool to check the effect of spelling errors. To me, it makes sense to have a rasalit app where you can enter a word and simulate what might happen when we inject common spelling errors. It strikes me as something that’s incredibly common on iPhones.
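A toy sketch of the injection side, using QWERTY-neighbour substitution (the neighbour map is abridged and invented here; a real tool would model more error types than this):

```python
import random

# Abridged map of neighbouring keys on a QWERTY layout.
NEIGHBOURS = {
    "a": "qws", "e": "wrd", "i": "uok", "o": "ipl", "n": "bm",
    "s": "adw", "r": "etf", "t": "ryg", "h": "gjn", "l": "ko",
}

def inject_typo(word, rng):
    """Replace one character with a keyboard neighbour, if we know any."""
    candidates = [i for i, ch in enumerate(word) if ch in NEIGHBOURS]
    if not candidates:
        return word
    i = rng.choice(candidates)
    replacement = rng.choice(NEIGHBOURS[word[i]])
    return word[:i] + replacement + word[i + 1:]

rng = random.Random(0)
print([inject_typo("insurance", rng) for _ in range(3)])
```

Running intent predictions on the misspelled variants next to the clean word gives a quick read on how robust the pipeline is to this kind of noise.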

Work in progress though.