Bootstrapping domain training

I’d like to learn of any public datasets that we can use to bootstrap domain models. We’re using RASA to power a voice-first conversational model.

Does anyone have experience distilling call recordings into NLU training data? Are there any datasets already available that I can use?

Thanks.

We’ve got some demos that are open source. These bots also contain datasets that you can use for general benchmarking.

Note, though, that these all assume conversations over text, not voice.

It may also be worthwhile to point out that it’s tricky to benchmark your approach using somebody else’s data.

In the end the stories/conversations that you optimise for should be the stories/conversations that your users generate. If the overlap between these two datasets is not big, you may be at risk of optimising something that won’t help your end-users.
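
If you want a rough feel for that overlap before committing to a public dataset, one simple option is to compare word vocabularies. The sketch below is only an illustration, not anything built into Rasa: the function names and the sample utterances are made up, and token overlap is a crude proxy that says nothing about intent coverage or speech-recognition errors in voice transcripts.

```python
import re


def vocabulary(utterances):
    """Return the set of lowercase word tokens across all utterances."""
    tokens = set()
    for text in utterances:
        tokens.update(re.findall(r"[a-z0-9']+", text.lower()))
    return tokens


def jaccard_overlap(first, second):
    """Shared vocabulary divided by combined vocabulary (0.0 to 1.0)."""
    vocab_a, vocab_b = vocabulary(first), vocabulary(second)
    if not vocab_a and not vocab_b:
        return 0.0
    return len(vocab_a & vocab_b) / len(vocab_a | vocab_b)


def coverage(user_utterances, training_utterances):
    """Fraction of the users' vocabulary that also appears in the training data."""
    user_vocab = vocabulary(user_utterances)
    if not user_vocab:
        return 0.0
    return len(user_vocab & vocabulary(training_utterances)) / len(user_vocab)


if __name__ == "__main__":
    # Hypothetical examples; replace with your public dataset and real transcripts.
    public_examples = ["book a table for two", "what is the weather today"]
    user_examples = ["uh can I book a table", "cancel my booking please"]
    print("jaccard:", round(jaccard_overlap(public_examples, user_examples), 2))
    print("coverage:", round(coverage(user_examples, public_examples), 2))
```

A low coverage number suggests the borrowed data is describing different conversations from the ones your users actually have, so improvements measured on it may not carry over.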