How large is a typical dataset for training Rasa NLU?

Hi all !

I’m starting to use Rasa NLU for some classification tasks.

I’m wondering: how big should my training dataset be in order to achieve a pretty good classification? :thinking:

I assume a lot of you have already built products using Rasa NLU, with more or less success. Could you share the performance you got, given the size of your dataset and the number of intents? I guess it would help a lot of newcomers like me.

Example: I trained Rasa NLU on 100 000 sentences to classify 20 intents and it got 80% accuracy on my test dataset

Thank you :grin: !

Hi,

Welcome to the community.

I am still building a bot using Rasa, but I will definitely share my experience once it is ready.

Unfortunately, there is no straightforward answer to this question. It depends a lot on your intents and entities.

If your intents or entities are easily confused with one another, then you definitely need more training data. The amount of training data needed for each intent also grows with every new intent or entity you add.

To get a feel for how well it performs, you can also evaluate the model after you have trained it.

You can use Chatito to generate more training data.
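To give an idea of what template-based generation looks like, here is a minimal Python sketch. It is not Chatito and does not use its syntax; it only illustrates the general idea of expanding sentence templates with slot values. The intents, templates, and slot values (`TEMPLATES`, `SLOTS`) are made up for illustration.

```python
from itertools import product
from string import Formatter

# Hypothetical slot values and templates, invented for this illustration.
SLOTS = {
    "city": ["Paris", "Berlin"],
    "when": ["today", "tomorrow"],
}
TEMPLATES = {
    "book_flight": ["book a flight to {city}", "I want to fly to {city} {when}"],
    "check_weather": ["what is the weather in {city} {when}"],
}


def expand(template, slots):
    """Return every sentence obtained by filling the template's slots."""
    # Collect the slot names that appear in the template, e.g. ["city", "when"].
    names = [name for _, name, _, _ in Formatter().parse(template) if name]
    # Try every combination of values for those slots.
    combos = product(*(slots[name] for name in names))
    return [template.format(**dict(zip(names, combo))) for combo in combos]


def generate(templates, slots):
    """Return a list of (intent, sentence) training examples."""
    return [
        (intent, sentence)
        for intent, intent_templates in templates.items()
        for template in intent_templates
        for sentence in expand(template, slots)
    ]


examples = generate(TEMPLATES, SLOTS)
for intent, sentence in examples:
    print(intent, "|", sentence)
```

Keep in mind that generated sentences all share the same few patterns, so they are no substitute for real user messages.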

All the best.

Hi,

Indeed, it makes sense that the more intents I have and the blurrier the border between two intents, the more data I will need for Rasa to find a pattern for each intent and classify sentences correctly. Thank you for pointing that out.

May I still ask how big your training dataset is?

Thanks

Is 80% not pretty good? :smiley: Take a look at the NLU evaluation script to see which intents are getting confused, so you can optimize performance: https://rasa.com/docs/core/evaluation/
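If you just want to see the idea behind "which intents are getting confused", here is a rough Python sketch that does not use Rasa at all. It assumes you already have the true intent and the predicted intent for each test sentence as two plain lists; the label names and the helper functions `accuracy` and `confused_pairs` are made up for illustration.

```python
from collections import Counter


def accuracy(true_intents, predicted_intents):
    """Fraction of test sentences whose predicted intent matches the true one."""
    correct = sum(t == p for t, p in zip(true_intents, predicted_intents))
    return correct / len(true_intents)


def confused_pairs(true_intents, predicted_intents):
    """Count (true, predicted) pairs for misclassified sentences, most frequent first."""
    errors = Counter(
        (t, p) for t, p in zip(true_intents, predicted_intents) if t != p
    )
    return errors.most_common()


# Hypothetical test results, invented for this illustration.
true_intents = ["greet", "greet", "goodbye", "book", "book", "book"]
predicted_intents = ["greet", "goodbye", "goodbye", "book", "cancel", "cancel"]

print("accuracy:", accuracy(true_intents, predicted_intents))
for (true, predicted), count in confused_pairs(true_intents, predicted_intents):
    print(f"{true} predicted as {predicted}: {count}")
```

The pairs at the top of the list are the ones where adding more (or more distinct) training examples is likely to help most.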