Rasa split data nlu fails. Which algorithm is implemented?

Hi, I want to split my data with the command rasa data split nlu --training-fraction 0.8, but the number of samples in the test file is not exactly 20% of the total. In fact, it is 21.7%. I have two questions:

  • Why does this happen?
  • Which data-splitting algorithm is built into Rasa? Is it sklearn.model_selection.train_test_split, as in DIETClassifier?

Does 20% correspond to an integer number of samples?

I have 701 samples (18 distinct intents, 0 entities). I want my test set to have 701 * 20% = 140 or 141 samples, but Rasa output 148 samples. Could you explain the algorithm?

here is the method: rasa/training_data.py at 003403dc8fe537da021ee05511addb4e2f605111 · RasaHQ/rasa · GitHub

Hi @Ghostvv. After several months, I found the reason why rasa data split does not give the exact number of training samples.
Say we have X samples overall (x1 samples of label l1, x2 samples of label l2, …), so x1 + x2 + … = X, and training-fraction is 0.8.
In the code you sent, the number of training samples is A = int(0.8 * x1) + int(0.8 * x2) + …
Mathematically, A <= int(0.8 * X), because each int() rounds down separately for its own label.

I am wondering what the upper bound on the difference between int(0.8 * X) and A is :)
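
To make the rounding effect above concrete, here is a minimal sketch that only mirrors the formula described in this post; it is not Rasa's actual code. The intent names and counts are made up for illustration, and the helper names per_label_train_size and global_train_size are hypothetical.

```python
def per_label_train_size(label_counts, fraction):
    """Truncate the training share separately for every label, then sum.

    This is A = int(f * x1) + int(f * x2) + ... from the post above.
    """
    return sum(int(fraction * n) for n in label_counts.values())


def global_train_size(label_counts, fraction):
    """Truncate once on the total, i.e. int(f * X)."""
    return int(fraction * sum(label_counts.values()))


# Hypothetical intent counts, chosen only for illustration.
label_counts = {"greet": 7, "goodbye": 9, "affirm": 13, "deny": 6}
fraction = 0.8

a = per_label_train_size(label_counts, fraction)
expected = global_train_size(label_counts, fraction)
shortfall = expected - a

print("per-label total A:", a)
print("int(0.8 * X):", expected)
print("training examples lost to rounding:", shortfall)
```

Each int() call discards less than one sample, so with k labels the difference int(0.8 * X) - A is strictly less than k, i.e. at most k - 1. Under that reasoning, a dataset with 18 intents could lose up to 17 training examples to the test set.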

Oh, so we lose training examples. That is a bug.


Could you please create an issue for that?

On my way!

The issue has been submitted as #6582.
