Rasa split data nlu fails. Which algorithm is implemented?

duongkstn · May 25, 2020, 4:23am

Hi, I want to split my data with the command rasa data split nlu --training-fraction 0.8, but the number of samples in the test file is not exactly 20% of the total. In fact, it is 21.7%. I have two questions:

Why does this happen?
Which algorithm does Rasa use to split the data? Is it

sklearn.model_selection.train_test_split

as used for DIETClassifier?

Ghostvv · May 25, 2020, 12:57pm

Does 20% of your data correspond to an integer number of examples?

duongkstn · May 26, 2020, 2:11am

I have 701 samples (18 distinct intents, 0 entities). I want my test set to have 701 * 20% = 140 or 141 samples, but Rasa output 148 samples. Could you explain the algorithm?

Ghostvv · May 26, 2020, 7:35am

Here is the method: rasa/training_data.py at 003403dc8fe537da021ee05511addb4e2f605111 · RasaHQ/rasa · GitHub

duongkstn · September 4, 2020, 5:19am

Hi @Ghostvv. After several months, I found the reason why rasa data split does not give the exact number of training samples.
Say we have X samples overall (x1 samples of label l1, x2 samples of label l2, …) and training-fraction is 0.8. (Note: x1 + x2 + … = X.)
In the code you sent, the number of training samples is A = int(0.8 * x1) + int(0.8 * x2) + …
Mathematically, A <= int(0.8 * X), because each int() truncates its own label's count separately.

I am wondering what the upper bound on the difference between int(0.8 * X) and A is.
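
To illustrate the per-label truncation described above, here is a minimal sketch. It is not Rasa's actual code: the function names and the label counts are hypothetical and chosen only to show the effect. It compares A (the sum of per-label truncated counts) with int(0.8 * X).

```python
def per_label_train_size(label_counts, train_frac):
    """A = int(train_frac * x1) + int(train_frac * x2) + ..."""
    return sum(int(train_frac * count) for count in label_counts)


def global_train_size(label_counts, train_frac):
    """int(train_frac * X), where X is the total number of samples."""
    return int(train_frac * sum(label_counts))


# Hypothetical label counts, NOT the real dataset from this thread.
label_counts = [7, 9, 13, 4]
train_frac = 0.8

a = per_label_train_size(label_counts, train_frac)
target = global_train_size(label_counts, train_frac)
shortfall = target - a

print(f"A = {a}, int(0.8 * X) = {target}, shortfall = {shortfall}")
```

Each int() discards less than one sample per label, so in exact arithmetic the shortfall is at most the number of labels minus one. With 18 intents that would allow up to 17 extra test samples, which seems consistent with the 148 test samples reported earlier in the thread, 7 more than the expected 141.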

Ghostvv · September 4, 2020, 1:46pm

Oh, so we lose training examples. That is a bug.

Ghostvv · September 4, 2020, 1:46pm

Could you please create an issue for that?

duongkstn · September 7, 2020, 3:41am

On my way!

duongkstn · September 7, 2020, 4:19am

The issue has been submitted as #6582.
