Hi,
I want to split my data with the command: rasa data split nlu --training-fraction 0.8, but the number of samples in the test file is not exactly 20% of the total. In fact, it is 21.7%.
I have two questions:
Why does this happen?
Which algorithm does Rasa use internally to split the data? Is it
I have 701 samples (18 distinct intents, 0 entities). I want my test set to have 701 * 20% = 140 or 141 samples, but Rasa output 148 samples. Could you explain the algorithm?
Hi @Ghostvv. After several months, I found the reason why rasa data split does not give the exact number of training samples.
Say we have X samples overall (x1 samples of label l1, x2 samples of label l2, …) and training-fraction is 0.8. (Note: x1 + x2 + … = X).
In the code you sent, the number of training samples is A = int(0.8 * x1) + int(0.8 * x2) + …
Mathematically, A <= int(0.8 * X).
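To make the inequality concrete, here is a minimal Python sketch of the per-label formula described above. It is not Rasa's actual implementation: the function name per_label_train_count and the label counts are made up for illustration.

```python
from typing import Dict


def per_label_train_count(label_counts: Dict[str, int], fraction: float) -> int:
    """Return A: the sum of int(fraction * count) over all labels."""
    return sum(int(fraction * n) for n in label_counts.values())


# Hypothetical label counts, invented for this example (not from a real dataset).
label_counts = {"greet": 7, "goodbye": 9, "affirm": 13, "deny": 4}
fraction = 0.8

total = sum(label_counts.values())                 # X = 33
a = per_label_train_count(label_counts, fraction)  # 5 + 7 + 10 + 3 = 25
global_count = int(fraction * total)               # int(26.4) = 26

print(f"X = {total}, A = {a}, int({fraction} * X) = {global_count}")
print(f"test samples: {total - a} (per-label) vs {total - global_count} (global)")
```

With these made-up counts, rounding down per label leaves 25 training samples instead of 26, so the test set gets 8 samples instead of 7. The same effect, summed over more labels, makes the test set larger than the nominal 20%.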
I am wondering what the upper bound on the difference between A and int(0.8 * X) is.
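One way to bound that difference, as a sketch that treats int() as the mathematical floor and ignores floating-point rounding. Here f is the training fraction and k is the number of distinct labels; both symbols are introduced only for this derivation.

```latex
% Split each per-label product into its integer part and fractional part r_i.
f x_i = \lfloor f x_i \rfloor + r_i, \qquad 0 \le r_i < 1, \qquad i = 1, \dots, k

% Summing over all labels, with A = \sum_i \lfloor f x_i \rfloor and X = \sum_i x_i:
f X = A + \sum_{i=1}^{k} r_i

% A is an integer, so it can be pulled out of the floor:
\lfloor f X \rfloor = A + \Big\lfloor \sum_{i=1}^{k} r_i \Big\rfloor

% The fractional parts sum to less than k, hence:
0 \le \lfloor f X \rfloor - A \le k - 1
```

Under these assumptions the difference is at most the number of labels minus one. For the numbers in this thread (18 intents, 701 samples, 148 test samples), the bound would be 17, and the observed difference is int(0.8 * 701) - (701 - 148) = 560 - 553 = 7, which is consistent with it.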