Comparison between 2 models on RASA 2.0

Version of RASA
rasa 2.0.0rc2

We are trying to compare two different pipelines on the same training data. (Currently, we are using the default training data and test cases provided by Rasa when running rasa init.)

The command used for testing each pipeline is

rasa test nlu --nlu data/nlu.yml --config config_hfp.yml --cross-validation
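
For reference, the two runs look roughly like this. This is just a sketch: config_other.yml and the results/ subfolders are placeholder names for the second pipeline and its output locations, while --folds and --out are standard rasa test nlu options.

# 5-fold cross-validation for the first pipeline, reports written to results/hfp
rasa test nlu --nlu data/nlu.yml --config config_hfp.yml --cross-validation --folds 5 --out results/hfp
# Same evaluation for the second pipeline, written to a separate folder for comparison
rasa test nlu --nlu data/nlu.yml --config config_other.yml --cross-validation --folds 5 --out results/other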

The final command-line output for each configuration run looks like this:

2020-10-20 13:11:56 INFO rasa.test - CV evaluation (n=5)
2020-10-20 13:11:56 INFO rasa.test - Intent evaluation results
2020-10-20 13:11:56 INFO rasa.nlu.test - train Accuracy: 0.977 (0.019)
2020-10-20 13:11:56 INFO rasa.nlu.test - train F1-score: 0.988 (0.010)
2020-10-20 13:11:56 INFO rasa.nlu.test - train Precision: 1.000 (0.000)
2020-10-20 13:11:56 INFO rasa.nlu.test - test Accuracy: 0.719 (0.133)
2020-10-20 13:11:56 INFO rasa.nlu.test - test F1-score: 0.716 (0.127)
2020-10-20 13:11:56 INFO rasa.nlu.test - test Precision: 0.762 (0.122)

Given this result, can we use the F1-score to compare the two pipelines?

Can we conclude that the pipeline with the higher F1-score is the one that we should use for better chatbot performance?

It means that the pipeline with the higher F1 score did better on the set of data you used to test it, not necessarily that it will be better overall. (Evaluating/comparing ML models is a bit tricky. Section 2.8 of this book chapter has some more discussion.)

In general, I recommend collecting user data from testers using Rasa X, annotating it, and using that as your validation data. And, since the way people use language and what they talk about changes over time, I'd re-check with fresh data whenever you're looking at a new pipeline.
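
Roughly, that workflow could look something like the sketch below, where data/validation.yml is just a placeholder name for the annotated user data you hold out from training:

# Train the pipeline on the training data only
rasa train nlu --config config_hfp.yml --nlu data/nlu.yml
# Evaluate the trained model against the held-out, annotated validation set
rasa test nlu --nlu data/validation.yml --out results/hfp_validation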

@rctatman, thank you very much. This is a very useful pointer. Following your best-practice suggestion, we will do multiple iterations of adding more training data and performing k-fold cross-validation each time for comparison. We also came across the rasalit tool on GitHub, which makes it easy to repeat the comparisons.
