Comparison between 2 models on RASA 2.0

Version of RASA
rasa 2.0.0rc2

We are trying to compare two different pipelines on the same training data. (Currently, we are using the default training data and test cases provided by Rasa when running rasa init.)

The command used for testing each pipeline is

rasa test nlu --nlu data/nlu.yml --config config_hfp.yml --cross-validation
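
For reference, the two runs look roughly like this. This is just a sketch: config_other.yml and the results/ subfolders are placeholder names for the second pipeline and its output locations, while --folds and --out are standard rasa test nlu options.

# 5-fold cross-validation for the first pipeline, reports written to results/hfp
rasa test nlu --nlu data/nlu.yml --config config_hfp.yml --cross-validation --folds 5 --out results/hfp
# Same evaluation for the second pipeline, written to a separate folder for comparison
rasa test nlu --nlu data/nlu.yml --config config_other.yml --cross-validation --folds 5 --out results/other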

The final command-line output for each configuration run looks like this:

2020-10-20 13:11:56 INFO rasa.test - CV evaluation (n=5)
2020-10-20 13:11:56 INFO rasa.test - Intent evaluation results
2020-10-20 13:11:56 INFO rasa.nlu.test - train Accuracy: 0.977 (0.019)
2020-10-20 13:11:56 INFO rasa.nlu.test - train F1-score: 0.988 (0.010)
2020-10-20 13:11:56 INFO rasa.nlu.test - train Precision: 1.000 (0.000)
2020-10-20 13:11:56 INFO rasa.nlu.test - test Accuracy: 0.719 (0.133)
2020-10-20 13:11:56 INFO rasa.nlu.test - test F1-score: 0.716 (0.127)
2020-10-20 13:11:56 INFO rasa.nlu.test - test Precision: 0.762 (0.122)

Given this result, can we use the F1-score to compare the two pipelines?

Can we conclude that the pipeline with the higher F1-score is the one that we should use for better chatbot performance?

It means that the pipeline with the higher F1 score did better on the set of data you used to test it, not necessarily that it will be better overall. (Evaluating/comparing ML models is a bit tricky. Section 2.8 of this book chapter has some more discussion.)

In general, I recommend collecting user data from testers using Rasa X, annotating it, and using that as your validation data. And, since the way people use language and what they talk about changes over time, I'd re-check with fresh data whenever you're looking at a new pipeline.
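
Roughly, that workflow could look something like the sketch below, where data/validation.yml is just a placeholder name for the annotated user data you hold out from training:

# Train the pipeline on the training data only
rasa train nlu --config config_hfp.yml --nlu data/nlu.yml
# Evaluate the trained model against the held-out, annotated validation set
rasa test nlu --nlu data/validation.yml --out results/hfp_validation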

@rctatman, thank you very much. This is a very useful pointer. Following your best-practice suggestion, we will do multiple iterations of adding more training data and performing k-fold cross-validation each time for comparison. We also came across the rasalit tool on GitHub, which makes it easy to repeat the comparisons.
