Not sure if this is already available and I’m just not looking in the right place.
I currently have dedicated test sets defined to evaluate F1 score, precision and recall for my NLU-only bot. I also run cross-validation and a random-split test to see how my model performs across a number of iterations.
One of the key business metrics I’m tracking is how many intents the bot gets right, specifically with a confidence above the threshold I’ve tentatively set as the fallback. I currently get this information from the histogram PNG that is generated; however, the image leaves me guessing at the exact number of samples the bot got right.
Is it possible to also output a JSON with the values from which that histogram is generated?
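For context, this is roughly the calculation I’d like to do once the raw values are available. It’s a minimal sketch assuming a hypothetical list of per-example predictions with `intent`, `predicted` and `confidence` fields (the field names are illustrative, not an official export format):

```python
THRESHOLD = 0.7  # my tentatively defined fallback threshold

# Hypothetical per-example predictions (field names are illustrative):
predictions = [
    {"intent": "greet",   "predicted": "greet",   "confidence": 0.92},
    {"intent": "goodbye", "predicted": "goodbye", "confidence": 0.55},
    {"intent": "goodbye", "predicted": "greet",   "confidence": 0.81},
]

# Count examples that were classified correctly AND landed above the
# fallback threshold -- the exact number the histogram only hints at.
correct_above = sum(
    1 for p in predictions
    if p["predicted"] == p["intent"] and p["confidence"] >= THRESHOLD
)
print(correct_above)  # 1 of the 3 examples above qualifies
```

With the histogram values exposed as JSON, this would replace eyeballing the PNG.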
Another question related to bot evaluation: when I run a split test, the support metric shown alongside the F1 score, precision and recall (with weighted and macro averages at the end) appears to be capped at 999. Is this by design? I significantly increased the total number of examples before the split, but the support value remains unchanged. Am I misreading it?
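In case it helps reproduce what I’m seeing: a quick, stdlib-only way to cross-check the support values is to count the per-intent labels in the test split directly and compare against what the report prints. The labels below are stand-ins for my data:

```python
from collections import Counter

# Stand-in for the intent labels in my test split (well over 999 examples)
test_labels = ["greet"] * 1200 + ["goodbye"] * 300

# Per-intent support is just the label count in the evaluated split,
# so it should grow with the test set rather than stop at 999.
support = Counter(test_labels)
print(support["greet"])   # 1200
print(sum(support.values()))  # 1500 total examples
```

If the counted values exceed 999 but the report still shows 999, that would suggest a display issue rather than anything about my data.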
Many thanks for your help in advance and have a great weekend ahead!