Looking for specific metrics around successful hits when evaluating an NLU-only model

Not sure if this is already available and I’m just not looking in the right place.

I currently have dedicated test sets defined to evaluate F1 score, precision, and recall for my NLU-only bot. I also run cross-validation and a random-split test to see how my model performs over a number of iterations.

One of the key business metrics I’m looking at is how many intents the bot gets right, specifically with confidence above the threshold I’ve tentatively defined for the fallback. I currently get this information from the histogram PNG that is generated; however, the image leaves me guessing at the exact number of samples the bot got right.

Is it possible to also include a JSON with the values for how that histogram is generated?
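To make the metric concrete, here’s a tiny sketch of what I’m after. The records and the 0.7 threshold are made-up illustration data, not real Rasa output:

```python
# Assumed tentative fallback threshold (illustrative value).
FALLBACK_THRESHOLD = 0.7

# Made-up sample predictions: true intent, predicted intent, confidence.
predictions = [
    {"true": "greet", "predicted": "greet", "confidence": 0.93},
    {"true": "goodbye", "predicted": "goodbye", "confidence": 0.55},
    {"true": "affirm", "predicted": "deny", "confidence": 0.81},
]

# Count predictions that are both correct AND at or above the threshold --
# the exact number the histogram only lets me eyeball.
confident_hits = sum(
    1
    for p in predictions
    if p["predicted"] == p["true"] and p["confidence"] >= FALLBACK_THRESHOLD
)
print(confident_hits)  # only the first record qualifies
```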

Another question related to bot evaluation: when I run a split test, the support metric shown alongside the F1 score, precision, and recall (weighted and macro averages at the end) appears to be capped at 999. Is this by design? I significantly increased the total number of examples before the split, but the support value remains unchanged. Am I misreading it?

Many thanks for your help in advance and have a great weekend ahead!

Bumping this up for visibility. Hoping I can get some help on this.

One thing I’m wondering: would this perhaps be easier if you instead took the trained modelling pipeline and used it in a Jupyter notebook to make predictions?

This question has actually convinced me to write a small blog post about this. I’ll follow up with some code examples so you can see how the tools in scikit-learn can help you here.
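As a preview, this is the kind of thing scikit-learn already gives you once you have true and predicted intent labels side by side (the labels below are toy data for illustration):

```python
from sklearn.metrics import classification_report

# Toy example: true vs. predicted intent labels for a handful of messages.
y_true = ["greet", "goodbye", "greet", "affirm", "goodbye"]
y_pred = ["greet", "goodbye", "affirm", "affirm", "goodbye"]

# classification_report prints per-intent precision, recall, F1, and support,
# plus macro and weighted averages -- the same metrics rasa test nlu reports.
print(classification_report(y_true, y_pred, zero_division=0))
```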

If you add the `--successes <filename>` CLI argument to `rasa test nlu`, you will get a JSON list of the successes and their confidences, similar to the list of failures output by default!

Thank you @koaning - Unfortunately, I don’t have a lot of experience with Jupyter Notebooks. What do you mean by ‘trained modelling pipeline’ in this case?

I look forward to the blog post with specific code examples; thank you for this!

Thank you also @erohmensing – I’m currently making updates to my intents and training examples, and I have this command saved for the next time I run an evaluation. I hope to have an answer for you by tomorrow EOD.