Rasa test vs. rasa test core

I run these two commands:

rasa test --stories tests/

rasa test core --stories tests/

The YAML files contain 16 test stories. If I run rasa test, 8 of them fail. If I run rasa test core, none of the stories fail.

I know that rasa test runs NLU tests as well, but I would assume that testing the stories I have specified in the test_*.yml files would be identical for both commands.
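
For context, my test stories follow the usual test_*.yml format, roughly like this (the intent and action names below are just placeholders, not my actual data):

stories:
- story: greet and ask for help (placeholder)
  steps:
  - user: |
      hello there
    intent: greet
  - action: utter_greet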

Any idea why I got different results?

This is what I see when I run rasa test:

Processed story blocks: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 467.92it/s, # trackers=1]
2021-11-07 14:58:46 INFO     rasa.core.test  - Evaluating 16 stories
Progress:
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16/16 [00:21<00:00,  1.35s/it]
2021-11-07 14:59:08 INFO     rasa.core.test  - Finished collecting predictions.
2021-11-07 14:59:08 INFO     rasa.core.test  - Evaluation Results on END-TO-END level:
2021-11-07 14:59:08 INFO     rasa.core.test  -  Correct:          8 / 16
2021-11-07 14:59:08 INFO     rasa.core.test  -  Accuracy:         0.500
2021-11-07 14:59:08 INFO     rasa.core.test  - Stories report saved to results/story_report.json.
2021-11-07 14:59:08 INFO     rasa.nlu.test  - Evaluation for entity extractor: TEDPolicy 
2021-11-07 14:59:08 INFO     rasa.nlu.test  - Classification report saved to results/TEDPolicy_report.json.
2021-11-07 14:59:08 INFO     rasa.nlu.test  - Incorrect entity predictions saved to results/TEDPolicy_errors.json.
2021-11-07 14:59:08 INFO     rasa.utils.plotting  - Confusion matrix, without normalization: 
[[ 0  0  0  0 26  0  0]
 [ 0  0  0  0 27  0  0]
 [ 0  0  0  0  8  0  0]
 [ 0  0  0  0 10  0  0]
 [ 0  0  0  0 70  0  0]
 [ 0  0  0  0  6  0  0]
 [ 0  0  0  0 14  0  0]]

And this is what I see when I run rasa test core:

Processed story blocks: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 303.18it/s, # trackers=1]
2021-11-07 15:02:39 INFO     rasa.core.test  - Evaluating 16 stories
Progress:
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16/16 [00:21<00:00,  1.35s/it]
2021-11-07 15:03:01 INFO     rasa.core.test  - Finished collecting predictions.
2021-11-07 15:03:01 INFO     rasa.core.test  - Evaluation Results on CONVERSATION level:
2021-11-07 15:03:01 INFO     rasa.core.test  -  Correct:          16 / 16
2021-11-07 15:03:01 INFO     rasa.core.test  -  Accuracy:         1.000
2021-11-07 15:03:01 INFO     rasa.core.test  - Stories report saved to results/story_report.json.
2021-11-07 15:03:01 INFO     rasa.nlu.test  - Evaluation for entity extractor: TEDPolicy 
2021-11-07 15:03:01 INFO     rasa.nlu.test  - Classification report saved to results/TEDPolicy_report.json.
2021-11-07 15:03:01 INFO     rasa.nlu.test  - Incorrect entity predictions saved to results/TEDPolicy_errors.json.
2021-11-07 15:03:01 INFO     rasa.utils.plotting  - Confusion matrix, without normalization: 
[[ 0  0  0  0 26  0  0]
 [ 0  0  0  0 27  0  0]
 [ 0  0  0  0  8  0  0]
 [ 0  0  0  0 10  0  0]
 [ 0  0  0  0 70  0  0]
 [ 0  0  0  0  6  0  0]
 [ 0  0  0  0 14  0  0]]

@endreb

I know you know these points :slight_smile:

To evaluate a model on your test data, run:

rasa test

This will test your latest trained model on any end-to-end stories you have defined in files with the test_ prefix.

If you want to evaluate the dialogue and NLU models separately, you can use the commands below:

rasa test core

Note: rasa test core tests Rasa Core models using your test stories.

Your confusion matrix is identical for both commands, so I guess there is nothing to worry about; it just reflects the test stories you mentioned.

Summary Points:

  1. Test stories are written in the form of exemplary conversations to check whether the bot behaves as expected. You can write them in a test stories file in the project folder (see the sketch after this list). Once you have a good set of test cases, you can run rasa test.

  2. Alternatively, you can run rasa test core --stories test_stories.yml --out results. This command will generate a report about failed stories and a confusion matrix, regardless of whether the stories failed or not.
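
For illustration, here is a rough sketch of a test story in the format described in the Testing Your Assistant doc; the intent, action, and entity names are made up, so adapt them to your own domain:

stories:
- story: happy path with an entity (hypothetical names)
  steps:
  - user: |
      show me [chinese](cuisine) restaurants
    intent: request_restaurant
  - action: utter_ask_location

As far as I understand the docs, the user: text is what gets run through NLU during an end-to-end evaluation, while the intent: annotation is what a conversation-level evaluation uses directly.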

Hi @nik202 ,

thanks for the reply. The problem I have is that 16/16 stories are correct when I run rasa test core, but only 8/16 stories are correct when I run rasa test. I have a folder named tests, and inside that folder I have 5 YAML files containing 16 stories. What I don’t get is that the confusion matrix is the same, so how come the success rates of the stories are not?

Do you know what “Evaluation Results on CONVERSATION level” means vs. “Evaluation Results on END-TO-END level”? I would guess that when I run rasa test, the stories are somehow evaluated differently than when I run rasa test core, but I don’t know what the problem is.

@endreb I tried to explain everything in the post above in as much detail and as briefly as I could, my friend. If you need more details, please see the Testing Your Assistant doc, or look at the result files you generated while running the above commands. Good luck!
