Using 'rasa test' to evaluate end-to-end stories

I’m getting inconsistent results when running ‘rasa test’ from within a docker container that’s running a rasa server versus ‘rasa test’ from my host (Mac) command line.

The file structure for both of these situations is exactly the same, except for the location of the actual rasa directory, which is where I run ‘rasa test’ in both situations. The data and all of the files are almost the same, with only the endpoints.yml file being changed so the dockerized rasa points to a dockerized action_server instead of localhost:5055.
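For reference, the only difference is the action endpoint URL; the Docker service name below is just a placeholder for whatever the action server is called in my docker-compose file:

# endpoints.yml on my Mac
action_endpoint:
  url: "http://localhost:5055/webhook"

# endpoints.yml in the docker setup
action_endpoint:
  url: "http://action_server:5055/webhook"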

When I run ‘rasa test’ locally, I get a result that looks like this:

2020-06-09 14:51:30 INFO     rasa.core.test  - Evaluation Results on CONVERSATION level:
2020-06-09 14:51:30 INFO     rasa.core.test  - 	Correct:          86 / 86
2020-06-09 14:51:30 INFO     rasa.core.test  - 	F1-Score:         1.000
2020-06-09 14:51:30 INFO     rasa.core.test  - 	Precision:        1.000
2020-06-09 14:51:30 INFO     rasa.core.test  - 	Accuracy:         1.000
2020-06-09 14:51:30 INFO     rasa.core.test  - 	In-data fraction: 0.929
2020-06-09 14:51:30 INFO     rasa.core.test  - Evaluation Results on ACTION level:
2020-06-09 14:51:30 INFO     rasa.core.test  - 	Correct:          645 / 645
2020-06-09 14:51:30 INFO     rasa.core.test  - 	F1-Score:         1.000
2020-06-09 14:51:30 INFO     rasa.core.test  - 	Precision:        1.000
2020-06-09 14:51:30 INFO     rasa.core.test  - 	Accuracy:         1.000
2020-06-09 14:51:30 INFO     rasa.core.test  - 	In-data fraction: 0.929

The end-to-end tests don’t appear to be evaluated at all, even though they exist in tests/conversation_tests.md.

When I bash into the docker container running rasa with the command ‘docker exec -it chatbot_container /bin/bash’ and run ‘rasa test,’ I get the following results:

2020-06-09 19:59:27 INFO     rasa.core.test  - Evaluation Results on END-TO-END level:
2020-06-09 19:59:27 INFO     rasa.core.test  - 	Correct:          0 / 1
2020-06-09 19:59:27 INFO     rasa.core.test  - 	F1-Score:         0.000
2020-06-09 19:59:27 INFO     rasa.core.test  - 	Precision:        0.000
2020-06-09 19:59:27 INFO     rasa.core.test  - 	Accuracy:         0.000
2020-06-09 19:59:27 INFO     rasa.core.test  - 	In-data fraction: 0
2020-06-09 19:59:27 INFO     rasa.core.test  - Evaluation Results on ACTION level:
2020-06-09 19:59:27 INFO     rasa.core.test  - 	Correct:          1 / 2
2020-06-09 19:59:27 INFO     rasa.core.test  - 	F1-Score:         0.667
2020-06-09 19:59:27 INFO     rasa.core.test  - 	Precision:        0.667
2020-06-09 19:59:27 INFO     rasa.core.test  - 	Accuracy:         0.667
2020-06-09 19:59:27 INFO     rasa.core.test  - 	In-data fraction: 0

This time, I don’t see any results for the CONVERSATION-level tests, but I do see results for the end-to-end tests.

Unfortunately, in the docker scenario, the end-to-end tests always fail for me, even though I’m pretty positive that they should pass. In this example, my tests/conversation_tests.md file has the following story:

## Hi
* greet: hi
    - action_greet

I know that this story works when I talk to my bot through the front end, through Rasa-X, and through api calls using Postman.

So overall, I’m fairly sure that I’m calling ‘rasa test’ incorrectly. But I’m not sure why I’m getting a difference when I use this command locally versus from within a container, or how to fix this issue. Thanks in advance for the help.

hi @plurn,

  • how are you calling rasa test in both cases?
  • which Rasa Open Source version are you using?
  • what does your project layout look like?

Thanks :pray:

My project has the default layout, with stories and nlu data in a data folder, the trained model in the models folder, and the domain, config, etc. in the root folder. Since the project is in the default layout, I was just running ‘rasa test’ with no flags.
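For completeness: I believe the end-to-end stories can also be targeted explicitly with something like the following, though I’m going from memory of the 1.x CLI flags, so treat this as a sketch:

rasa test core --stories tests/conversation_tests.md --e2e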

I just realized that there is a discrepancy between the Rasa versions in my dockerized and local rasa instances, which I’m pretty sure was causing the differences in results.
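For anyone hitting the same thing, comparing the versions directly makes the mismatch obvious (assuming rasa is on the PATH inside the container):

rasa --version
docker exec -it chatbot_container rasa --version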

I’ve also figured out what was causing the failures. I was accidentally looking in the intent_errors.json file instead of the failed_stories.md results file.

So I think I’ve pretty much figured out all of the issues regarding my initial question. But since I have you here, I have a question about why my stories are failing.

The error in my stories looks like this:

* greet: hi
    - action_greet
    - action_listen   <!-- predicted: action_default_ask_affirmation -->

I’m not sure why action_default_ask_affirmation is being predicted, especially because the bot successfully calls the action_greet action, and when I converse with the bot myself it doesn’t behave as if the action_default_ask_affirmation fallback action is triggered when I type ‘hi’. I’m assuming it has to do with the fact that action_greet returns [UserUtteranceReverted()], because I followed this example from a rasa tutorial that calls action_greet using the Mapping Policy. Here is my code for action_greet:

from rasa_sdk import Action
from rasa_sdk.events import UserUtteranceReverted


class ActionGreetUser(Action):
    def name(self):
        return "action_greet"

    def run(self, dispatcher, tracker, domain):
        # Greet the user, then rewind so this turn doesn't affect the next prediction
        dispatcher.utter_message(template="utter_greet")
        return [UserUtteranceReverted()]

And here is the relevant part of the domain file:

intents:
- greet:
    triggers: action_greet
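In case it’s relevant, the triggers mapping only takes effect because MappingPolicy is listed in my config.yml; the policies section looks roughly like this (the other entries are just an example of what could sit alongside it):

policies:
  - name: MappingPolicy
  - name: MemoizationPolicy
  - name: TEDPolicy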

The problem is that custom actions aren’t executed during end-to-end testing (see the note here).

I think your story should look like this:

* greet: hi
    - action_greet
    - rewind <!-- usually returned by your custom action but you have to manually specify it for the tests -->