A way to compare different NLU Pipelines with new test data

Hi all,

I am trying to find a way to evaluate two different NLU pipelines with new test data. Rasa provides a handy evaluation tool, described in the comparing-nlu-pipelines section (Testing Your Assistant), but it relies solely on splitting a single nlu.yaml file into training (80%) and test (20%) sets. I want to use 100% of my nlu.yaml for training, since I already have separate test data prepared.

It seems I can use some of the command-line arguments for this (Command Line Interface), but I am not sure which ones to use to supply the new test data and compare the pipelines at the same time.

Maybe like this?

rasa test nlu --nlu data/nlu.yml data/new_test_data.yml --config config_1.yml config_2.yml

Please let me know if there is a way for this. Thanks.

@miner Well, using 100% of the data for training is not the usual convention when training machine learning models. But if you need to, you can use your NLU file, write the Python code in a Jupyter notebook, and train the model there. I hope you know the convention.

@miner Yes, please try the experiment as you have shown in the command above.


Yes, the command is correct.

But as Nik pointed out, the usual standard is to take 80% of your data for training and 20% for testing. If your bot’s behaviour changes drastically because of 20% less data, it means you need more and/or better data in the first place.
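If you do want each pipeline trained on all of data/nlu.yml and scored on the separate test file, one possible approach is to train and test each config on its own. The sketch below only builds and prints the commands. It assumes the `--config`, `--nlu`, `--out`, `--fixed-model-name` and `--model` options of `rasa train nlu` and `rasa test nlu` behave as described in the CLI docs, and that the trained model is saved as `<name>.tar.gz`. The `build_commands` helper and the models/ and results/ directories are made up for illustration.

```python
import shlex
from pathlib import Path


def build_commands(config, train_data, test_data,
                   model_dir="models", results_dir="results"):
    """Return (train_cmd, test_cmd) argument lists for one pipeline config.

    Hypothetical helper: directory names and the model file name are
    assumptions, not something Rasa requires.
    """
    name = Path(config).stem  # e.g. "config_1" for "config_1.yml"
    # Assumption: --fixed-model-name NAME writes NAME.tar.gz into --out.
    model_path = f"{model_dir}/{name}.tar.gz"
    train_cmd = [
        "rasa", "train", "nlu",
        "--config", config,
        "--nlu", train_data,          # 100% of the training data
        "--out", model_dir,
        "--fixed-model-name", name,
    ]
    test_cmd = [
        "rasa", "test", "nlu",
        "--nlu", test_data,           # the separately prepared test data
        "--model", model_path,
        "--out", f"{results_dir}/{name}",
    ]
    return train_cmd, test_cmd


# Print the commands for both pipelines so they can be reviewed first.
for config in ["config_1.yml", "config_2.yml"]:
    for cmd in build_commands(config, "data/nlu.yml", "data/new_test_data.yml"):
        print(shlex.join(cmd))
```

To actually run a command, pass the list to `subprocess.run(cmd, check=True)`. Each pipeline's reports then end up in its own results folder, where they can be compared side by side.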

I would also recommend setting a random_seed. If you want to accurately compare two pipeline components or policies across multiple training runs, you can set a seed for DIET, ResponseSelector, and TED, for example:

- name: DIETClassifier
  random_seed: 1
  # other parameters

I also suggest using TensorBoard to make comparisons and choose an optimal configuration. This can also be done for DIET, ResponseSelector, and TED, for example:

- name: DIETClassifier
  # other parameters
  evaluate_on_number_of_examples: 200
  evaluate_every_number_of_epochs: 5
  tensorboard_log_directory: ./tensorboard/DIET
  tensorboard_log_level: epoch

Try to set evaluate_on_number_of_examples to about 20% of your total number of examples (of course, this means those examples will not be used for training, so you will need to provide a few more examples). You can use this script I wrote to count the number of examples you have.

