More information on rasa core model training process

Where can I find more information about the training process of rasa core?

Unfortunately, I find this whole process very opaque at the moment and suspect that it might be hurting our bot's performance. Here are some specific issues and questions I encountered:

  • What exactly happens during the data preparation (augmentation?) phase?
  • Is the graph created with the visualization facility the same one that is used for training, or is it completely unrelated?
  • Do more common examples in the stories file appear more frequently in the training data? Or is the training data sampled uniformly from the constructed graph?
  • Why is the data preparation/augmentation phase called twice? I get the phase starting with “Creating states and action examples from collected trackers (by MaxHistoryTrackerFeaturizer)…” twice per training, and it takes a very long time to finish for me, making training very slow. Also, is 81.29it/s a realistic number, or are my stories processed very slowly?
  • I use the sklearn policy with grid search, and the CV scores are always > 98% accuracy, but in practice the bot performs quite badly (though NLU is mostly correct). I suspect that data leaks from the training set into the validation set (due to repeated samples?). What KPI can I use to get a realistic expectation of how well the rasa core model predicts? Or is the only way to interact with the bot and see what happens?
  • Is there an easy way to replace the training process with a custom training process?

Cheers, Benjamin

@tmbo can you give some answers to these points? Also, roughly how many stories do you have such that training is very slow (and what does “very slow” mean)?

I currently have 175 stories and training takes 4 min, most of which is spent “creating states and action examples from collected trackers”. This is not prohibitively slow, but my impression is that this step scales worse than O(N), so if I have many more stories one day, it may become too slow.

...
INFO:rasa_core.featurizers:Creating states and action examples from collected trackers (by MaxHistoryTrackerFeaturizer)...
Processed trackers: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 8074/8074 [01:19<00:00, 100.96it/s, # actions=7439]
INFO:rasa_core.featurizers:Created 7439 action examples.
Processed actions: 7439it [00:00, 8181.70it/s, # examples=7243]
INFO:rasa_core.policies.memoization:Memorized 7243 unique action examples.
INFO:rasa_core.featurizers:Creating states and action examples from collected trackers (by MaxHistoryTrackerFeaturizer)...
Processed trackers: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 8074/8074 [01:20<00:00, 100.50it/s, # actions=7439]
INFO:rasa_core.featurizers:Created 7439 action examples.
Fitting 5 folds for each of 12 candidates, totalling 60 fits
[Parallel(n_jobs=1)]: Done  60 out of  60 | elapsed:  1.1min finished
Best params: {'C': 1.0, 'penalty': 'l1'}
INFO:rasa_core.policies.sklearn_policy:Done fitting sklearn policy model
INFO:rasa_core.policies.sklearn_policy:Cross validation score: 0.98387
INFO:rasa_core.agent:Model directory models/policy/init exists and contains old model files. All files will be overwritten.

real	4m5.030s
user	6m53.884s
sys	0m17.836s

@akelad @tmbo any updates on this?

Hi,

  • during the data generation phase, your stories are read from the file and converted to DialogueStateTracker objects; this phase includes gluing together checkpointed stories. The augmentation phase concatenates full stories end to end, creating longer stories. This is useful if you have several short training stories but expect them to be combined during a real conversation;
  • the graph is created from the same DialogueStateTracker objects, but it can be slightly different due to optimizations for visual clarity (@tmbo?);
  • the training data is exactly the same as your stories; there is no sampling. It can contain additional stories if you turn on augmentation;
  • only featurization is called twice, because you use two policies, which in principle could have different featurizers;
  • how close are your “in practice” conversations to the stories you provided for training? The validation set is cut from your provided stories, so if they are “similar”, you will see high validation accuracy;
  • you need to subclass Policy and override its train(...) method (sketch below)
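
A minimal sketch of that last point, assuming the rasa_core 0.x Policy interface (train / predict_action_probabilities); exact method signatures may differ between versions:

from rasa_core.policies.policy import Policy

class MyCustomPolicy(Policy):
    def train(self, training_trackers, domain, **kwargs):
        # Turn the trackers into state/action training examples using the
        # configured featurizer, then fit any model you like on them.
        training_data = self.featurize_for_training(training_trackers, domain)
        X, y = training_data.X, training_data.y
        # ... fit your custom model on X and y here ...

    def predict_action_probabilities(self, tracker, domain):
        # Must return one probability per action in the domain; this
        # placeholder predicts a uniform distribution.
        return [1.0 / domain.num_actions] * domain.num_actions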

Thanks for the answers, Vladimir. The content could be useful in the docs.

I have some follow-up questions:

  • We got the impression that if a checkpoint is referenced that does not exist (say, because of a typo), the corresponding story is dropped without warning. Is this correct?

  • Regarding validation scores: If I have story blocks A, B and C, could augmentation result in A+B landing in train and A+C landing in test? If so, when is the right time to split train/test to avoid this leak?

Regarding sampling stories from the graph: I’m still not quite certain how this works. To get a better grip, I tried this stories.md:


## A->B->C
* intent_A
    - action_A
    - action_B
* intent_C
    - action_C

## A->B->D
* intent_A
    - action_A
    - action_B
* intent_D
    - action_D

## A->B->E
* intent_A
    - action_A
    - action_B
* intent_E
    - action_E

## A->B->F
* intent_A
    - action_A
    - action_B
* intent_F
    - action_F

## A->B->G
* intent_A
    - action_A
    - action_B
* intent_G
    - action_G

## A->H
* intent_A
    - action_A
    - action_H

As you can see, A is followed by B 5 times and by H once. I would hope that my model then predicts that B is more likely to follow A than H is.

When I trained a simple LogisticRegression, this is my prediction after intent_A:

[{'confidence': 0.4729685936620181, 'name': 'action_B'}, {'confidence': 0.4729685936620181, 'name': 'action_H'}, {'confidence': 0.021668934602169575, 'name': 'action_listen'}]
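
A sketch of such a setup (assuming rasa_core 0.x, where SklearnPolicy takes an optional sklearn estimator via its model argument; argument names may differ by version):

from sklearn.linear_model import LogisticRegression
from rasa_core.agent import Agent
from rasa_core.policies.sklearn_policy import SklearnPolicy

# Train the dialogue policy on the stories above with a plain
# LogisticRegression instead of the default grid-searched model.
agent = Agent("domain.yml",
              policies=[SklearnPolicy(model=LogisticRegression())])
training_trackers = agent.load_data("stories.md")
agent.train(training_trackers)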

This suggests that B and H are equally likely to be predicted after A, which is consistent with uniform sampling from the story graph (i.e. without weighted edges).

I guess this would be desired behavior if you don’t know anything about the distribution of your stories. In our case, however, we have a lot of historical training data that we want to use to enhance our stories, and it seems this wouldn’t really help, since more frequent dialogues would get the same weight as rare ones. With this sampling scheme, there is also the danger that a single incorrectly labelled dialogue can completely disrupt the predictions, even if there are 100 correctly labelled dialogues.

Our guess is that we need to override agent.load_data and make sure ourselves that more frequent stories are sampled more often, but this doesn’t seem like an ideal solution. Can you help us with that?
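
Concretely, something like this hypothetical oversampling sketch (not an official rasa_core API; it assumes an existing agent, and story_counts stands in for frequencies derived from our historical data):

# Repeat each training tracker according to how often the corresponding
# dialogue occurs in our historical data, so frequent stories carry more
# weight during training.
training_trackers = agent.load_data("stories.md")
story_counts = [...]  # assumed: one frequency per tracker, from our own logs

weighted_trackers = []
for tracker, count in zip(training_trackers, story_counts):
    weighted_trackers.extend([tracker] * count)

agent.train(weighted_trackers)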

No, if a starting checkpoint appears only once (i.e., it is never connected to another story), then rasa_core emits a warning after generation.

In your example, the equal probabilities happen not because of sampling but because, by default, we deduplicate the training data during featurization. Could you try the same test with remove_duplicates=False? https://github.com/RasaHQ/rasa_core/blob/3d96fd85fc86c90af400628803c53ac1bae51565/rasa_core/featurizers.py#L546
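
For example (a sketch based on the constructor at the linked line; argument names may differ in other versions):

from rasa_core.agent import Agent
from rasa_core.featurizers import (
    MaxHistoryTrackerFeaturizer, BinarySingleStateFeaturizer)
from rasa_core.policies.sklearn_policy import SklearnPolicy

# Keep repeated state/action examples instead of deduplicating them.
featurizer = MaxHistoryTrackerFeaturizer(
    BinarySingleStateFeaturizer(),
    max_history=5,
    remove_duplicates=False)

agent = Agent("domain.yml",
              policies=[SklearnPolicy(featurizer=featurizer)])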

Also, please note that there is deduplication during the generation of trackers: https://github.com/RasaHQ/rasa_core/blob/3d96fd85fc86c90af400628803c53ac1bae51565/rasa_core/training/generator.py#L153

If you don’t perform the later deduplication, training times can be quite long.
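
That generation-time stage can also be controlled when loading the data; a sketch assuming the rasa_core 0.x Agent.load_data signature (verify the argument names against your version):

# Assumes an existing agent, as above.
training_trackers = agent.load_data(
    "stories.md",
    remove_duplicates=False,   # tracker-level deduplication during generation
    augmentation_factor=20)    # the story gluing discussed earlier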

Thank you for your answers.

Indeed, there is a warning; it just seems to get swallowed in the Jupyter notebook.

Setting MaxHistoryTrackerFeaturizer’s remove_duplicates to False solved the problem above, thanks. I guess that this will, however, exacerbate the problem of overlap between train and test.

About train and test I do not know; I would suggest creating separate test stories in a separate file and testing the algorithm on them.
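
For example, with the evaluation script that ships with rasa_core (flag names vary between versions, so check python -m rasa_core.evaluate --help first):

python -m rasa_core.evaluate -d models/policy/init -s data/test_stories.md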

Okay, thanks.