Rasa 2: Question about inconsistency between the number of running trainings reported by the status endpoint and the trainings actually running

I’ve got one question related to the /model/train and /status APIs in the context of a lost session. If I initiate a training via the REST /model/train endpoint, a call to the /status endpoint returns a response which contains an entry like: "num_active_training_jobs": 1

If I cancel/drop the initiating HTTP session, the training will continue, but the /status response will now contain "num_active_training_jobs": 0

On the Rasa server console, however, I can see that the training is still running.

If I call /model/train again after the training has completed, I get a 200 success response with the zipped model. On the server console I can see that it is the already trained model (which is perfect).

Question: what should actually happen when an HTTP session dies and the training is not yet completed? Should the training also stop and num_active_training_jobs be decreased accordingly? Or should it continue, with num_active_training_jobs staying the same? Currently the training is not stopped, but num_active_training_jobs is decreased.

Hi @HDotzaue, thanks for bringing that up. It’s true that the behaviour should be more consistent: num_active_training_jobs should always indicate how many training jobs are active, even if the request that started one of them has been cancelled, as in your case.

The part of the code that handles the training in the endpoint is here. I have not tested this so I am not 100% sure, but I believe there may be situations where the call to await loop.run_in_executor() finishes early, but the training function is still left running in the child process. This would lead to the num_active_training_jobs variable being decreased (see the finally block), but the job continuing to run.
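To illustrate the suspected mechanism, here is a minimal, self-contained sketch. It is not Rasa's actual code: `state`, `fake_training`, `train_endpoint` and `simulate_dropped_request` are hypothetical names, and a thread pool stands in for the child process. The only point it tries to show is that cancelling the coroutine that awaits `loop.run_in_executor()` runs the `finally` block right away, while the work already handed to the executor keeps going.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the server's shared state (not Rasa's real object).
state = {"num_active_training_jobs": 0, "training_finished": False}


def fake_training(duration: float) -> None:
    """Blocking stand-in for the real training function."""
    time.sleep(duration)
    state["training_finished"] = True


async def train_endpoint(executor: ThreadPoolExecutor, duration: float) -> None:
    """Mimics the assumed shape of the handler: count up, await, count down."""
    state["num_active_training_jobs"] += 1
    try:
        loop = asyncio.get_running_loop()
        await loop.run_in_executor(executor, fake_training, duration)
    finally:
        # Runs on cancellation too, even though the worker is still busy.
        state["num_active_training_jobs"] -= 1


async def simulate_dropped_request() -> tuple:
    """Start a 'training request', then cancel it as a dropped session would."""
    with ThreadPoolExecutor(max_workers=1) as executor:
        request = asyncio.create_task(train_endpoint(executor, 0.3))
        await asyncio.sleep(0.1)  # give the request time to start training
        during = state["num_active_training_jobs"]

        request.cancel()  # simulate the client going away
        try:
            await request
        except asyncio.CancelledError:
            pass

        after_cancel = state["num_active_training_jobs"]
        finished_at_cancel = state["training_finished"]
    # Leaving the `with` block waits for the worker thread to finish.
    return during, after_cancel, finished_at_cancel, state["training_finished"]


if __name__ == "__main__":
    print(asyncio.run(simulate_dropped_request()))
```

In this sketch the counter reads 1 while the request is alive and 0 right after the cancellation, although the training function has not finished yet and only completes later. If the real endpoint behaves the same way, one possible direction for a fix would be to decrement the counter from a completion callback on the executor's future instead of from the request handler's `finally` block, but that is only a suggestion and untested against Rasa itself.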

Feel free to open an issue, or a PR, so that we can address this problem in the future. Thanks!