Suppose the following diagram represents the GraphSchema nodes for the NLU pipeline components that need to be trained, connected according to the “needs” relation:
The motivation for this diagram is to find out which components can be run in parallel using multi-threading or multi-node clusters (either Dask or Ray).
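Independently of Rasa, the parallelizable layers of such a “needs” graph can be read off with a topological sort. Here is a minimal sketch with networkx, using made-up node names that only loosely mirror the schema:

```python
# Minimal sketch (hypothetical node names): build the "needs" DAG and
# group nodes into layers that could run concurrently.
import networkx as nx

# Edge (a, b) means "b needs a", i.e. a must finish before b can start.
needs = [
    ("tokenizer", "featurizer"),
    ("featurizer", "DIETClassifier"),
    ("story_provider", "TEDPolicy"),
    ("story_provider", "RulePolicy"),
]

graph = nx.DiGraph(needs)

# Each "generation" contains nodes whose dependencies are all satisfied,
# so every node within one generation could be scheduled in parallel.
for layer, nodes in enumerate(nx.topological_generations(graph)):
    print(f"layer {layer}: can run in parallel -> {sorted(nodes)}")
```

The width of the widest generation is then an upper bound on how much useful parallelism the schedule can exploit.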
Testing with different numbers of threads for training revealed no significant improvement in training time beyond 4 threads. My assumption for why this is the case is the 4 policies that are trained in parallel on the right side of the diagram.
However, is there any room for improvement with more than 4 threads?
Is my assumption about why 4 threads are optimal logical and/or correct?
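For what it’s worth, saturation at 4 threads is exactly what you would expect if the widest layer of the graph contains 4 independent heavy tasks. A toy Dask experiment (synthetic sleep tasks standing in for the real components, not actual Rasa training) reproduces the effect:

```python
# Toy experiment (synthetic tasks, not the real Rasa components):
# with 4 independent heavy branches, wall time stops improving past 4 workers.
import time
import dask
from dask import delayed

@delayed
def heavy_task(name, seconds=2):
    time.sleep(seconds)  # stand-in for training one policy
    return name

branches = [heavy_task(f"policy_{i}") for i in range(4)]

for workers in (1, 2, 4, 8):
    start = time.perf_counter()
    dask.compute(*branches, scheduler="threads", num_workers=workers)
    print(f"{workers} worker(s): {time.perf_counter() - start:.1f}s")
```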
The diagram is constructed using this output of GraphSchema:
output_graph_schema.txt (7.6 KB)
Note: I have disabled the cache to test the full training optimization.
Btw: what are the “Resource” relations in the file above?
One thing I fear, though: both DIET and TED will try to take all the CPUs of the worker machine in order to train quicker. If these two algorithms were to train at the same time, I can imagine that the total training time might decrease. Most of the other components besides DIET/TED don’t take much compute time, so I’m wondering how effective this approach would be.
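One knob that might help here, assuming both components run on the default TensorFlow backend, is capping each trainer’s thread pools so that DIET and TED don’t both grab every core. A sketch of the idea (the core counts are just assumptions):

```python
# Sketch: cap TensorFlow's thread pools before any model is created, so two
# trainers sharing one machine do not both oversubscribe all CPUs.
# (Assumes the default TensorFlow backend; must run before TF ops exist.)
import tensorflow as tf

CORES_PER_TRAINER = 4  # assumption: half of an 8-core worker per trainer

tf.config.threading.set_intra_op_parallelism_threads(CORES_PER_TRAINER)
tf.config.threading.set_inter_op_parallelism_threads(2)
```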
The decrease in training time is significant: around 12 seconds (from 36 s to 24 s) on a local machine, and on slower Azure VMs it decreased roughly from 55 s to 42 s. Current Rasa training only uses a single-threaded Dask graph runner; just using two threads instead of one decreases the training time by a whole 8 seconds.
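For anyone who wants to reproduce the effect outside of Rasa, the difference boils down to running the same task graph on Dask’s synchronous scheduler versus the threaded one. A sketch with a hand-written graph dict (toy keys and tasks, not the real schema):

```python
# Sketch: the same low-level Dask graph executed single-threaded vs. with a
# thread pool. Keys and tasks here are toy placeholders, not the real schema.
import time
import dask
import dask.threaded

def slow(x):
    time.sleep(1)
    return x

dsk = {
    "a": (slow, 1),
    "b": (slow, 2),
    "total": (sum, ["a", "b"]),
}

dask.get(dsk, "total")                          # synchronous: ~2 s
dask.threaded.get(dsk, "total", num_workers=2)  # threaded:    ~1 s
```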
Since this experiment was for my thesis, I used only the initial bare-minimum dataset provided by Rasa. The diagram above will be included in the thesis, but I don’t know if it’s correct.
The biggest downside is that I had to disable caching, not only because I wanted to test the full training, but also because the cache contains an SQLAlchemy object which is not serializable between nodes in the Dask/Ray cluster. I suppose that is something that can be addressed in future work.
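On the serialization issue, one common workaround (not something Rasa does today, just a sketch) is to ship only the database URL across the cluster and rebuild the SQLAlchemy engine lazily on each worker instead of pickling it:

```python
# Sketch of a workaround (not current Rasa behaviour): keep only the
# picklable DB URL, and build the SQLAlchemy engine lazily per worker.
import sqlalchemy as sa

class WorkerLocalCache:
    def __init__(self, db_url: str):
        self._db_url = db_url  # plain string: safe to pickle
        self._engine = None    # engine is rebuilt on each worker

    @property
    def engine(self):
        if self._engine is None:
            self._engine = sa.create_engine(self._db_url)
        return self._engine

    def __getstate__(self):
        # Drop the unpicklable engine; only the URL travels to workers.
        return {"_db_url": self._db_url, "_engine": None}
```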
@toza-mimoza - if you have a branch on GitHub with the modifications you have made, I will be happy to test it out. I tried the same with multithreading, but then we can’t distribute DIET and TED given that they take up the entire CPU; if they are spread across multiple machines using Ray, however, things become interesting.
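If cross-machine distribution is the goal, a rough Ray sketch of the idea could look like the following; train_diet and train_ted are hypothetical stand-ins, not the actual Rasa graph nodes:

```python
# Rough sketch with Ray: run the two heavy trainers as remote tasks so the
# cluster can place them on different machines with enough free CPUs.
# train_diet / train_ted are hypothetical stand-ins for the real graph nodes.
import ray

ray.init(address="auto")  # assumes an existing Ray cluster to connect to

@ray.remote(num_cpus=8)
def train_diet(training_data):
    ...  # placeholder for DIET training

@ray.remote(num_cpus=8)
def train_ted(training_data):
    ...  # placeholder for TED training

data_ref = ray.put({"stories": [], "nlu": []})  # toy payload
results = ray.get([train_diet.remote(data_ref), train_ted.remote(data_ref)])
```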
I remember marshmallow for JSON serialization of SQL objects in Java; I am quite sure there is something like this in Python.
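marshmallow does exist as a Python library, and its marshmallow-sqlalchemy extension can dump SQLAlchemy models to JSON-friendly dicts. A minimal sketch with a made-up CacheEntry model:

```python
# Minimal sketch with marshmallow-sqlalchemy (made-up CacheEntry model):
# dump an SQLAlchemy row into a plain, JSON-serializable dict.
import sqlalchemy as sa
from sqlalchemy.orm import declarative_base
from marshmallow_sqlalchemy import SQLAlchemyAutoSchema

Base = declarative_base()

class CacheEntry(Base):  # hypothetical table, for illustration only
    __tablename__ = "cache_entries"
    id = sa.Column(sa.Integer, primary_key=True)
    fingerprint = sa.Column(sa.String)
    result_path = sa.Column(sa.String)

class CacheEntrySchema(SQLAlchemyAutoSchema):
    class Meta:
        model = CacheEntry

entry = CacheEntry(id=1, fingerprint="abc123", result_path="/tmp/result")
as_dict = CacheEntrySchema().dump(entry)  # plain dict, safe to send as JSON
```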