Suppose the following diagram represents the GraphSchema nodes for the NLU pipeline components that need to be trained, connected according to the “needs” relation:
The motivation for this diagram is to find out which components can be run in parallel using multi-threading or multi-node clusters (either Dask or Ray).
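Independently of Rasa, the parallelizable layers of such a “needs” graph can be read off with a topological sort. Here is a minimal sketch with networkx, using made-up node names that only loosely mirror the schema:

```python
# Minimal sketch (hypothetical node names): build the "needs" DAG and
# group nodes into layers that could run concurrently.
import networkx as nx

# Edge (a, b) means "b needs a", i.e. a must finish before b can start.
needs = [
    ("tokenizer", "featurizer"),
    ("featurizer", "DIETClassifier"),
    ("story_provider", "TEDPolicy"),
    ("story_provider", "RulePolicy"),
]

graph = nx.DiGraph(needs)

# Each "generation" contains nodes whose dependencies are all satisfied,
# so every node within one generation could be scheduled in parallel.
for layer, nodes in enumerate(nx.topological_generations(graph)):
    print(f"layer {layer}: can run in parallel -> {sorted(nodes)}")
```

The width of the widest generation is then an upper bound on how much useful parallelism the schedule can exploit.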
Testing with different numbers of threads for training revealed no significant improvement in training time beyond 4 threads. My assumption for why this is the case is the 4 policies that are trained in parallel on the right side of the diagram.
However, is there any room for improvement with more than 4 threads?
Is my assumption about why 4 threads are optimal logical and/or correct?
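For what it’s worth, saturation at 4 threads is exactly what you would expect if the widest layer of the graph contains 4 independent heavy tasks. A toy Dask experiment (synthetic sleep tasks standing in for the real components, not actual Rasa training) reproduces the effect:

```python
# Toy experiment (synthetic tasks, not the real Rasa components):
# with 4 independent heavy branches, wall time stops improving past 4 workers.
import time
import dask
from dask import delayed

@delayed
def heavy_task(name, seconds=2):
    time.sleep(seconds)  # stand-in for training one policy
    return name

branches = [heavy_task(f"policy_{i}") for i in range(4)]

for workers in (1, 2, 4, 8):
    start = time.perf_counter()
    dask.compute(*branches, scheduler="threads", num_workers=workers)
    print(f"{workers} worker(s): {time.perf_counter() - start:.1f}s")
```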
The diagram is constructed using this output of GraphSchema:
output_graph_schema.txt (7.6 KB)
Note: I have disabled the cache to test the full training optimization.
Btw: what are the “Resource” relations in the file above?
One thing I fear, though: both DIET and TED will try to take all the CPUs of the worker machine in order to train quicker. If these two algorithms were to train at the same time, I can imagine that the total training time might decrease. Most of the other components besides DIET/TED don’t take much compute time, so I’m wondering how effective this approach would be.
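One knob that might help here, assuming both components run on the default TensorFlow backend, is capping each trainer’s thread pools so that DIET and TED don’t both grab every core. A sketch of the idea (the core counts are just assumptions):

```python
# Sketch: cap TensorFlow's thread pools before any model is created, so two
# trainers sharing one machine do not both oversubscribe all CPUs.
# (Assumes the default TensorFlow backend; must run before TF ops exist.)
import tensorflow as tf

CORES_PER_TRAINER = 4  # assumption: half of an 8-core worker per trainer

tf.config.threading.set_intra_op_parallelism_threads(CORES_PER_TRAINER)
tf.config.threading.set_inter_op_parallelism_threads(2)
```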
The decrease in training time is significant: around 12 seconds (from 36 s to 24 s) on a local machine, and on slower Azure VMs it decreased roughly from 55 s to 42 s. Current Rasa training only uses a single-threaded Dask graph runner; just using two threads instead of one decreases the training time by a whole 8 seconds.
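For anyone who wants to reproduce the effect outside of Rasa, the difference boils down to running the same task graph on Dask’s synchronous scheduler versus the threaded one. A sketch with a hand-written graph dict (toy keys and tasks, not the real schema):

```python
# Sketch: the same low-level Dask graph executed single-threaded vs. with a
# thread pool. Keys and tasks here are toy placeholders, not the real schema.
import time
import dask
import dask.threaded

def slow(x):
    time.sleep(1)
    return x

dsk = {
    "a": (slow, 1),
    "b": (slow, 2),
    "total": (sum, ["a", "b"]),
}

dask.get(dsk, "total")                          # synchronous: ~2 s
dask.threaded.get(dsk, "total", num_workers=2)  # threaded:    ~1 s
```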
Since this experiment was for my thesis, I used only the initial bare-minimum dataset provided by Rasa. The diagram above will be included in the thesis, but I don’t know if it’s correct.
The biggest downside is that I had to disable caching, not only because I wanted to test the full training, but also because the cache contains an SQLAlchemy object which is not serializable between nodes in the Dask/Ray cluster. I suppose that is something that can be addressed in future work.
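On the serialization issue, one common workaround (not something Rasa does today, just a sketch) is to ship only the database URL across the cluster and rebuild the SQLAlchemy engine lazily on each worker instead of pickling it:

```python
# Sketch of a workaround (not current Rasa behaviour): keep only the
# picklable DB URL, and build the SQLAlchemy engine lazily per worker.
import sqlalchemy as sa

class WorkerLocalCache:
    def __init__(self, db_url: str):
        self._db_url = db_url  # plain string: safe to pickle
        self._engine = None    # engine is rebuilt on each worker

    @property
    def engine(self):
        if self._engine is None:
            self._engine = sa.create_engine(self._db_url)
        return self._engine

    def __getstate__(self):
        # Drop the unpicklable engine; only the URL travels to workers.
        return {"_db_url": self._db_url, "_engine": None}
```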
@toza-mimoza - if you have a branch on GitHub with the modifications you have made, I will be happy to test it out. I tried the same with multithreading, but then we can’t distribute DIET and TED given that they take up the entire CPU; if they are spread across multiple machines using Ray, however, things become interesting.
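If cross-machine distribution is the goal, a rough Ray sketch of the idea could look like the following; train_diet and train_ted are hypothetical stand-ins, not the actual Rasa graph nodes:

```python
# Rough sketch with Ray: run the two heavy trainers as remote tasks so the
# cluster can place them on different machines with enough free CPUs.
# train_diet / train_ted are hypothetical stand-ins for the real graph nodes.
import ray

ray.init(address="auto")  # assumes an existing Ray cluster to connect to

@ray.remote(num_cpus=8)
def train_diet(training_data):
    ...  # placeholder for DIET training

@ray.remote(num_cpus=8)
def train_ted(training_data):
    ...  # placeholder for TED training

data_ref = ray.put({"stories": [], "nlu": []})  # toy payload
results = ray.get([train_diet.remote(data_ref), train_ted.remote(data_ref)])
```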
I remember marshmallow for JSON serialization of SQL objects in Java; I am quite sure there is something like this in Python.
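marshmallow does exist as a Python library, and its marshmallow-sqlalchemy extension can dump SQLAlchemy models to JSON-friendly dicts. A minimal sketch with a made-up CacheEntry model:

```python
# Minimal sketch with marshmallow-sqlalchemy (made-up CacheEntry model):
# dump an SQLAlchemy row into a plain, JSON-serializable dict.
import sqlalchemy as sa
from sqlalchemy.orm import declarative_base
from marshmallow_sqlalchemy import SQLAlchemyAutoSchema

Base = declarative_base()

class CacheEntry(Base):  # hypothetical table, for illustration only
    __tablename__ = "cache_entries"
    id = sa.Column(sa.Integer, primary_key=True)
    fingerprint = sa.Column(sa.String)
    result_path = sa.Column(sa.String)

class CacheEntrySchema(SQLAlchemyAutoSchema):
    class Meta:
        model = CacheEntry

entry = CacheEntry(id=1, fingerprint="abc123", result_path="/tmp/result")
as_dict = CacheEntrySchema().dump(entry)  # plain dict, safe to send as JSON
```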