Data partitioning / sharding for training

How feasible would it be to adapt Rasa for distributed training, or to train asynchronously by means of data partitioning / sharding? I know TensorFlow has functionality like parameter servers, but roughly how much would need to change (relatively speaking; I'm obviously not looking for an exact amount) to make use of it?
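For context, here is roughly the kind of built-in TensorFlow distributed setup I mean. This is just a minimal, generic tf.distribute sketch (using MultiWorkerMirroredStrategy as a stand-in; the toy model and data are placeholders and have nothing to do with Rasa's actual models):

```python
import tensorflow as tf

# Minimal sketch of TensorFlow's built-in data-parallel training.
# Assumes the TF_CONFIG environment variable describes the cluster;
# without it, the strategy falls back to a single worker.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Variables created inside the scope are replicated across workers,
    # and gradients are all-reduced after each step.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation="relu", input_shape=(4,)),
        tf.keras.layers.Dense(3, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Toy dataset; in a real multi-worker run each worker would read only its
# own shard (tf.data auto-sharding handles this by default).
x = tf.random.uniform((128, 4))
y = tf.random.uniform((128,), maxval=3, dtype=tf.int32)
dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(16)

model.fit(dataset, epochs=2)
```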

Imagine, for example, Processing Unit A (Unit_a) having data on points A, B, C, D (P_a → P_d) plus the destination point X (P_x), and Processing Unit B (Unit_b) having data on points E, F, and G (P_e → P_g) plus the same destination point X (P_x). During one round, Unit_a and Unit_b each locate the most accurate path to P_x: Unit_a finds P_a → P_d → P_b → P_c → P_x, and Unit_b finds P_f → P_e → P_g → P_x. After that round, Unit_a receives the data for the points used by Unit_b, Unit_b receives the data for the points used by Unit_a, and another round is performed with the swapped datasets. (In a multi-node setup, each node would receive the data partition of another node until all possible swaps have been exhausted.) Eventually, the most accurate overall path (P_d → P_g → P_b → P_e → P_f → P_x) is found and used. A toy sketch of this rotation scheme follows below.
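Just to make the swapping scheme concrete, here is a toy sketch of what I have in mind. It is plain Python with hypothetical names (rotate_shards, best_path_on); it is not Rasa or TensorFlow code, only an illustration of shards rotating between units each round:

```python
from typing import Dict, List

def rotate_shards(shards: List[List[str]]) -> List[List[str]]:
    """Shift every shard to the next unit (round-robin rotation)."""
    return shards[-1:] + shards[:-1]

def best_path_on(shard: List[str], destination: str) -> List[str]:
    """Placeholder scoring step; a real system would run its optimiser here."""
    return sorted(shard) + [destination]

# Unit_a starts with P_a..P_d, Unit_b with P_e..P_g; both target P_x.
shards = [["P_a", "P_b", "P_c", "P_d"], ["P_e", "P_f", "P_g"]]
destination = "P_x"
best: Dict[int, List[str]] = {}

# One full cycle: every unit sees every shard exactly once.
for _ in range(len(shards)):
    for unit_idx, shard in enumerate(shards):
        candidate = best_path_on(shard, destination)
        # Keep the unit's best candidate so far (toy criterion: longest path).
        if unit_idx not in best or len(candidate) > len(best[unit_idx]):
            best[unit_idx] = candidate
    shards = rotate_shards(shards)

print(best)
```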

It seems like this methodology is supported in TensorFlow, but how much of the Rasa codebase would need to change to take advantage of it? I imagine it's not just a matter of flipping a switch in a call to the TF library somewhere.