Data not persisting for train_MemoizationPolicy0

Background: After distributing the Dask graph to a Ray cluster with only a single node (discussed in this post), I tried the same code with a multi-node cluster but was met with this error:

File "/home/azureuser/bot/Raysa-Rasa/rasa/core/policies/memoization.py", line 184, in train
self.persist()
File "/home/azureuser/bot/Raysa-Rasa/rasa/core/policies/memoization.py", line 269, in persist
with self._model_storage.write_to(self._resource) as path:
File "/home/azureuser/miniconda3/envs/raysa_env/lib/python3.7/contextlib.py", line 112, in __enter__
return next(self.gen)
File "/home/azureuser/bot/Raysa-Rasa/rasa/engine/storage/local_model_storage.py", line 121, in write_to
directory.mkdir()
File "/home/azureuser/miniconda3/envs/raysa_env/lib/python3.7/pathlib.py", line 1273, in mkdir
self._accessor.mkdir(self, mode)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmp6i3dehxe/train_MemoizationPolicy0'

Pathlib’s mkdir fails to create a directory for the Memoization policy. What’s strange is that persistence succeeds for every component up to this policy but fails for this one, even though all of them are written to storage by the same LocalModelStorage write_to method. I have tried setting parents=True to prevent the FileNotFoundError, but that did not help; I get exactly the same error as if nothing had changed.
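For context, here is a small self-contained sketch (with hypothetical paths, not taken from the Rasa code) of the pathlib behaviour I am describing: mkdir() raises FileNotFoundError when the parent directory is missing, while parents=True creates the missing parents first. It only reproduces the local filesystem behaviour, not the cluster setup.

    # Illustration only: hypothetical temp directory, not the actual Rasa storage path.
    from pathlib import Path
    import shutil
    import tempfile

    base = Path(tempfile.mkdtemp())      # e.g. /tmp/tmpXXXXXXXX
    shutil.rmtree(base)                  # simulate a parent directory that no longer exists

    target = base / "train_MemoizationPolicy0"
    try:
        target.mkdir()                   # parent is gone -> FileNotFoundError
    except FileNotFoundError as exc:
        print("mkdir failed:", exc)

    target.mkdir(parents=True)           # recreates the missing parent as well
    print("created:", target.exists())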

Here is the complete error output: error_output.txt (22.5 KB). Any help or a pointer to where I should look would be greatly appreciated!

Edit 1: Should I rebuild the bot from source?

Hi @toza-mimoza

Really cool to see the progress you’ve made with distributing the Dask training! This is something we have had in mind for the future but have not properly looked into yet, so we expect there to be challenges involved.

In this example, which policies have persisted successfully?

Hi @jjuzl,

I managed to solve it, and it was an easy fix. Nothing was wrong with my code: I had not pulled the latest code on the two other nodes, so the workers there could not serialize the policies.
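For anyone hitting something similar, a quick sanity check is to confirm that every Ray node can import the same code. This is only a rough sketch under my own assumptions (number of probe tasks, connecting with address="auto"), not part of the original fix:

    # Probe which rasa version each Ray node imports; illustrative, not from the original thread.
    import ray

    ray.init(address="auto")  # connect to the running cluster

    @ray.remote
    def probe():
        import rasa
        from ray.util import get_node_ip_address
        # Report which node ran this task and which rasa version it imported.
        return get_node_ip_address(), rasa.__version__

    # Launch more probes than nodes so each node is likely to run at least one.
    results = set(ray.get([probe.remote() for _ in range(20)]))
    for ip, version in sorted(results):
        print(ip, version)

If the versions differ (or the import fails on some nodes), the code on those nodes is out of date, which is exactly what happened in my case.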