current_app.agent.handle(msg) hangs in Flask, but only in production mode

This is a head-scratcher. I am still trying to figure out where exactly the issue lies, but I thought I'd drop a quick question here in case somebody else has seen something like this before.

I have a full Flask app, containerized in Docker, that calls an external NLU (also containerized). Everything works fine in dev mode (not containerized, just a Python virtual env), but in production current_app.agent.handle(msg) hangs.

I can see in the NLU logs that the message was sent to the NLU, so I don’t understand why it hangs.

I have tested the multi-threaded code from https://github.com/RasaHQ/rasa_core/issues/817 to see whether it is a threading issue (which would explain why it works in dev mode but not in prod), but the threading seems to work fine in the container.

Any ideas?

I guess I'll just keep digging to figure out where it hangs…

It seems this has something to do with the action probability prediction of one of the policies, because it gets stuck here: rasa_core/policies/ensemble.py#L215

I have 3 policies: MemoizationPolicy(), KerasPolicy(), fallback

It only gets stuck when predicting with the KerasPolicy, so the problem must be there.

It is fixed now!

In Flask, I had to move my Rasa Agent setup code (which uses Keras with the TensorFlow backend for prediction) into a function decorated with @app.before_first_request, and then store the agent on the app object (accessed via current_app). That fixed the issue for my project.
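For anyone landing here, this is a minimal, framework-free sketch of the idea behind the fix: don't build the agent at import time, but defer it until the first request is actually being served, so the Keras/TensorFlow model is created inside the worker that uses it. Everything named here (LazyAgent, FakeAgent, build_agent) is hypothetical and only for illustration; FakeAgent is a stand-in and not the real Rasa Agent API.

```python
import threading


class LazyAgent:
    """Builds the agent on first use instead of at import time."""

    def __init__(self, factory):
        self._factory = factory          # callable that loads the model
        self._agent = None
        self._lock = threading.Lock()

    def get(self):
        # Double-checked locking: only the first caller pays the load cost,
        # and the load runs in the thread/process that serves the request.
        if self._agent is None:
            with self._lock:
                if self._agent is None:
                    self._agent = self._factory()
        return self._agent


class FakeAgent:
    """Hypothetical stand-in for a Rasa Agent; not the real API."""

    def __init__(self):
        # Remember which thread built us, to show where loading happened.
        self.loaded_in = threading.current_thread().name

    def handle(self, msg):
        return "handled: " + msg


def build_agent():
    # Assumption: in the real app this is where the Rasa Agent
    # (and with it the Keras model) would be loaded.
    return FakeAgent()


lazy_agent = LazyAgent(build_agent)      # cheap: nothing is loaded yet
```

In my Flask setup, the equivalent of lazy_agent.get() runs inside the @app.before_first_request hook and the result is stored on the app, so views can keep calling current_app.agent.handle(msg). Newer Flask releases (2.3 and later) no longer provide before_first_request, so a lazy getter along these lines is one possible way to get the same effect there.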

This really is a Keras bug (or feature); a lot of people have the same issue, see https://github.com/keras-team/keras/issues/2397

A footnote in the Rasa docs might be helpful for future users facing the same issue.