concurrent.futures._base.CancelledError with long running custom actions

Hello!

I have a setup where I’m running rasa docker images (version 1.10.8) for rasa itself and the other one for action server. We have had this system running for quite a while without problems so I think the base setup is fine. Lately when we are developing custom actions that take more than 10s we have started to get these CancelledError exceptions. I have tried to set --response-timeout etc. but it doesn’t seem to help. I have debugged issue some time now and it seems that the main rasa container terminates the TCP connection between itself and action server around 10s after it has send request to the action server. Exception starts from aiohttp/streams.py if I studied it correctly.

My setup is running on my laptop in VirtualBox, there I have CentOS linux image. In that linux image I have docker swarm running and that launches then rasa containers and database, nginx, … and other neede containers.

I have also tried to check tcp keepalives etc. because I know these can have some issue especially with docker swarm.

Has anyone else run into this kind of issue? Is there any solution for this problem?

BR,

Here is the last piece of code I can get into, as you can see state is CANCELLED…

Just to be sure I created new docker container setup with Rasa 1.10.8-full version. I implemented custom action called action_joke where I simply sleep for 60 seconds. Around 10 seconds this appears onto the shell container:

2020-08-12 13:45:37 DEBUG    rasa.core.actions.action  - Calling action endpoint to run action 'action_joke'.
2020-08-12 13:45:48 ERROR    asyncio  - Task exception was never retrieved
future: <Task finished coro=<configure_app.<locals>.run_cmdline_io() done, defined at /opt/venv/lib/python3.7/site-packages/rasa/core/run.py:128> exception=TimeoutError()>
Traceback (most recent call last):
  File "/opt/venv/lib/python3.7/site-packages/rasa/core/run.py", line 134, in run_cmdline_io
sender_id=conversation_id,
  File "/opt/venv/lib/python3.7/site-packages/rasa/core/channels/console.py", line 151, in record_messages
async for response in bot_responses:
  File "/opt/venv/lib/python3.7/site-packages/rasa/core/channels/console.py", line 107, in send_message_receive_stream
async for line in resp.content:
  File "/opt/venv/lib/python3.7/site-packages/aiohttp/streams.py", line 39, in __anext__
rv = await self.read_func()
  File "/opt/venv/lib/python3.7/site-packages/aiohttp/streams.py", line 328, in readline
await self._wait('readline')
  File "/opt/venv/lib/python3.7/site-packages/aiohttp/streams.py", line 296, in _wait
await waiter
  File "/opt/venv/lib/python3.7/site-packages/aiohttp/helpers.py", line 596, in __exit__
raise asyncio.TimeoutError from None
concurrent.futures._base.TimeoutError
[2020-08-12 13:45:48 +0000] [1] [ERROR] Transport closed @ ('127.0.0.1', 43020) and exception experienced during error handling

Hi Timo,

You should be able to increase the Sanic timeout as reported here.

export SANIC_RESPONSE_TIMEOUT=360
export SANIC_REQUEST_TIMEOUT=360

Greg

Hi!

Thank’s for your reply Greg!

I think those environment variables have only meaning with timeouts happening between actual Rasa and the “chat client”, Slack in that case. I think that my timeout problem is between actual Rasa and the action server.

I started the “rasa shell” container with command:

docker run -u root -ti --network rasatest --rm -e SANIC_RESPONSE_TIMEOUT=360 -e SANIC_REQUEST_TIMEOUT=360 --name rasa -p 5005:5005 -v $(pwd):/app rasa/rasa:1.10.8-full shell --debug --enable-api

and the action server container with command:

docker run -u root -ti --network rasatest --rm --name actionserver -e SANIC_RESPONSE_TIMEOUT=360 -e SANIC_REQUEST_TIMEOUT=360 -p 5055:5055 -v $(pwd):/app rasa/rasa:1.10.8-full run --debug --enable-api actions

I have same behaviour in the shell, it goes into timeout after about 10s

2020-08-13 04:34:37 DEBUG    rasa.core.processor  - Predicted next action 'action_joke' with confidence 1.00.
2020-08-13 04:34:37 DEBUG    rasa.core.actions.action  - Calling action endpoint to run action 'action_joke'.
2020-08-13 04:34:48 ERROR    asyncio  - Task exception was never retrieved
future: <Task finished coro=<configure_app.<locals>.run_cmdline_io() done, defined at /opt/venv/lib/python3.7/site-packages/rasa/core/run.py:128> exception=TimeoutError()>
Traceback (most recent call last):
File "/opt/venv/lib/python3.7/site-packages/rasa/core/run.py", line 134, in run_cmdline_io
sender_id=conversation_id,
File "/opt/venv/lib/python3.7/site-packages/rasa/core/channels/console.py", line 151, in record_messages
async for response in bot_responses:
File "/opt/venv/lib/python3.7/site-packages/rasa/core/channels/console.py", line 107, in send_message_receive_stream
async for line in resp.content:
File "/opt/venv/lib/python3.7/site-packages/aiohttp/streams.py", line 39, in __anext__
rv = await self.read_func()
File "/opt/venv/lib/python3.7/site-packages/aiohttp/streams.py", line 328, in readline
await self._wait('readline')
File "/opt/venv/lib/python3.7/site-packages/aiohttp/streams.py", line 296, in _wait
await waiter
File "/opt/venv/lib/python3.7/site-packages/aiohttp/helpers.py", line 596, in __exit__
raise asyncio.TimeoutError from None
concurrent.futures._base.TimeoutError
[2020-08-13 04:34:48 +0000] [1] [ERROR] Transport closed @ ('127.0.0.1', 43304) and exception experienced during error handling

I tried these env variables also in my more complex setup with the same result.

Hmm… I ran also this command in the container:

# This is where all the rasa stuff should be, right?
cd /opt/venv

# Search for SANIC_RESPONSE_TIMEOUT
grep -r SANIC_RESPONSE_TIMEOUT .

# and the result is empty, so that environment variable doesn't seem to have any effect in Rasa???

Last trial, I made a new Rasa image where I removed sanic and uvloop and then reinstalled sanic without uvloop, still no success.

No I’m out of ideas…

Is there anyone who has custom actions running for example for 30 secs without problems?

Do we need to implement something that is just kicked by the custom action to do heavy work and eventually respond to the msteams discussion, and Rasa custom action would just reply "This will take a looong time, sit tight… " after it has spawned that external job?

Ok, if I run shell in the other container then longer timeout for custom actions can be changed as described here:

Have to do some debugging for botframework channel that how I could the same there… and why it seems to be the same 10sec there…

Any bright ideas?

To go around this “feature” we will change our custom action to send message into RabbitMQ or similar and then something to consume those messages and to send replies into msteams discussion.

Looking at the Rasa Core code, it looks like the client timeout is set to 5 minutes: rasa/constants.py at 1.10.8 · RasaHQ/rasa · GitHub . So this is not Rasa Core timing out, it’s the actions server that is cancelling itself, and not sending a response.

For SANIC_RESPONSE_TIMEOUT, are you sure that you’re setting that for the actions server, not for the core server?

I’m quite sure that action server is not cancelling anything, there is no network traffic between these two containers when timeout/cancellation happens.

action server is still running the time.sleep(60) when that timeout occurs.

I “grepped” the code on both, action server and “main” rasa server containers and I couldn’t find any reference in the source code for SANIC_RESPONSE_TIMEOUT or SANIC_REQUEST_TIMEOUT.

This exception is coming either from aiohttp or from something behind it.

To clarify, I have one container for the “main” rasa and the other one for the action server.

Some update to this, SANIC_RESPONSE_TIMEOUT and SANIC_REQUEST_TIMEOUT and used by Sanic library, therefore not visible in the Rasa source code…

This is the last line I could goto with debugger, here it stays about 10s and then exception;

This future is cancelled, but why?

I see you’re using the console to test this. I found this in the console code:

Used here: rasa/console.py at 1.10.8 · RasaHQ/rasa · GitHub

If I’m understanding this correctly, for the CLI tool, what happens is that Rasa Core starts up an HTTP server, and then the CLI tool interacts with that.

The CLI tool itself has a timeout for how long it waits for a response from Rasa Core, and that is the timeout that is triggering here.

So Timeout error on rasa shell · Issue #4606 · RasaHQ/rasa · GitHub seems to be on the right track. I would guess then that this is a channel issue, not an issue with rasa core or the action server timing out.

What channel implementation are you using?

I’m using the botframework channel.

I tried also that setting you mentioned with console channel and it works there.

I think then botframework might be cancelling its request to Rasa after a timeout? Although I don’t know enough about botframework to be able to know what the timeout there is, or how to configure it.

The only thing I could find is this, which mentions 5 second timeouts: Add custom bots to Microsoft Teams with outgoing webhooks - Teams | Microsoft Docs

I have seen this aiohttp ClientTimeout structure passed from Rasa towards HTTP connection with the action server. It has total timeout of 300 seconds which should be more than enough. For som reason on the lower levels on the HTTP connection is some sort of timeout about after 10 secs. It is just somewhere so deep that I wonder if it is caused by my infrastructure somehow… ie VirtualBox + Docker Swarm doing something strange…

Same behaviour also with 1.10.10 version of Rasa

Any further update on this issue, I am also facing same issue with rasa version 1.8.1. Looks like Slack sends multiple retry messages to the chatbot. Tried various fix from the comments by others, but still no luck. Any version with fix for this issue?

Thank you in advance for the help on it

rasa_1 | 2020-09-23 19:37:08 DEBUG rasa.core.policies.ensemble - Predicted next action using policy_1_KerasPolicy rasa_1 | 2020-09-23 19:37:08 DEBUG rasa.core.processor - Predicted next action ‘action_execute_report’ with confidence 1.00. rasa_1 | 2020-09-23 19:37:08 DEBUG rasa.core.actions.action - Calling action endpoint to run action ‘action_execute_report’. rasa_1 | 2020-09-23 19:37:11 ERROR rasa.core.processor - Encountered an exception while running action ‘action_execute_report’. Bot will continue, but the actions events are lrasa_1 | 2020-09-23 19:37:11 DEBUG rasa.core.processor - rasa_1 | Traceback (most recent call last): rasa_1 | File “/opt/venv/lib/python3.6/site-packages/rasa/core/processor.py”, line 596, in _run_action rasa_1 | events = await action.run(output_channel, nlg, tracker, self.domain) rasa_1 | File “/usr/local/lib/python3.6/asyncio/coroutines.py”, line 129, in throw rasa_1 | return self.gen.throw(type, value, traceback) rasa_1 | File “/opt/venv/lib/python3.6/site-packages/rasa/core/actions/action.py”, line 549, in run rasa_1 | json=json_body, method=“post”, timeout=DEFAULT_REQUEST_TIMEOUT rasa_1 | File “/usr/local/lib/python3.6/asyncio/coroutines.py”, line 129, in throw rasa_1 | return self.gen.throw(type, value, traceback) rasa_1 | File “/opt/venv/lib/python3.6/site-packages/rasa/utils/endpoints.py”, line 146, in request rasa_1 | **kwargs, rasa_1 | File “/usr/local/lib/python3.6/asyncio/coroutines.py”, line 129, in throw rasa_1 | return self.gen.throw(type, value, traceback) rasa_1 | File “/opt/venv/lib/python3.6/site-packages/aiohttp/client.py”, line 1012, in aenter rasa_1 | self._resp = await self._coro rasa_1 | File “/usr/local/lib/python3.6/asyncio/coroutines.py”, line 129, in throw rasa_1 | return self.gen.throw(type, value, traceback) rasa_1 | File “/opt/venv/lib/python3.6/site-packages/aiohttp/client.py”, line 504, in _request rasa_1 | await resp.start(conn) rasa_1 | File “/usr/local/lib/python3.6/asyncio/coroutines.py”, line 129, in throw rasa_1 | return self.gen.throw(type, value, traceback) rasa_1 | File “/opt/venv/lib/python3.6/site-packages/aiohttp/client_reqrep.py”, line 847, in start rasa_1 | message, payload = await self._protocol.read() # type: ignore # noqa rasa_1 | File “/usr/local/lib/python3.6/asyncio/coroutines.py”, line 129, in throw rasa_1 | return self.gen.throw(type, value, traceback) rasa_1 | File “/opt/venv/lib/python3.6/site-packages/aiohttp/streams.py”, line 591, in read rasa_1 | await self._waiter rasa_1 | concurrent.futures._base.CancelledError rasa_1 | 2020-09-23 19:37:11 DEBUG rasa.core.processor - Action ‘action_execute_report’ ended with events ‘[]’.