
Troubleshooting

Obscure task failures

Task state changed externally

A task’s state can be changed by a component other than the executor for many reasons, which can be confusing when you review task instance or scheduler logs.

The following scenarios are examples of such external state changes:

TaskRunner killed

Sometimes, Airflow or some adjacent system will kill a task instance’s TaskRunner, causing the task instance to fail.

Here are some examples that could cause such an event:

  • A DAG run timeout, specified by dagrun_timeout in the DAG’s definition.

  • An Airflow worker running out of memory. Usually, an Airflow worker that runs out of memory receives a SIGKILL, and the scheduler then fails the corresponding task instance because it no longer has a heartbeat. In some scenarios, however, Airflow kills the task before that happens.
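When reviewing logs for a killed TaskRunner, it can help to decode the exit code a process manager or container runtime reported. The sketch below is not part of Airflow; describe_exit_code is a hypothetical helper that relies only on two POSIX conventions: Python's subprocess module reports a process killed by signal N as return code -N, and shells typically report it as 128 + N. It assumes a Linux or other POSIX host.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Describe a process exit code (hypothetical helper, not an Airflow API)."""
    if code == 0:
        return "exited normally"
    if code < 0:
        # subprocess convention: -N means "terminated by signal N".
        signum = -code
    elif code > 128:
        # Shell convention: 128 + N means "terminated by signal N".
        signum = code - 128
    else:
        return f"exited with error code {code}"
    try:
        name = signal.Signals(signum).name
    except ValueError:
        # The number does not map to a signal known on this platform.
        return f"terminated by unknown signal {signum}"
    return f"terminated by {name}"


if __name__ == "__main__":
    for code in (-9, 137, 1):
        print(code, "->", describe_exit_code(code))
```

An exit code that decodes to SIGKILL is consistent with, but does not prove, an out-of-memory kill; the operating system or container runtime logs are needed to confirm the cause.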

Lingering task supervisor processes

Under very high concurrency, the socket handlers inside the task supervisor may miss the final EOF events from the task process. When this occurs, the supervisor believes that sockets are still open and does not exit. The workers.socket_cleanup_timeout option controls how long the supervisor waits after the task finishes before force-closing any remaining sockets. If you observe leftover supervisor processes, consider increasing this delay.
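Airflow options can generally be overridden through environment variables named AIRFLOW__{SECTION}__{KEY}. As a minimal sketch, the snippet below derives that name for workers.socket_cleanup_timeout. The helper airflow_env_var and the value 120 are illustrative assumptions, not recommendations; check the configuration reference for your Airflow version for the option's unit and default.

```python
import os


def airflow_env_var(section: str, key: str) -> str:
    """Build the environment variable name for an Airflow config option.

    Hypothetical helper that follows the AIRFLOW__{SECTION}__{KEY} convention.
    """
    return f"AIRFLOW__{section.upper()}__{key.upper()}"


name = airflow_env_var("workers", "socket_cleanup_timeout")

# Illustrative value only; the unit is assumed to be seconds.
os.environ[name] = "120"

print(name)
```

Setting os.environ in this way affects only the current Python process and its children. For a real deployment, the variable would need to be present in the worker's environment before Airflow starts, or the option set in the [workers] section of airflow.cfg.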
