
Troubleshooting

Obscure task failures

Task state changed externally

A task’s state can be changed by a component other than the executor for many reasons, which can be confusing when you review task instance or scheduler logs.

Below are some example scenarios in which a component other than the executor changes a task’s state:

Process terminated by signal

Sometimes, Airflow or some adjacent system will kill a task instance’s TaskRunner, causing the task instance to fail.

Below we discuss a few common cases.

Dag run timeout

A dag run timeout can be specified with dagrun_timeout in the dag’s definition. When a run exceeds this timeout, the task process is likely to be killed with SIGTERM (exit code -15).
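To make the "exit code -15" notation concrete, here is a minimal sketch that uses only the Python standard library, not Airflow itself. It assumes a POSIX system such as Linux; the child process and the variable names are illustrative stand-ins for a real task process.

```python
import signal
import subprocess
import sys

# Start a child Python process that would otherwise sleep for a minute.
# It stands in for a long-running task process (illustrative only).
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])

# Send SIGTERM, as an external component might when a timeout is reached.
child.send_signal(signal.SIGTERM)
child.wait(timeout=10)

# On POSIX, a negative return code -N means "terminated by signal N".
returncode = child.returncode
print(f"child return code: {returncode}")
```

SIGTERM is signal number 15 on Linux, so a process terminated this way reports a return code of -15, which matches the value that typically appears in task logs.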

Out of memory error (OOM)

When a task process consumes too much memory for a worker, in the best case it is killed with SIGKILL (exit code -9). Depending on configuration and infrastructure, the whole worker may instead be killed due to OOM, in which case its tasks are marked as failed after they fail to heartbeat.
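When reading logs, it can help to translate a negative exit code back into a signal name. The helper below is a hypothetical sketch (describe_exit_code is not part of Airflow) that relies on the standard library's signal module and assumes a POSIX system.

```python
import signal


def describe_exit_code(returncode: int) -> str:
    """Turn a process return code into a short human-readable description."""
    if returncode < 0:
        # Negative values follow the POSIX subprocess convention:
        # -N means the process was terminated by signal N.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"unknown signal {-returncode}"
        return f"killed by {name}"
    if returncode == 0:
        return "exited normally"
    return f"exited with error code {returncode}"


print(describe_exit_code(-9))
print(describe_exit_code(-15))
```

On Linux, -9 maps to SIGKILL (commonly the kernel OOM killer, but also any other sender of SIGKILL) and -15 maps to SIGTERM, so the signal name alone narrows down the cause without proving it.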

Lingering task supervisor processes

Under very high concurrency, the socket handlers inside the task supervisor may miss the final EOF events from the task process. When this occurs, the supervisor believes the sockets are still open and does not exit. The workers.socket_cleanup_timeout option controls how long the supervisor waits after the task finishes before force-closing any remaining sockets. If you observe leftover supervisor processes, consider increasing this delay.
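One way to set this option is through Airflow's environment variable convention, AIRFLOW__{SECTION}__{KEY}. The sketch below only builds that variable name and sets it in the current process; the helper airflow_env_var and the value 120 are assumptions for illustration, so check the configuration reference for the option's unit and default before choosing a value.

```python
import os


def airflow_env_var(section: str, key: str) -> str:
    """Build the environment variable name for an Airflow config option."""
    # Airflow reads overrides from AIRFLOW__{SECTION}__{KEY} (double underscores).
    return f"AIRFLOW__{section.upper()}__{key.upper()}"


env_name = airflow_env_var("workers", "socket_cleanup_timeout")

# Hypothetical value; it must be set in the environment of the worker
# processes before they start in order to take effect.
os.environ[env_name] = "120"

print(env_name)
```

In a real deployment the variable would normally be set in the worker's service definition or container environment rather than from Python.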
