Troubleshooting¶
Obscure task failures¶
Task state changed externally¶
There are many potential causes for a task’s state to be changed by a component other than the executor, which might cause some confusion when reviewing task instance or scheduler logs.
Below are some example scenarios that could cause a task’s state to be changed by a component other than the executor:
If a task’s DAG failed to parse on the worker, the scheduler may mark the task as failed. If this is confirmed, consider increasing core.dagbag_import_timeout and dag_processor.dag_file_processor_timeout (see the configuration sketch after this list).
The scheduler will mark a task as failed if the task has been queued for longer than scheduler.task_queued_timeout.
If a task instance’s heartbeat times out, it will be marked failed by the scheduler.
A user marked the task as successful or failed in the Airflow UI.
An external script or process used the Airflow REST API to change the state of a task.
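As a rough illustration of the settings mentioned above, the timeouts can be raised in airflow.cfg or via the equivalent AIRFLOW__&lt;SECTION&gt;__&lt;KEY&gt; environment variables. The numbers below are placeholders, not recommendations; tune them for your deployment:

```ini
# airflow.cfg -- placeholder values only
[core]
# Seconds allowed for importing a DAG file before the DagBag import times out
dagbag_import_timeout = 60

[dag_processor]
# Seconds a DAG file processor may run before it is terminated
dag_file_processor_timeout = 120

[scheduler]
# Seconds a task may remain in the queued state before the scheduler fails it
task_queued_timeout = 900
```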
TaskRunner killed¶
Sometimes, Airflow or some adjacent system will kill a task instance’s TaskRunner, causing the task instance to fail.
Here are some examples that could cause such an event:
A DAG run timeout, specified by dagrun_timeout in the DAG’s definition (see the example DAG after this list).
An Airflow worker running out of memory. Usually, Airflow workers that run out of memory receive a SIGKILL, and the scheduler will fail the corresponding task instance for not having a heartbeat. However, in some scenarios, Airflow kills the task before that happens.
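For reference, here is a minimal sketch of a DAG that sets dagrun_timeout. The DAG id, schedule, and timeout value are illustrative, and the import path for DAG may differ between Airflow versions:

```python
from datetime import datetime, timedelta

from airflow import DAG  # newer releases also expose DAG via the task SDK

# Illustrative only: if a run of this DAG exceeds 30 minutes, the run is timed
# out and its still-running task instances can be killed (the event described
# above).
with DAG(
    dag_id="example_dagrun_timeout",      # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    dagrun_timeout=timedelta(minutes=30),
    catchup=False,
) as dag:
    ...  # define the DAG's tasks here
```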
Lingering task supervisor processes¶
Under very high concurrency the socket handlers inside the task supervisor may miss the final EOF events from the task process. When this occurs the supervisor believes sockets are still open and will not exit. The workers.socket_cleanup_timeout option controls how long the supervisor waits after the task finishes before force-closing any remaining sockets. If you observe leftover supervisor processes, consider increasing this delay.
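A minimal sketch of raising this option in airflow.cfg, assuming the value is given in seconds; the number below is a placeholder, so check your version’s default before changing it:

```ini
# airflow.cfg -- placeholder value only
[workers]
# How long the supervisor waits after a task finishes before force-closing
# any sockets it still believes are open
socket_cleanup_timeout = 60
```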