airflow.providers.apache.spark.operators.spark_pipelines

Classes

SparkPipelinesOperator

Execute Spark Declarative Pipelines using the spark-pipelines CLI.

Module Contents

class airflow.providers.apache.spark.operators.spark_pipelines.SparkPipelinesOperator(*, pipeline_spec=None, pipeline_command='run', conf=None, conn_id='spark_default', num_executors=None, executor_cores=None, executor_memory=None, driver_memory=None, verbose=False, env_vars=None, deploy_mode=None, yarn_queue=None, keytab=None, principal=None, openlineage_inject_parent_job_info=conf.getboolean('openlineage', 'spark_inject_parent_job_info', fallback=False), openlineage_inject_transport_info=conf.getboolean('openlineage', 'spark_inject_transport_info', fallback=False), **kwargs)[source]

Bases: airflow.providers.common.compat.sdk.BaseOperator

Execute Spark Declarative Pipelines using the spark-pipelines CLI.

This operator wraps the spark-pipelines binary to execute declarative data pipelines. It supports running pipelines, dry-runs for validation, and initializing new pipeline projects.

See also

For more information on Spark Declarative Pipelines, see the guide: https://spark.apache.org/docs/latest/declarative-pipelines-programming-guide.html
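A minimal usage sketch (the DAG id, task ids, and spec path are illustrative, and the example assumes the `apache-spark` provider package is installed):

```python
import pendulum

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_pipelines import (
    SparkPipelinesOperator,
)

with DAG(
    dag_id="spark_pipelines_example",  # hypothetical DAG id
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    schedule=None,
) as dag:
    # Validate the pipeline spec without executing it.
    validate = SparkPipelinesOperator(
        task_id="validate_pipeline",
        pipeline_spec="/opt/pipelines/pipeline.yml",  # illustrative path
        pipeline_command="dry-run",
    )

    # Run the pipeline once validation succeeds.
    run = SparkPipelinesOperator(
        task_id="run_pipeline",
        pipeline_spec="/opt/pipelines/pipeline.yml",
        pipeline_command="run",
        conn_id="spark_default",
        executor_cores=2,
        executor_memory="2G",
        verbose=True,
    )

    validate >> run
```

Because `pipeline_spec`, `conf`, and `env_vars` are templated, Jinja expressions such as `{{ ds }}` can be used in those arguments.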

Parameters:
  • pipeline_spec (str | None) – Path to the pipeline specification file (YAML). (templated)

  • pipeline_command (str) – The spark-pipelines command to execute (‘run’, ‘dry-run’). Default is ‘run’.

  • conf (dict[Any, Any] | None) – Arbitrary Spark configuration properties (templated)

  • conn_id (str) – The Spark connection id as configured in Airflow administration. If an invalid connection id is supplied, the Spark master defaults to yarn.

  • num_executors (int | None) – Number of executors to launch

  • executor_cores (int | None) – Number of cores per executor (Default: 2)

  • executor_memory (str | None) – Memory per executor (e.g. 1000M, 2G) (Default: 1G)

  • driver_memory (str | None) – Memory allocated to the driver (e.g. 1000M, 2G) (Default: 1G)

  • verbose (bool) – Whether to pass the verbose flag to spark-pipelines process for debugging

  • env_vars (dict[str, Any] | None) – Environment variables for spark-pipelines. (templated)

  • deploy_mode (str | None) – Whether to deploy the driver on the worker nodes (cluster) or locally as an external client (client).

  • yarn_queue (str | None) – The name of the YARN queue to which the application is submitted.

  • keytab (str | None) – Full path to the file that contains the keytab (templated)

  • principal (str | None) – The name of the kerberos principal used for keytab (templated)

  • openlineage_inject_parent_job_info (bool) – Whether to inject OpenLineage parent job information

  • openlineage_inject_transport_info (bool) – Whether to inject OpenLineage transport information

template_fields: collections.abc.Sequence[str] = ('pipeline_spec', 'conf', 'env_vars', 'keytab', 'principal')[source]
pipeline_spec = None[source]
pipeline_command = 'run'[source]
conf = None[source]
num_executors = None[source]
executor_cores = None[source]
executor_memory = None[source]
driver_memory = None[source]
verbose = False[source]
env_vars = None[source]
deploy_mode = None[source]
yarn_queue = None[source]
keytab = None[source]
principal = None[source]
execute(context)[source]

Execute the SparkPipelinesHook to run the specified pipeline command.

on_kill()[source]

Override this method to clean up subprocesses when a task instance gets killed.

Any use of the threading, subprocess or multiprocessing module within an operator needs to be cleaned up, or it will leave ghost processes behind.

property hook: airflow.providers.apache.spark.hooks.spark_pipelines.SparkPipelinesHook[source]
