airflow.providers.apache.spark.operators.spark_pipelines¶
Classes¶
SparkPipelinesOperator – Execute Spark Declarative Pipelines using the spark-pipelines CLI.
Module Contents¶
- class airflow.providers.apache.spark.operators.spark_pipelines.SparkPipelinesOperator(*, pipeline_spec=None, pipeline_command='run', conf=None, conn_id='spark_default', num_executors=None, executor_cores=None, executor_memory=None, driver_memory=None, verbose=False, env_vars=None, deploy_mode=None, yarn_queue=None, keytab=None, principal=None, openlineage_inject_parent_job_info=conf.getboolean('openlineage', 'spark_inject_parent_job_info', fallback=False), openlineage_inject_transport_info=conf.getboolean('openlineage', 'spark_inject_transport_info', fallback=False), **kwargs)[source]¶
Bases: airflow.providers.common.compat.sdk.BaseOperator
Execute Spark Declarative Pipelines using the spark-pipelines CLI.
This operator wraps the spark-pipelines binary to execute declarative data pipelines. It supports running pipelines, dry-runs for validation, and initializing new pipeline projects.
See also
For more information on Spark Declarative Pipelines, see the guide: https://spark.apache.org/docs/latest/declarative-pipelines-programming-guide.html
- Parameters:
pipeline_spec (str | None) – Path to the pipeline specification file (YAML). (templated)
pipeline_command (str) – The spark-pipelines command to execute (‘run’, ‘dry-run’). Default is ‘run’.
conf (dict[Any, Any] | None) – Arbitrary Spark configuration properties (templated)
conn_id (str) – The Spark connection id as configured in Airflow administration. When an invalid conn_id is supplied, it will default to yarn.
num_executors (int | None) – Number of executors to launch
executor_cores (int | None) – Number of cores per executor (Default: 2)
executor_memory (str | None) – Memory per executor (e.g. 1000M, 2G) (Default: 1G)
driver_memory (str | None) – Memory allocated to the driver (e.g. 1000M, 2G) (Default: 1G)
verbose (bool) – Whether to pass the verbose flag to the spark-pipelines process for debugging
env_vars (dict[str, Any] | None) – Environment variables for spark-pipelines. (templated)
deploy_mode (str | None) – Whether to deploy your driver on the worker nodes (cluster) or locally as a client.
yarn_queue (str | None) – The name of the YARN queue to which the application is submitted.
keytab (str | None) – Full path to the file that contains the keytab (templated)
principal (str | None) – The name of the kerberos principal used for keytab (templated)
openlineage_inject_parent_job_info (bool) – Whether to inject OpenLineage parent job information
openlineage_inject_transport_info (bool) – Whether to inject OpenLineage transport information
- template_fields: collections.abc.Sequence[str] = ('pipeline_spec', 'conf', 'env_vars', 'keytab', 'principal')[source]¶
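As a rough illustration of what this operator does under the hood, the sketch below shows how parameters like pipeline_spec, conf, and num_executors might be assembled into a spark-pipelines command line. The helper name build_pipeline_command is hypothetical, and the exact flag names (modelled on spark-submit conventions, plus a --spec flag for the pipeline specification) are assumptions, not the provider's actual implementation.

```python
from __future__ import annotations

# Hypothetical sketch of mapping operator parameters onto a
# spark-pipelines CLI invocation. Flag names follow spark-submit
# conventions and are assumptions, not the provider's real code.
def build_pipeline_command(
    pipeline_command: str = "run",
    pipeline_spec: str | None = None,
    conf: dict[str, str] | None = None,
    num_executors: int | None = None,
    executor_memory: str | None = None,
    verbose: bool = False,
) -> list[str]:
    cmd = ["spark-pipelines", pipeline_command]
    if pipeline_spec:
        # Path to the YAML pipeline specification file
        cmd += ["--spec", pipeline_spec]
    for key, value in (conf or {}).items():
        # Arbitrary Spark configuration properties, one --conf per pair
        cmd += ["--conf", f"{key}={value}"]
    if num_executors is not None:
        cmd += ["--num-executors", str(num_executors)]
    if executor_memory:
        cmd += ["--executor-memory", executor_memory]
    if verbose:
        cmd.append("--verbose")
    return cmd


# Example: validate a pipeline spec without running it
print(build_pipeline_command(
    pipeline_command="dry-run",
    pipeline_spec="pipelines/sales.yml",
    conf={"spark.sql.shuffle.partitions": "64"},
))
```

Because pipeline_spec, conf, env_vars, keytab, and principal are template fields, their values can contain Jinja expressions that Airflow renders before the command is built.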