Apache Spark Operators

Prerequisite

To use SparkSubmitOperator (and SparkJDBCOperator, which builds on it), you must configure a Spark Connection. SparkJDBCOperator additionally requires a JDBC connection. SparkSqlOperator takes all of its configuration from operator parameters.

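As a minimal sketch, a Spark connection can be supplied through Airflow's AIRFLOW_CONN_<CONN_ID> environment-variable convention; the host and port below are placeholders for your own Spark master:

import os

# Hypothetical Spark master URI; Airflow parses AIRFLOW_CONN_SPARK_DEFAULT
# into a connection with conn_type "spark", host "spark-master", port 7077.
os.environ["AIRFLOW_CONN_SPARK_DEFAULT"] = "spark://spark-master:7077"
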
SparkJDBCOperator

Launches applications on an Apache Spark server. It extends SparkSubmitOperator to perform data transfers to/from JDBC-based databases.

For parameter definition take a look at SparkJDBCOperator.

Using the operator

Using the cmd_type parameter, it is possible to transfer data from Spark to a database (spark_to_jdbc) or from a database to Spark (jdbc_to_spark); in the latter case, Spark writes the data to a metastore table using the saveAsTable command.

tests/system/apache/spark/example_spark_dag.py

jdbc_to_spark_job = SparkJDBCOperator(
    cmd_type="jdbc_to_spark",
    jdbc_table="foo",
    spark_jars="${SPARK_HOME}/jars/postgresql-42.2.12.jar",
    jdbc_driver="org.postgresql.Driver",
    metastore_table="bar",
    save_mode="overwrite",
    save_format="JSON",
    task_id="jdbc_to_spark_job",
)

spark_to_jdbc_job = SparkJDBCOperator(
    cmd_type="spark_to_jdbc",
    jdbc_table="foo",
    spark_jars="${SPARK_HOME}/jars/postgresql-42.2.12.jar",
    jdbc_driver="org.postgresql.Driver",
    metastore_table="bar",
    save_mode="append",
    task_id="spark_to_jdbc_job",
)
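For completeness, a minimal sketch of the surrounding DAG these snippets assume; the import path matches the apache-spark provider, while the dag_id, schedule, and start date are illustrative:

from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_jdbc import SparkJDBCOperator

with DAG(dag_id="example_spark_jdbc", start_date=datetime(2024, 1, 1), schedule=None):
    # jdbc_to_spark_job and spark_to_jdbc_job from the snippets above
    # would be instantiated inside this context.
    ...
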

Reference

For further information, look at the Apache Spark DataFrameWriter documentation.

PySparkOperator

Launches applications on an Apache Spark Connect server, or directly in standalone mode.

For parameter definition take a look at PySparkOperator.

Using the operator

tests/system/apache/spark/example_spark_dag.py

def my_pyspark_job(spark):
    df = spark.range(100).filter("id % 2 = 0")
    print(df.count())

spark_pyspark_job = PySparkOperator(
    python_callable=my_pyspark_job, conn_id="spark_connect", task_id="spark_pyspark_job"
)
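The snippet assumes the import below (module path per the apache-spark provider); the operator injects an active SparkSession into the python_callable as its spark argument:

from airflow.providers.apache.spark.operators.pyspark import PySparkOperator
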

Reference

For further information, look at the Apache Spark Connect documentation.

SparkPipelinesOperator

Execute Spark Declarative Pipelines using the spark-pipelines CLI. This operator wraps the spark-pipelines binary to execute declarative data pipelines, supporting both pipeline execution and validation through dry-runs.

For parameter definition take a look at SparkPipelinesOperator.

Using the operator

The operator can be used to run declarative pipelines:

from airflow.providers.apache.spark.operators.spark_pipelines import SparkPipelinesOperator

# Execute the pipeline
run_pipeline = SparkPipelinesOperator(
    task_id="run_pipeline",
    pipeline_spec="/path/to/pipeline.yml",
    pipeline_command="run",
    conn_id="spark_default",
    num_executors=2,
    executor_cores=4,
    executor_memory="2G",
    driver_memory="1G",
)

Pipeline Specification

The pipeline_spec parameter should point to a YAML file defining your declarative pipeline:

name: my_pipeline
storage: file:///path/to/pipeline-storage
libraries:
  - glob:
      include: transformations/**

Pipeline Commands

  • run - Execute the pipeline (default)

  • dry-run - Validate the pipeline without executing it (see the sketch below)

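The same operator can validate a pipeline before running it; a short sketch, mirroring the example above with only pipeline_command changed:

validate_pipeline = SparkPipelinesOperator(
    task_id="validate_pipeline",
    pipeline_spec="/path/to/pipeline.yml",
    pipeline_command="dry-run",
    conn_id="spark_default",
)
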
Reference

For further information, look at Spark Declarative Pipelines Programming Guide.

SparkSqlOperator

Launches applications on an Apache Spark server; it requires that the spark-sql script is in the PATH. The operator runs the SQL query against the Spark Hive metastore service. The sql parameter can be templated and may point to a .sql or .hql file.

For parameter definition take a look at SparkSqlOperator.

Using the operator

tests/system/apache/spark/example_spark_dag.py

spark_sql_job = SparkSqlOperator(
    sql="SELECT COUNT(1) as cnt FROM temp_table", master="local", task_id="spark_sql_job"
)
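Because sql is a templated field, the query can also live in a file; a minimal sketch, assuming a hypothetical queries/count_temp_table.sql in the DAG folder (paths ending in .sql or .hql are rendered as templates):

spark_sql_file_job = SparkSqlOperator(
    sql="queries/count_temp_table.sql",  # hypothetical template file
    master="local",
    task_id="spark_sql_file_job",
)
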

Reference

For further information, look at Running the Spark SQL CLI.

SparkSubmitOperator

Launches applications on an Apache Spark server. It uses the spark-submit script, which takes care of setting up the classpath with Spark and its dependencies, and can support the different cluster managers and deploy modes that Spark supports.

For parameter definition take a look at SparkSubmitOperator.

Using the operator

tests/system/apache/spark/example_spark_dag.py

submit_job = SparkSubmitOperator(
    application="${SPARK_HOME}/examples/src/main/python/pi.py", task_id="submit_job"
)
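A slightly fuller sketch with illustrative values for commonly used parameters; the application path reuses the Pi example above, and the single application argument is that example's partition count:

submit_pi_job = SparkSubmitOperator(
    application="${SPARK_HOME}/examples/src/main/python/pi.py",
    conn_id="spark_default",
    application_args=["10"],  # partitions for the Pi example
    executor_cores=2,
    executor_memory="2G",
    driver_memory="1G",
    name="pi-example",
    verbose=True,
    task_id="submit_pi_job",
)
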

Reference

For further information, look at the Apache Spark documentation on submitting applications.
