airflow.providers.yandex.operators.dataproc

Module Contents

Classes

InitializationAction

Data for initialization action to be run at start of DataProc cluster.

DataprocCreateClusterOperator

Creates Yandex.Cloud Data Proc cluster.

DataprocBaseOperator

Base class for DataProc operators working with given cluster.

DataprocDeleteClusterOperator

Deletes Yandex.Cloud Data Proc cluster.

DataprocCreateHiveJobOperator

Runs Hive job in Data Proc cluster.

DataprocCreateMapReduceJobOperator

Runs Mapreduce job in Data Proc cluster.

DataprocCreateSparkJobOperator

Runs Spark job in Data Proc cluster.

DataprocCreatePysparkJobOperator

Runs Pyspark job in Data Proc cluster.

class airflow.providers.yandex.operators.dataproc.InitializationAction[source]

Data for initialization action to be run at start of DataProc cluster.

uri: str[source]
args: collections.abc.Iterable[str][source]
timeout: int[source]
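
A minimal sketch of constructing an InitializationAction and passing it to a cluster create operator. The dataclass is built with its three documented fields; the script URI and arguments below are placeholders, not values taken from this page.

from airflow.providers.yandex.operators.dataproc import InitializationAction

# Placeholder bootstrap script; point the URI at a real object in HDFS or S3.
bootstrap = InitializationAction(
    uri="s3a://my-bucket/scripts/bootstrap.sh",
    args=["--verbose"],
    timeout=600,  # seconds allowed for the action to finish
)

# Later passed to the cluster create operator, e.g.:
#   DataprocCreateClusterOperator(..., initialization_actions=[bootstrap])
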
class airflow.providers.yandex.operators.dataproc.DataprocCreateClusterOperator(*, folder_id=None, cluster_name=None, cluster_description='', cluster_image_version=None, ssh_public_keys=None, subnet_id=None, services=('HDFS', 'YARN', 'MAPREDUCE', 'HIVE', 'SPARK'), s3_bucket=None, zone='ru-central1-b', service_account_id=None, masternode_resource_preset=None, masternode_disk_size=None, masternode_disk_type=None, datanode_resource_preset=None, datanode_disk_size=None, datanode_disk_type=None, datanode_count=1, computenode_resource_preset=None, computenode_disk_size=None, computenode_disk_type=None, computenode_count=0, computenode_max_hosts_count=None, computenode_measurement_duration=None, computenode_warmup_duration=None, computenode_stabilization_duration=None, computenode_preemptible=False, computenode_cpu_utilization_target=None, computenode_decommission_timeout=None, connection_id=None, properties=None, enable_ui_proxy=False, host_group_ids=None, security_group_ids=None, log_group_id=None, initialization_actions=None, labels=None, **kwargs)[source]

Bases: airflow.models.BaseOperator

Creates Yandex.Cloud Data Proc cluster.

Parameters
  • folder_id (str | None) – ID of the folder in which the cluster should be created.

  • cluster_name (str | None) – Cluster name. Must be unique inside the folder.

  • cluster_description (str | None) – Cluster description.

  • cluster_image_version (str | None) – Cluster image version. If not set, the default image version is used.

  • ssh_public_keys (str | collections.abc.Iterable[str] | None) – List of SSH public keys that will be deployed to created compute instances.

  • subnet_id (str | None) – ID of the subnetwork. All Data Proc cluster nodes will use one subnetwork.

  • services (collections.abc.Iterable[str]) – List of services that will be installed to the cluster. Possible options: HDFS, YARN, MAPREDUCE, HIVE, TEZ, ZOOKEEPER, HBASE, SQOOP, FLUME, SPARK, ZEPPELIN, OOZIE.

  • s3_bucket (str | None) – Yandex.Cloud S3 bucket to store cluster logs. Jobs will not work if the bucket is not specified.

  • zone (str) – Availability zone to create the cluster in. Currently ru-central1-a, ru-central1-b, and ru-central1-c are available.

  • service_account_id (str | None) – Service account id for the cluster. Service account can be created inside the folder.

  • masternode_resource_preset (str | None) – Resources preset (CPU+RAM configuration) for the primary node of the cluster.

  • masternode_disk_size (int | None) – Masternode storage size in GiB.

  • masternode_disk_type (str | None) – Masternode storage type. Possible options: network-ssd, network-hdd.

  • datanode_resource_preset (str | None) – Resources preset (CPU+RAM configuration) for the data nodes of the cluster.

  • datanode_disk_size (int | None) – Datanodes storage size in GiB.

  • datanode_disk_type (str | None) – Datanodes storage type. Possible options: network-ssd, network-hdd.

  • computenode_resource_preset (str | None) – Resources preset (CPU+RAM configuration) for the compute nodes of the cluster.

  • computenode_disk_size (int | None) – Computenodes storage size in GiB.

  • computenode_disk_type (str | None) – Computenodes storage type. Possible options: network-ssd, network-hdd.

  • connection_id (str | None) – ID of the Yandex.Cloud Airflow connection.

  • computenode_max_hosts_count (int | None) – Maximum number of hosts in the compute autoscaling subcluster.

  • computenode_warmup_duration (int | None) – The warmup time of the instance, in seconds. During this time, traffic is sent to the instance, but instance metrics are not collected.

  • computenode_stabilization_duration (int | None) – Minimum amount of time, in seconds, of monitoring before Instance Groups can reduce the number of instances in the group. During this time, the group size doesn’t decrease, even if the new metric values indicate that it should.

  • computenode_preemptible (bool) – Preemptible instances are stopped at least once every 24 hours, and can be stopped at any time if their resources are needed by Compute.

  • computenode_cpu_utilization_target (int | None) – Defines an autoscaling rule based on the average CPU utilization of the instance group, in percent (10-100). Not set by default, in which case the default autoscaling strategy is used.

  • computenode_decommission_timeout (int | None) – Timeout in seconds for gracefully decommissioning nodes during downscaling.

  • properties (dict[str, str] | None) – Properties passed to main node software. Docs: https://cloud.yandex.com/docs/data-proc/concepts/settings-list

  • enable_ui_proxy (bool) – Enable the UI Proxy feature for forwarding web interfaces of Hadoop components. Docs: https://cloud.yandex.com/docs/data-proc/concepts/ui-proxy

  • host_group_ids (collections.abc.Iterable[str] | None) – Dedicated host groups to place the cluster’s VMs on. Docs: https://cloud.yandex.com/docs/compute/concepts/dedicated-host

  • security_group_ids (collections.abc.Iterable[str] | None) – User security groups. Docs: https://cloud.yandex.com/docs/data-proc/concepts/network#security-groups

  • log_group_id (str | None) – ID of the log group to write logs to. By default, logs are sent to the default log group. To disable cloud log delivery, set the cluster property dataproc:disable_cloud_logging = true. Docs: https://cloud.yandex.com/docs/data-proc/concepts/logs

  • initialization_actions (collections.abc.Iterable[InitializationAction] | None) – Set of init-actions to run when the cluster starts. Docs: https://cloud.yandex.com/docs/data-proc/concepts/init-action

  • labels (dict[str, str] | None) – Cluster labels as key:value pairs. No more than 64 per resource. Docs: https://cloud.yandex.com/docs/resource-manager/concepts/labels

property cluster_id[source]
execute(context)[source]

Derive when creating an operator.

Context is the same dictionary used as when rendering jinja templates.

Refer to get_template_context for more context.
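
A minimal DAG sketch using DataprocCreateClusterOperator, assuming a recent Airflow 2.x DAG API. Every ID, the SSH key, and the bucket name are placeholders; connection_id is omitted here, which is expected to fall back to the provider's default Yandex.Cloud connection.

from datetime import datetime

from airflow import DAG
from airflow.providers.yandex.operators.dataproc import DataprocCreateClusterOperator

with DAG(
    dag_id="example_yandex_dataproc",
    start_date=datetime(2024, 1, 1),
    schedule=None,
) as dag:
    create_cluster = DataprocCreateClusterOperator(
        task_id="create_cluster",
        folder_id="b1g0000000000000000",                  # placeholder folder ID
        cluster_name="airflow-dataproc",
        ssh_public_keys="ssh-ed25519 AAAA... user@host",  # placeholder public key
        subnet_id="e9b0000000000000000",                  # placeholder subnet ID
        s3_bucket="my-dataproc-logs",                     # placeholder bucket for logs and job output
        zone="ru-central1-b",
        service_account_id="aje0000000000000000",         # placeholder service account ID
        datanode_count=1,
        computenode_count=1,
    )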

class airflow.providers.yandex.operators.dataproc.DataprocBaseOperator(*, yandex_conn_id=None, cluster_id=None, **kwargs)[source]

Bases: airflow.models.BaseOperator

Base class for DataProc operators working with given cluster.

Parameters
  • yandex_conn_id (str | None) – ID of the Yandex.Cloud Airflow connection.

  • cluster_id (str | None) – ID of the cluster to operate on. (templated)

template_fields: collections.abc.Sequence[str] = ('cluster_id',)[source]
abstract execute(context)[source]

Derive when creating an operator.

Context is the same dictionary used as when rendering jinja templates.

Refer to get_template_context for more context.
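
Because execute is abstract, DataprocBaseOperator is subclassed rather than used directly. A hypothetical minimal subclass (the class name and behaviour below are illustrative only) might look like this:

from airflow.providers.yandex.operators.dataproc import DataprocBaseOperator


class DataprocLogClusterIdOperator(DataprocBaseOperator):
    # Hypothetical operator that only logs the (templated) cluster_id it was given.
    def execute(self, context):
        self.log.info("Working with Data Proc cluster %s", self.cluster_id)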

class airflow.providers.yandex.operators.dataproc.DataprocDeleteClusterOperator(*, connection_id=None, cluster_id=None, **kwargs)[source]

Bases: DataprocBaseOperator

Deletes Yandex.Cloud Data Proc cluster.

Parameters
  • connection_id (str | None) – ID of the Yandex.Cloud Airflow connection.

  • cluster_id (str | None) – ID of the cluster to remove. (templated)

execute(context)[source]

Derive when creating an operator.

Context is the same dictionary used as when rendering jinja templates.

Refer to get_template_context for more context.
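
A minimal sketch of tearing down a cluster with DataprocDeleteClusterOperator. The connection and cluster IDs are placeholders; since cluster_id is templated, a Jinja expression can be passed instead of a literal ID.

from airflow.providers.yandex.operators.dataproc import DataprocDeleteClusterOperator

delete_cluster = DataprocDeleteClusterOperator(
    task_id="delete_cluster",
    connection_id="yandexcloud_default",   # assumed connection ID
    cluster_id="c9q0000000000000000",      # placeholder; a "{{ ... }}" template also works here
)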

class airflow.providers.yandex.operators.dataproc.DataprocCreateHiveJobOperator(*, query=None, query_file_uri=None, script_variables=None, continue_on_failure=False, properties=None, name='Hive job', cluster_id=None, connection_id=None, **kwargs)[source]

Bases: DataprocBaseOperator

Runs Hive job in Data Proc cluster.

Parameters
  • query (str | None) – Hive query.

  • query_file_uri (str | None) – URI of the script that contains Hive queries. Can be placed in HDFS or S3.

  • properties (dict[str, str] | None) – A mapping of property names to values, used to configure Hive.

  • script_variables (dict[str, str] | None) – Mapping of query variable names to values.

  • continue_on_failure (bool) – Whether to continue executing queries if a query fails.

  • name (str) – Name of the job. Used for labeling.

  • cluster_id (str | None) – ID of the cluster to run the job in. If not specified, the operator will try to take the ID from the Dataproc hook object. (templated)

  • connection_id (str | None) – ID of the Yandex.Cloud Airflow connection.

execute(context)[source]

Derive when creating an operator.

Context is the same dictionary used as when rendering jinja templates.

Refer to get_template_context for more context.
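
A minimal sketch of submitting an inline Hive query with DataprocCreateHiveJobOperator. The cluster and connection IDs are placeholders; per the parameter notes above, cluster_id may be left unset if the operator can resolve it from the Dataproc hook.

from airflow.providers.yandex.operators.dataproc import DataprocCreateHiveJobOperator

hive_job = DataprocCreateHiveJobOperator(
    task_id="hive_show_databases",
    query="SHOW DATABASES;",               # inline query; query_file_uri is the alternative
    cluster_id="c9q0000000000000000",      # placeholder cluster ID
    connection_id="yandexcloud_default",   # assumed connection ID
)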

class airflow.providers.yandex.operators.dataproc.DataprocCreateMapReduceJobOperator(*, main_class=None, main_jar_file_uri=None, jar_file_uris=None, archive_uris=None, file_uris=None, args=None, properties=None, name='Mapreduce job', cluster_id=None, connection_id=None, **kwargs)[source]

Bases: DataprocBaseOperator

Runs Mapreduce job in Data Proc cluster.

Parameters
  • main_jar_file_uri (str | None) – URI of the JAR file with the job. Can be placed in HDFS or S3. Can be specified instead of main_class.

  • main_class (str | None) – Name of the main class of the job. Can be specified instead of main_jar_file_uri.

  • file_uris (collections.abc.Iterable[str] | None) – URIs of files used in the job. Can be placed in HDFS or S3.

  • archive_uris (collections.abc.Iterable[str] | None) – URIs of archive files used in the job. Can be placed in HDFS or S3.

  • jar_file_uris (collections.abc.Iterable[str] | None) – URIs of JAR files used in the job. Can be placed in HDFS or S3.

  • properties (dict[str, str] | None) – Properties for the job.

  • args (collections.abc.Iterable[str] | None) – Arguments to be passed to the job.

  • name (str) – Name of the job. Used for labeling.

  • cluster_id (str | None) – ID of the cluster to run the job in. If not specified, the operator will try to take the ID from the Dataproc hook object. (templated)

  • connection_id (str | None) – ID of the Yandex.Cloud Airflow connection.

execute(context)[source]

Derive when creating an operator.

Context is the same dictionary used as when rendering jinja templates.

Refer to get_template_context for more context.
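
A minimal sketch of a Hadoop Streaming MapReduce job submitted with DataprocCreateMapReduceJobOperator. The main class is the standard Hadoop Streaming entry point, while all URIs, arguments, and properties are placeholders.

from airflow.providers.yandex.operators.dataproc import DataprocCreateMapReduceJobOperator

mapreduce_job = DataprocCreateMapReduceJobOperator(
    task_id="run_mapreduce_job",
    main_class="org.apache.hadoop.streaming.HadoopStreaming",  # Hadoop Streaming entry point
    file_uris=[
        "s3a://my-bucket/scripts/mapper.py",                    # placeholder mapper script
        "s3a://my-bucket/scripts/reducer.py",                   # placeholder reducer script
    ],
    args=[
        "-mapper", "mapper.py",
        "-reducer", "reducer.py",
        "-input", "s3a://my-bucket/in",
        "-output", "s3a://my-bucket/out",
    ],
    properties={"mapreduce.job.maps": "2"},                     # placeholder job property
)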

class airflow.providers.yandex.operators.dataproc.DataprocCreateSparkJobOperator(*, main_class=None, main_jar_file_uri=None, jar_file_uris=None, archive_uris=None, file_uris=None, args=None, properties=None, name='Spark job', cluster_id=None, connection_id=None, packages=None, repositories=None, exclude_packages=None, **kwargs)[source]

Bases: DataprocBaseOperator

Runs Spark job in Data Proc cluster.

Parameters
  • main_jar_file_uri (str | None) – URI of the JAR file with the job. Can be placed in HDFS or S3.

  • main_class (str | None) – Name of the main class of the job.

  • file_uris (collections.abc.Iterable[str] | None) – URIs of files used in the job. Can be placed in HDFS or S3.

  • archive_uris (collections.abc.Iterable[str] | None) – URIs of archive files used in the job. Can be placed in HDFS or S3.

  • jar_file_uris (collections.abc.Iterable[str] | None) – URIs of JAR files used in the job. Can be placed in HDFS or S3.

  • properties (dict[str, str] | None) – Properties for the job.

  • args (collections.abc.Iterable[str] | None) – Arguments to be passed to the job.

  • name (str) – Name of the job. Used for labeling.

  • cluster_id (str | None) – ID of the cluster to run the job in. If not specified, the operator will try to take the ID from the Dataproc hook object. (templated)

  • connection_id (str | None) – ID of the Yandex.Cloud Airflow connection.

  • packages (collections.abc.Iterable[str] | None) – List of maven coordinates of jars to include on the driver and executor classpaths.

  • repositories (collections.abc.Iterable[str] | None) – List of additional remote repositories to search for the maven coordinates given with --packages.

  • exclude_packages (collections.abc.Iterable[str] | None) – List of groupId:artifactId entries to exclude while resolving the dependencies provided in --packages, to avoid dependency conflicts.

execute(context)[source]

Derive when creating an operator.

Context is the same dictionary used as when rendering jinja templates.

Refer to get_template_context for more context.
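
A minimal sketch of a Spark job submitted with DataprocCreateSparkJobOperator. The JAR URI, main class, arguments, and Maven coordinate are placeholders.

from airflow.providers.yandex.operators.dataproc import DataprocCreateSparkJobOperator

spark_job = DataprocCreateSparkJobOperator(
    task_id="run_spark_job",
    main_jar_file_uri="s3a://my-bucket/jobs/spark-job.jar",   # placeholder JAR in S3
    main_class="com.example.SparkMain",                       # placeholder main class
    args=["--input", "s3a://my-bucket/in", "--output", "s3a://my-bucket/out"],
    packages=["org.slf4j:slf4j-simple:1.7.30"],               # example Maven coordinate
)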

class airflow.providers.yandex.operators.dataproc.DataprocCreatePysparkJobOperator(*, main_python_file_uri=None, python_file_uris=None, jar_file_uris=None, archive_uris=None, file_uris=None, args=None, properties=None, name='Pyspark job', cluster_id=None, connection_id=None, packages=None, repositories=None, exclude_packages=None, **kwargs)[source]

Bases: DataprocBaseOperator

Runs Pyspark job in Data Proc cluster.

Parameters
  • main_python_file_uri (str | None) – URI of the Python file with the job. Can be placed in HDFS or S3.

  • python_file_uris (collections.abc.Iterable[str] | None) – URIs of Python files used in the job. Can be placed in HDFS or S3.

  • file_uris (collections.abc.Iterable[str] | None) – URIs of files used in the job. Can be placed in HDFS or S3.

  • archive_uris (collections.abc.Iterable[str] | None) – URIs of archive files used in the job. Can be placed in HDFS or S3.

  • jar_file_uris (collections.abc.Iterable[str] | None) – URIs of JAR files used in the job. Can be placed in HDFS or S3.

  • properties (dict[str, str] | None) – Properties for the job.

  • args (collections.abc.Iterable[str] | None) – Arguments to be passed to the job.

  • name (str) – Name of the job. Used for labeling.

  • cluster_id (str | None) – ID of the cluster to run the job in. If not specified, the operator will try to take the ID from the Dataproc hook object. (templated)

  • connection_id (str | None) – ID of the Yandex.Cloud Airflow connection.

  • packages (collections.abc.Iterable[str] | None) – List of maven coordinates of jars to include on the driver and executor classpaths.

  • repositories (collections.abc.Iterable[str] | None) – List of additional remote repositories to search for the maven coordinates given with --packages.

  • exclude_packages (collections.abc.Iterable[str] | None) – List of groupId:artifactId entries to exclude while resolving the dependencies provided in --packages, to avoid dependency conflicts.

execute(context)[source]

Derive when creating an operator.

Context is the same dictionary used as when rendering jinja templates.

Refer to get_template_context for more context.
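
A minimal sketch of a PySpark job submitted with DataprocCreatePysparkJobOperator. The script and dependency URIs, as well as the arguments, are placeholders.

from airflow.providers.yandex.operators.dataproc import DataprocCreatePysparkJobOperator

pyspark_job = DataprocCreatePysparkJobOperator(
    task_id="run_pyspark_job",
    main_python_file_uri="s3a://my-bucket/jobs/main.py",      # placeholder entry-point script
    python_file_uris=["s3a://my-bucket/jobs/helpers.py"],     # placeholder Python dependency
    args=["--date", "2024-01-01"],                            # placeholder job arguments
)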
