airflow.providers.google.cloud.triggers.dataproc

This module contains Google Dataproc triggers.

Module Contents

Classes

DataprocBaseTrigger

Base class for Dataproc triggers.

DataprocSubmitTrigger

DataprocSubmitTrigger runs on the trigger worker and polls the status of a submitted Dataproc job.

DataprocClusterTrigger

DataprocClusterTrigger runs on the trigger worker and polls the status of a Dataproc cluster creation operation.

DataprocBatchTrigger

DataprocBatchTrigger runs on the trigger worker and polls the status of a Dataproc batch workload.

DataprocDeleteClusterTrigger

DataprocDeleteClusterTrigger runs on the trigger worker and waits for a Dataproc cluster to be deleted.

DataprocOperationTrigger

Trigger that periodically polls a long-running Dataproc API operation to verify its status.

class airflow.providers.google.cloud.triggers.dataproc.DataprocBaseTrigger(region, project_id=PROVIDE_PROJECT_ID, gcp_conn_id='google_cloud_default', impersonation_chain=None, polling_interval_seconds=30, cancel_on_kill=True, delete_on_error=True)[source]

Bases: airflow.triggers.base.BaseTrigger

Base class for Dataproc triggers.

get_async_hook()[source]
get_sync_hook()[source]
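
These triggers back the provider's deferrable Dataproc operators. A minimal sketch of how an operator can hand its wait over to one of them (the operator class, identifiers, and event payload keys are illustrative assumptions, not part of this module):

from airflow.models import BaseOperator
from airflow.providers.google.cloud.triggers.dataproc import DataprocSubmitTrigger


class ExampleDataprocOperator(BaseOperator):  # hypothetical operator
    def execute(self, context):
        # Defer instead of blocking a worker slot while the job runs.
        self.defer(
            trigger=DataprocSubmitTrigger(
                job_id="example-job-id",  # assumed identifiers
                region="us-central1",
                project_id="example-project",
                polling_interval_seconds=30,
            ),
            method_name="execute_complete",
        )

    def execute_complete(self, context, event=None):
        # Called on a worker once the trigger fires; the payload keys
        # used here are assumptions.
        if event and event.get("job_state") == "ERROR":
            raise RuntimeError(f"Dataproc job failed: {event}")
        return event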
class airflow.providers.google.cloud.triggers.dataproc.DataprocSubmitTrigger(job_id, **kwargs)[source]

Bases: DataprocBaseTrigger

DataprocSubmitTrigger runs on the trigger worker and polls the status of a submitted Dataproc job.

Parameters
  • job_id (str) – The ID of a Dataproc job.

  • project_id – The Google Cloud project where the job is running.

  • region – The Cloud Dataproc region in which to handle the request.

  • gcp_conn_id – Optional, the connection ID used to connect to Google Cloud Platform.

  • impersonation_chain – Optional service account to impersonate using short-term credentials, or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated).

  • polling_interval_seconds – The interval, in seconds, between status checks.
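
The two accepted forms of impersonation_chain look like this in practice (account and project names are assumptions):

from airflow.providers.google.cloud.triggers.dataproc import DataprocSubmitTrigger

# A single string: this account must grant the originating account the
# Service Account Token Creator role.
trigger = DataprocSubmitTrigger(
    job_id="example-job-id",
    region="us-central1",
    project_id="example-project",
    impersonation_chain="target-sa@example-project.iam.gserviceaccount.com",
)

# A sequence: each identity must grant the role to the identity directly
# preceding it, and the first one to the originating account.
trigger = DataprocSubmitTrigger(
    job_id="example-job-id",
    region="us-central1",
    project_id="example-project",
    impersonation_chain=[
        "intermediate-sa@example-project.iam.gserviceaccount.com",
        "target-sa@example-project.iam.gserviceaccount.com",
    ],
)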

serialize()[source]

Return the information needed to reconstruct this Trigger.

Returns

Tuple of (class path, keyword arguments needed to re-instantiate).
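
This tuple is what lets the triggerer rebuild the trigger from its constructor arguments. A short sketch (identifiers are assumptions):

from airflow.providers.google.cloud.triggers.dataproc import DataprocSubmitTrigger

trigger = DataprocSubmitTrigger(
    job_id="example-job-id",
    region="us-central1",
    project_id="example-project",
)
classpath, kwargs = trigger.serialize()
# classpath is the dotted path of this class; kwargs mirrors the
# constructor arguments, so the trigger can be re-instantiated as
# DataprocSubmitTrigger(**kwargs).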

get_task_instance(session)[source]

Get the task instance for the current task.

Parameters

session (sqlalchemy.orm.session.Session) – SQLAlchemy session

safe_to_cancel()[source]

Whether it is safe to cancel the external job being executed by this trigger.

This guards against the case where asyncio.CancelledError is raised only because the trigger itself is being stopped (for example, when the triggerer shuts down or moves the workload); in that case the external job should NOT be cancelled.
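
One way to realize such a check (an assumption about the approach, not a copy of the provider's code) is to inspect the state of the trigger's task instance: if the task is no longer DEFERRED, the stop was user-initiated and cancelling the external job is safe.

from airflow.utils.state import TaskInstanceState


def is_safe_to_cancel(task_instance_state) -> bool:  # hypothetical helper
    # A task still in DEFERRED state means the trigger is most likely
    # being stopped by the triggerer itself (shutdown or rebalance), so
    # the external Dataproc job should be left running.
    return task_instance_state != TaskInstanceState.DEFERRED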

async run()[source]

Run the trigger in an asynchronous context.

The trigger should yield an Event whenever it wants to fire off an event, and return None if it is finished. Single-event triggers should thus yield and then immediately return.

If it yields, it is likely that it will be resumed very quickly, but it may not be (e.g. if the workload is being moved to another triggerer process, or a multi-event trigger was being used for a single-event task defer).

In either case, Trigger classes should assume they will be persisted, and then rely on cleanup() being called when they are no longer needed.
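
A minimal, hypothetical single-event trigger illustrating this contract (it is not the implementation of DataprocSubmitTrigger; the class name, interval, and event payload are assumptions):

import asyncio

from airflow.triggers.base import BaseTrigger, TriggerEvent


class ExamplePollingTrigger(BaseTrigger):  # hypothetical trigger
    def __init__(self, poke_interval: float = 30.0):
        super().__init__()
        self.poke_interval = poke_interval

    def serialize(self):
        classpath = f"{type(self).__module__}.{type(self).__qualname__}"
        return classpath, {"poke_interval": self.poke_interval}

    async def run(self):
        while True:
            if await self._is_done():
                # Terminal state reached: emit a single event, then
                # return so the trigger is finished.
                yield TriggerEvent({"status": "success"})
                return
            await asyncio.sleep(self.poke_interval)

    async def _is_done(self) -> bool:
        # A real trigger would poll an external API here; the Dataproc
        # triggers do so through their async hook.
        return True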

class airflow.providers.google.cloud.triggers.dataproc.DataprocClusterTrigger(cluster_name, **kwargs)[source]

Bases: DataprocBaseTrigger

DataprocClusterTrigger runs on the trigger worker and polls the status of a Dataproc cluster creation operation.

Parameters
  • cluster_name (str) – The name of the cluster.

  • project_id – The Google Cloud project the cluster belongs to.

  • region – The Cloud Dataproc region in which to handle the request.

  • gcp_conn_id – Optional, the connection ID used to connect to Google Cloud Platform.

  • impersonation_chain – Optional service account to impersonate using short-term credentials, or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated).

  • polling_interval_seconds – The interval, in seconds, between status checks.

serialize()[source]

Return the information needed to reconstruct this Trigger.

Returns

Tuple of (class path, keyword arguments needed to re-instantiate).

Return type

tuple[str, dict[str, Any]]

get_task_instance(session)[source]

Get the task instance for the current task.

safe_to_cancel()[source]

Whether it is safe to cancel the external job being executed by this trigger.

This guards against the case where asyncio.CancelledError is raised only because the trigger itself is being stopped (for example, when the triggerer shuts down or moves the workload); in that case the external job should NOT be cancelled.

async run()[source]

Run the trigger in an asynchronous context.

The trigger should yield an Event whenever it wants to fire off an event, and return None if it is finished. Single-event triggers should thus yield and then immediately return.

If it yields, it is likely that it will be resumed very quickly, but it may not be (e.g. if the workload is being moved to another triggerer process, or a multi-event trigger was being used for a single-event task defer).

In either case, Trigger classes should assume they will be persisted, and then rely on cleanup() being called when they are no longer needed.

async fetch_cluster()[source]

Fetch the cluster status.

async delete_when_error_occurred(cluster)[source]

Delete the cluster on error.

Parameters

cluster (google.cloud.dataproc_v1.Cluster) – The cluster to delete.

class airflow.providers.google.cloud.triggers.dataproc.DataprocBatchTrigger(batch_id, **kwargs)[source]

Bases: DataprocBaseTrigger

DataprocBatchTrigger runs on the trigger worker and polls the status of a Dataproc batch workload.

Parameters
  • batch_id (str) – The ID of the batch.

  • project_id – The Google Cloud project the batch belongs to.

  • region – The Cloud Dataproc region in which to handle the request.

  • gcp_conn_id – Optional, the connection ID used to connect to Google Cloud Platform.

  • impersonation_chain – Optional service account to impersonate using short-term credentials, or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated).

  • polling_interval_seconds – The interval, in seconds, between status checks.

serialize()[source]

Serialize DataprocBatchTrigger arguments and classpath.

async run()[source]

Run the trigger in an asynchronous context.

The trigger should yield an Event whenever it wants to fire off an event, and return None if it is finished. Single-event triggers should thus yield and then immediately return.

If it yields, it is likely that it will be resumed very quickly, but it may not be (e.g. if the workload is being moved to another triggerer process, or a multi-event trigger was being used for a single-event task defer).

In either case, Trigger classes should assume they will be persisted, and then rely on cleanup() being called when they are no longer needed.

class airflow.providers.google.cloud.triggers.dataproc.DataprocDeleteClusterTrigger(cluster_name, end_time, metadata=(), **kwargs)[source]

Bases: DataprocBaseTrigger

DataprocDeleteClusterTrigger runs on the trigger worker and waits for a Dataproc cluster to be deleted.

Parameters
  • cluster_name (str) – The name of the cluster

  • end_time (float) – The point in time (seconds since the epoch) until which the trigger keeps checking the cluster status

  • project_id – The ID of the Google Cloud project the cluster belongs to

  • region – The Cloud Dataproc region in which to handle the request

  • metadata (collections.abc.Sequence[tuple[str, str]]) – Additional metadata that is provided to the method

  • gcp_conn_id – The connection ID to use when fetching connection info.

  • impersonation_chain – Optional service account to impersonate using short-term credentials, or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account.

  • polling_interval_seconds – Time in seconds to sleep between checks of cluster status
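
Since end_time is an absolute point in time rather than a duration, callers typically derive it from a relative timeout. An illustrative sketch (the timeout and identifiers are assumptions):

import time

from airflow.providers.google.cloud.triggers.dataproc import DataprocDeleteClusterTrigger

timeout = 300  # give up after five minutes (assumed value)
trigger = DataprocDeleteClusterTrigger(
    cluster_name="example-cluster",
    region="us-central1",
    project_id="example-project",
    end_time=time.time() + timeout,
)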

serialize()[source]

Serialize DataprocDeleteClusterTrigger arguments and classpath.

async run()[source]

Wait until the cluster is completely deleted.

class airflow.providers.google.cloud.triggers.dataproc.DataprocOperationTrigger(name, operation_type=None, **kwargs)[source]

Bases: DataprocBaseTrigger

Trigger that periodically polls a long-running Dataproc API operation to verify its status.

Implementation leverages asynchronous transport.
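
A hypothetical instantiation (the resource name follows the standard Dataproc operation path; all identifiers are assumptions):

from airflow.providers.google.cloud.triggers.dataproc import DataprocOperationTrigger

trigger = DataprocOperationTrigger(
    name="projects/example-project/regions/us-central1/operations/example-op",
    region="us-central1",
    project_id="example-project",
)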

serialize()[source]

Return the information needed to reconstruct this Trigger.

Returns

Tuple of (class path, keyword arguments needed to re-instantiate).

async run()[source]

Run the trigger in an asynchronous context.

The trigger should yield an Event whenever it wants to fire off an event, and return None if it is finished. Single-event triggers should thus yield and then immediately return.

If it yields, it is likely that it will be resumed very quickly, but it may not be (e.g. if the workload is being moved to another triggerer process, or a multi-event trigger was being used for a single-event task defer).

In either case, Trigger classes should assume they will be persisted, and then rely on cleanup() being called when they are no longer needed.
