airflow.providers.google.cloud.hooks.dataproc
¶
This module contains a Google Cloud Dataproc hook.
Module Contents¶
Classes¶
A helper class for building Dataproc job. |
|
Google Cloud Dataproc APIs. |
|
Asynchronous interaction with Google Cloud Dataproc APIs. |
- exception airflow.providers.google.cloud.hooks.dataproc.DataprocResourceIsNotReadyError[source]¶
Bases:
airflow.exceptions.AirflowException
Raise when resource is not ready for create Dataproc cluster.
- class airflow.providers.google.cloud.hooks.dataproc.DataProcJobBuilder(project_id, task_id, cluster_name, job_type, properties=None)[source]¶
A helper class for building Dataproc job.
- add_labels(labels=None)[source]¶
Set labels for Dataproc job.
- Parameters
labels (dict | None) – Labels for the job query.
- add_variables(variables=None)[source]¶
Set variables for Dataproc job.
- Parameters
variables (dict | None) – Variables for the job query.
- add_query_uri(query_uri)[source]¶
Set query uri for Dataproc job.
- Parameters
query_uri (str) – URI for the job query.
- set_python_main(main)[source]¶
Set Dataproc main python file uri.
- Parameters
main (str) – URI for the python main file.
- class airflow.providers.google.cloud.hooks.dataproc.DataprocHook(gcp_conn_id='google_cloud_default', impersonation_chain=None, **kwargs)[source]¶
Bases:
airflow.providers.google.common.hooks.base_google.GoogleBaseHook
Google Cloud Dataproc APIs.
All the methods in the hook where project_id is used must be called with keyword arguments rather than positional.
- dataproc_options_to_args(options)[source]¶
Return a formatted cluster parameters from a dictionary of arguments.
- wait_for_operation(operation, timeout=None, result_retry=DEFAULT)[source]¶
Wait for a long-lasting operation to complete.
- create_cluster(region, project_id, cluster_name, cluster_config=None, virtual_cluster_config=None, labels=None, request_id=None, retry=DEFAULT, timeout=None, metadata=())[source]¶
Create a cluster in a specified project.
- Parameters
project_id (str) – Google Cloud project ID that the cluster belongs to.
region (str) – Cloud Dataproc region in which to handle the request.
cluster_name (str) – Name of the cluster to create.
labels (dict[str, str] | None) – Labels that will be assigned to created cluster.
cluster_config (dict | google.cloud.dataproc_v1.Cluster | None) – The cluster config to create. If a dict is provided, it must be of the same form as the protobuf message
ClusterConfig
.virtual_cluster_config (dict | None) – The virtual cluster config, used when creating a Dataproc cluster that does not directly control the underlying compute resources, for example, when creating a Dataproc-on-GKE cluster with
VirtualClusterConfig
.request_id (str | None) – A unique id used to identify the request. If the server receives two CreateClusterRequest requests with the same ID, the second request will be ignored, and an operation created for the first one and stored in the backend is returned.
retry (google.api_core.retry.Retry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.
timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.
metadata (collections.abc.Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.
- delete_cluster(region, cluster_name, project_id, cluster_uuid=None, request_id=None, retry=DEFAULT, timeout=None, metadata=())[source]¶
Delete a cluster in a project.
- Parameters
project_id (str) – Google Cloud project ID that the cluster belongs to.
region (str) – Cloud Dataproc region in which to handle the request.
cluster_name (str) – Name of the cluster to delete.
cluster_uuid (str | None) – If specified, the RPC should fail if cluster with the UUID does not exist.
request_id (str | None) – A unique id used to identify the request. If the server receives two DeleteClusterRequest requests with the same ID, the second request will be ignored, and an operation created for the first one and stored in the backend is returned.
retry (google.api_core.retry.Retry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.
timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.
metadata (collections.abc.Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.
- diagnose_cluster(region, cluster_name, project_id, tarball_gcs_dir=None, diagnosis_interval=None, jobs=None, yarn_application_ids=None, retry=DEFAULT, timeout=None, metadata=())[source]¶
Get cluster diagnostic information.
After the operation completes, the response contains the Cloud Storage URI of the diagnostic output report containing a summary of collected diagnostics.
- Parameters
project_id (str) – Google Cloud project ID that the cluster belongs to.
region (str) – Cloud Dataproc region in which to handle the request.
cluster_name (str) – Name of the cluster.
tarball_gcs_dir (str | None) – The output Cloud Storage directory for the diagnostic tarball. If not specified, a task-specific directory in the cluster’s staging bucket will be used.
diagnosis_interval (dict | google.type.interval_pb2.Interval | None) – Time interval in which diagnosis should be carried out on the cluster.
jobs (collections.abc.MutableSequence[str] | None) – Specifies a list of jobs on which diagnosis is to be performed. Format: projects/{project}/regions/{region}/jobs/{job}
yarn_application_ids (collections.abc.MutableSequence[str] | None) – Specifies a list of yarn applications on which diagnosis is to be performed.
retry (google.api_core.retry.Retry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.
timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.
metadata (collections.abc.Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.
- get_cluster(region, cluster_name, project_id, retry=DEFAULT, timeout=None, metadata=())[source]¶
Get the resource representation for a cluster in a project.
- Parameters
project_id (str) – Google Cloud project ID that the cluster belongs to.
region (str) – Cloud Dataproc region to handle the request.
cluster_name (str) – The cluster name.
retry (google.api_core.retry.Retry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.
timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.
metadata (collections.abc.Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.
- list_clusters(region, filter_, project_id, page_size=None, retry=DEFAULT, timeout=None, metadata=())[source]¶
List all regions/{region}/clusters in a project.
- Parameters
project_id (str) – Google Cloud project ID that the cluster belongs to.
region (str) – Cloud Dataproc region to handle the request.
filter – To constrain the clusters to. Case-sensitive.
page_size (int | None) – The maximum number of resources contained in the underlying API response. If page streaming is performed per-resource, this parameter does not affect the return value. If page streaming is performed per-page, this determines the maximum number of resources in a page.
retry (google.api_core.retry.Retry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.
timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.
metadata (collections.abc.Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.
- update_cluster(cluster_name, cluster, update_mask, project_id, region, graceful_decommission_timeout=None, request_id=None, retry=DEFAULT, timeout=None, metadata=())[source]¶
Update a cluster in a project.
- Parameters
project_id (str) – Google Cloud project ID that the cluster belongs to.
region (str) – Cloud Dataproc region to handle the request.
cluster_name (str) – The cluster name.
cluster (dict | google.cloud.dataproc_v1.Cluster) – Changes to the cluster. If a dict is provided, it must be of the same form as the protobuf message
Cluster
.update_mask (dict | google.protobuf.field_mask_pb2.FieldMask) –
Specifies the path, relative to
Cluster
, of the field to update. For example, to change the number of workers in a cluster to 5, this would be specified asconfig.worker_config.num_instances
, and thePATCH
request body would specify the new value:{"config": {"workerConfig": {"numInstances": "5"}}}
Similarly, to change the number of preemptible workers in a cluster to 5, this would be
config.secondary_worker_config.num_instances
and thePATCH
request body would be:{"config": {"secondaryWorkerConfig": {"numInstances": "5"}}}
If a dict is provided, it must be of the same form as the protobuf message
FieldMask
.graceful_decommission_timeout (dict | google.protobuf.duration_pb2.Duration | None) –
Timeout for graceful YARN decommissioning. Graceful decommissioning allows removing nodes from the cluster without interrupting jobs in progress. Timeout specifies how long to wait for jobs in progress to finish before forcefully removing nodes (and potentially interrupting jobs). Default timeout is 0 (for forceful decommission), and the maximum allowed timeout is one day.
Only supported on Dataproc image versions 1.2 and higher.
If a dict is provided, it must be of the same form as the protobuf message
Duration
.request_id (str | None) – A unique id used to identify the request. If the server receives two UpdateClusterRequest requests with the same ID, the second request will be ignored, and an operation created for the first one and stored in the backend is returned.
retry (google.api_core.retry.Retry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.
timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.
metadata (collections.abc.Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.
- start_cluster(region, project_id, cluster_name, cluster_uuid=None, request_id=None, retry=DEFAULT, timeout=None, metadata=())[source]¶
Start a cluster in a project.
- Parameters
region (str) – Cloud Dataproc region to handle the request.
project_id (str) – Google Cloud project ID that the cluster belongs to.
cluster_name (str) – The cluster name.
cluster_uuid (str | None) – The cluster UUID
request_id (str | None) – A unique id used to identify the request. If the server receives two UpdateClusterRequest requests with the same ID, the second request will be ignored, and an operation created for the first one and stored in the backend is returned.
retry (google.api_core.retry.Retry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.
timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.
metadata (collections.abc.Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.
- Returns
An instance of
google.api_core.operation.Operation
- Return type
- stop_cluster(region, project_id, cluster_name, cluster_uuid=None, request_id=None, retry=DEFAULT, timeout=None, metadata=())[source]¶
Start a cluster in a project.
- Parameters
region (str) – Cloud Dataproc region to handle the request.
project_id (str) – Google Cloud project ID that the cluster belongs to.
cluster_name (str) – The cluster name.
cluster_uuid (str | None) – The cluster UUID
request_id (str | None) – A unique id used to identify the request. If the server receives two UpdateClusterRequest requests with the same ID, the second request will be ignored, and an operation created for the first one and stored in the backend is returned.
retry (google.api_core.retry.Retry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.
timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.
metadata (collections.abc.Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.
- Returns
An instance of
google.api_core.operation.Operation
- Return type
- create_workflow_template(template, project_id, region, retry=DEFAULT, timeout=None, metadata=())[source]¶
Create a new workflow template.
- Parameters
project_id (str) – Google Cloud project ID that the cluster belongs to.
region (str) – Cloud Dataproc region to handle the request.
template (dict | google.cloud.dataproc_v1.WorkflowTemplate) – The Dataproc workflow template to create. If a dict is provided, it must be of the same form as the protobuf message WorkflowTemplate.
retry (google.api_core.retry.Retry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.
timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.
metadata (collections.abc.Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.
- instantiate_workflow_template(template_name, project_id, region, version=None, request_id=None, parameters=None, retry=DEFAULT, timeout=None, metadata=())[source]¶
Instantiate a template and begins execution.
- Parameters
template_name (str) – Name of template to instantiate.
project_id (str) – Google Cloud project ID that the cluster belongs to.
region (str) – Cloud Dataproc region to handle the request.
version (int | None) – Version of workflow template to instantiate. If specified, the workflow will be instantiated only if the current version of the workflow template has the supplied version. This option cannot be used to instantiate a previous version of workflow template.
request_id (str | None) – A tag that prevents multiple concurrent workflow instances with the same tag from running. This mitigates risk of concurrent instances started due to retries.
parameters (dict[str, str] | None) – Map from parameter names to values that should be used for those parameters. Values may not exceed 100 characters.
retry (google.api_core.retry.Retry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.
timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.
metadata (collections.abc.Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.
- instantiate_inline_workflow_template(template, project_id, region, request_id=None, retry=DEFAULT, timeout=None, metadata=())[source]¶
Instantiate a template and begin execution.
- Parameters
template (dict | google.cloud.dataproc_v1.WorkflowTemplate) – The workflow template to instantiate. If a dict is provided, it must be of the same form as the protobuf message WorkflowTemplate.
project_id (str) – Google Cloud project ID that the cluster belongs to.
region (str) – Cloud Dataproc region to handle the request.
request_id (str | None) – A tag that prevents multiple concurrent workflow instances with the same tag from running. This mitigates risk of concurrent instances started due to retries.
retry (google.api_core.retry.Retry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.
timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.
metadata (collections.abc.Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.
- wait_for_job(job_id, project_id, region, wait_time=10, timeout=None)[source]¶
Poll a job to check if it has finished.
- get_job(job_id, project_id, region, retry=DEFAULT, timeout=None, metadata=())[source]¶
Get the resource representation for a job in a project.
- Parameters
job_id (str) – Dataproc job ID.
project_id (str) – Google Cloud project ID that the cluster belongs to.
region (str) – Cloud Dataproc region to handle the request.
retry (google.api_core.retry.Retry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.
timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.
metadata (collections.abc.Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.
- submit_job(job, project_id, region, request_id=None, retry=DEFAULT, timeout=None, metadata=())[source]¶
Submit a job to a cluster.
- Parameters
job (dict | google.cloud.dataproc_v1.Job) – The job resource. If a dict is provided, it must be of the same form as the protobuf message Job.
project_id (str) – Google Cloud project ID that the cluster belongs to.
region (str) – Cloud Dataproc region to handle the request.
request_id (str | None) – A tag that prevents multiple concurrent workflow instances with the same tag from running. This mitigates risk of concurrent instances started due to retries.
retry (google.api_core.retry.Retry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.
timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.
metadata (collections.abc.Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.
- cancel_job(job_id, project_id, region=None, retry=DEFAULT, timeout=None, metadata=())[source]¶
Start a job cancellation request.
- Parameters
project_id (str) – Google Cloud project ID that the cluster belongs to.
region (str | None) – Cloud Dataproc region to handle the request.
job_id (str) – The job ID.
retry (google.api_core.retry.Retry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.
timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.
metadata (collections.abc.Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.
- create_batch(region, project_id, batch, batch_id=None, request_id=None, retry=DEFAULT, timeout=None, metadata=())[source]¶
Create a batch workload.
- Parameters
project_id (str) – Google Cloud project ID that the cluster belongs to.
region (str) – Cloud Dataproc region to handle the request.
batch (dict | google.cloud.dataproc_v1.Batch) – The batch to create.
batch_id (str | None) – The ID to use for the batch, which will become the final component of the batch’s resource name. This value must be of 4-63 characters. Valid characters are
[a-z][0-9]-
.request_id (str | None) – A unique id used to identify the request. If the server receives two CreateBatchRequest requests with the same ID, the second request will be ignored, and an operation created for the first one and stored in the backend is returned.
retry (google.api_core.retry.Retry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.
timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.
metadata (collections.abc.Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.
- delete_batch(batch_id, region, project_id, retry=DEFAULT, timeout=None, metadata=())[source]¶
Delete the batch workload resource.
- Parameters
batch_id (str) – The batch ID.
project_id (str) – Google Cloud project ID that the cluster belongs to.
region (str) – Cloud Dataproc region to handle the request.
retry (google.api_core.retry.Retry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.
timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.
metadata (collections.abc.Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.
- get_batch(batch_id, region, project_id, retry=DEFAULT, timeout=None, metadata=())[source]¶
Get the batch workload resource representation.
- Parameters
batch_id (str) – The batch ID.
project_id (str) – Google Cloud project ID that the cluster belongs to.
region (str) – Cloud Dataproc region to handle the request.
retry (google.api_core.retry.Retry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.
timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.
metadata (collections.abc.Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.
- list_batches(region, project_id, page_size=None, page_token=None, retry=DEFAULT, timeout=None, metadata=(), filter=None, order_by=None)[source]¶
List batch workloads.
- Parameters
project_id (str) – Google Cloud project ID that the cluster belongs to.
region (str) – Cloud Dataproc region to handle the request.
page_size (int | None) – The maximum number of batches to return in each response. The service may return fewer than this value. The default page size is 20; the maximum page size is 1000.
page_token (str | None) – A page token received from a previous
ListBatches
call. Provide this token to retrieve the subsequent page.retry (google.api_core.retry.Retry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.
timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.
metadata (collections.abc.Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.
filter (str | None) – Result filters as specified in ListBatchesRequest
order_by (str | None) – How to order results as specified in ListBatchesRequest
- wait_for_batch(batch_id, region, project_id, wait_check_interval=10, retry=DEFAULT, timeout=None, metadata=())[source]¶
Wait for a batch job to complete.
After submission of a batch job, the operator waits for the job to complete. This hook is, however, useful in the case when Airflow is restarted or the task pid is killed for any reason. In this case, the creation would happen again, catching the raised AlreadyExists, and fail to this function for waiting on completion.
- Parameters
batch_id (str) – The batch ID.
region (str) – Cloud Dataproc region to handle the request.
project_id (str) – Google Cloud project ID that the cluster belongs to.
wait_check_interval (int) – The amount of time to pause between checks for job completion.
retry (google.api_core.retry_async.AsyncRetry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.
timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.
metadata (collections.abc.Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.
- class airflow.providers.google.cloud.hooks.dataproc.DataprocAsyncHook(gcp_conn_id='google_cloud_default', impersonation_chain=None, **kwargs)[source]¶
Bases:
airflow.providers.google.common.hooks.base_google.GoogleBaseHook
Asynchronous interaction with Google Cloud Dataproc APIs.
All the methods in the hook where project_id is used must be called with keyword arguments rather than positional.
- async create_cluster(region, project_id, cluster_name, cluster_config=None, virtual_cluster_config=None, labels=None, request_id=None, retry=DEFAULT, timeout=None, metadata=())[source]¶
Create a cluster in a project.
- Parameters
project_id (str) – Google Cloud project ID that the cluster belongs to.
region (str) – Cloud Dataproc region in which to handle the request.
cluster_name (str) – Name of the cluster to create.
labels (dict[str, str] | None) – Labels that will be assigned to created cluster.
cluster_config (dict | google.cloud.dataproc_v1.Cluster | None) – The cluster config to create. If a dict is provided, it must be of the same form as the protobuf message
ClusterConfig
.virtual_cluster_config (dict | None) – The virtual cluster config, used when creating a Dataproc cluster that does not directly control the underlying compute resources, for example, when creating a Dataproc-on-GKE cluster with
VirtualClusterConfig
.request_id (str | None) – A unique id used to identify the request. If the server receives two CreateClusterRequest requests with the same ID, the second request will be ignored, and an operation created for the first one and stored in the backend is returned.
retry (google.api_core.retry_async.AsyncRetry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.
timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.
metadata (collections.abc.Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.
- async delete_cluster(region, cluster_name, project_id, cluster_uuid=None, request_id=None, retry=DEFAULT, timeout=None, metadata=())[source]¶
Delete a cluster in a project.
- Parameters
project_id (str) – Google Cloud project ID that the cluster belongs to.
region (str) – Cloud Dataproc region in which to handle the request.
cluster_name (str) – Name of the cluster to delete.
cluster_uuid (str | None) – If specified, the RPC should fail if cluster with the UUID does not exist.
request_id (str | None) – A unique id used to identify the request. If the server receives two DeleteClusterRequest requests with the same ID, the second request will be ignored, and an operation created for the first one and stored in the backend is returned.
retry (google.api_core.retry_async.AsyncRetry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.
timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.
metadata (collections.abc.Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.
- async diagnose_cluster(region, cluster_name, project_id, tarball_gcs_dir=None, diagnosis_interval=None, jobs=None, yarn_application_ids=None, retry=DEFAULT, timeout=None, metadata=())[source]¶
Get cluster diagnostic information.
After the operation completes, the response contains the Cloud Storage URI of the diagnostic output report containing a summary of collected diagnostics.
- Parameters
project_id (str) – Google Cloud project ID that the cluster belongs to.
region (str) – Cloud Dataproc region in which to handle the request.
cluster_name (str) – Name of the cluster.
tarball_gcs_dir (str | None) – The output Cloud Storage directory for the diagnostic tarball. If not specified, a task-specific directory in the cluster’s staging bucket will be used.
diagnosis_interval (dict | google.type.interval_pb2.Interval | None) – Time interval in which diagnosis should be carried out on the cluster.
jobs (collections.abc.MutableSequence[str] | None) – Specifies a list of jobs on which diagnosis is to be performed. Format: projects/{project}/regions/{region}/jobs/{job}
yarn_application_ids (collections.abc.MutableSequence[str] | None) – Specifies a list of yarn applications on which diagnosis is to be performed.
retry (google.api_core.retry_async.AsyncRetry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.
timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.
metadata (collections.abc.Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.
- async get_cluster(region, cluster_name, project_id, retry=DEFAULT, timeout=None, metadata=())[source]¶
Get the resource representation for a cluster in a project.
- Parameters
project_id (str) – Google Cloud project ID that the cluster belongs to.
region (str) – Cloud Dataproc region to handle the request.
cluster_name (str) – The cluster name.
retry (google.api_core.retry_async.AsyncRetry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.
timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.
metadata (collections.abc.Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.
- async list_clusters(region, filter_, project_id, page_size=None, retry=DEFAULT, timeout=None, metadata=())[source]¶
List all regions/{region}/clusters in a project.
- Parameters
project_id (str) – Google Cloud project ID that the cluster belongs to.
region (str) – Cloud Dataproc region to handle the request.
filter – To constrain the clusters to. Case-sensitive.
page_size (int | None) – The maximum number of resources contained in the underlying API response. If page streaming is performed per-resource, this parameter does not affect the return value. If page streaming is performed per-page, this determines the maximum number of resources in a page.
retry (google.api_core.retry_async.AsyncRetry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.
timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.
metadata (collections.abc.Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.
- async update_cluster(cluster_name, cluster, update_mask, project_id, region, graceful_decommission_timeout=None, request_id=None, retry=DEFAULT, timeout=None, metadata=())[source]¶
Update a cluster in a project.
- Parameters
project_id (str) – Google Cloud project ID that the cluster belongs to.
region (str) – Cloud Dataproc region to handle the request.
cluster_name (str) – The cluster name.
cluster (dict | google.cloud.dataproc_v1.Cluster) – Changes to the cluster. If a dict is provided, it must be of the same form as the protobuf message
Cluster
.update_mask (dict | google.protobuf.field_mask_pb2.FieldMask) –
Specifies the path, relative to
Cluster
, of the field to update. For example, to change the number of workers in a cluster to 5, this would be specified asconfig.worker_config.num_instances
, and thePATCH
request body would specify the new value:{"config": {"workerConfig": {"numInstances": "5"}}}
Similarly, to change the number of preemptible workers in a cluster to 5, this would be
config.secondary_worker_config.num_instances
and thePATCH
request body would be:{"config": {"secondaryWorkerConfig": {"numInstances": "5"}}}
If a dict is provided, it must be of the same form as the protobuf message
FieldMask
.graceful_decommission_timeout (dict | google.protobuf.duration_pb2.Duration | None) –
Timeout for graceful YARN decommissioning. Graceful decommissioning allows removing nodes from the cluster without interrupting jobs in progress. Timeout specifies how long to wait for jobs in progress to finish before forcefully removing nodes (and potentially interrupting jobs). Default timeout is 0 (for forceful decommission), and the maximum allowed timeout is one day.
Only supported on Dataproc image versions 1.2 and higher.
If a dict is provided, it must be of the same form as the protobuf message
Duration
.request_id (str | None) – A unique id used to identify the request. If the server receives two UpdateClusterRequest requests with the same ID, the second request will be ignored, and an operation created for the first one and stored in the backend is returned.
retry (google.api_core.retry_async.AsyncRetry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.
timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.
metadata (collections.abc.Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.
- async create_workflow_template(template, project_id, region, retry=DEFAULT, timeout=None, metadata=())[source]¶
Create a new workflow template.
- Parameters
project_id (str) – Google Cloud project ID that the cluster belongs to.
region (str) – Cloud Dataproc region to handle the request.
template (dict | google.cloud.dataproc_v1.WorkflowTemplate) – The Dataproc workflow template to create. If a dict is provided, it must be of the same form as the protobuf message WorkflowTemplate.
retry (google.api_core.retry_async.AsyncRetry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.
timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.
metadata (collections.abc.Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.
- async instantiate_workflow_template(template_name, project_id, region, version=None, request_id=None, parameters=None, retry=DEFAULT, timeout=None, metadata=())[source]¶
Instantiate a template and begins execution.
- Parameters
template_name (str) – Name of template to instantiate.
project_id (str) – Google Cloud project ID that the cluster belongs to.
region (str) – Cloud Dataproc region to handle the request.
version (int | None) – Version of workflow template to instantiate. If specified, the workflow will be instantiated only if the current version of the workflow template has the supplied version. This option cannot be used to instantiate a previous version of workflow template.
request_id (str | None) – A tag that prevents multiple concurrent workflow instances with the same tag from running. This mitigates risk of concurrent instances started due to retries.
parameters (dict[str, str] | None) – Map from parameter names to values that should be used for those parameters. Values may not exceed 100 characters.
retry (google.api_core.retry_async.AsyncRetry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.
timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.
metadata (collections.abc.Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.
- async instantiate_inline_workflow_template(template, project_id, region, request_id=None, retry=DEFAULT, timeout=None, metadata=())[source]¶
Instantiate a template and begin execution.
- Parameters
template (dict | google.cloud.dataproc_v1.WorkflowTemplate) – The workflow template to instantiate. If a dict is provided, it must be of the same form as the protobuf message WorkflowTemplate.
project_id (str) – Google Cloud project ID that the cluster belongs to.
region (str) – Cloud Dataproc region to handle the request.
request_id (str | None) – A tag that prevents multiple concurrent workflow instances with the same tag from running. This mitigates risk of concurrent instances started due to retries.
retry (google.api_core.retry_async.AsyncRetry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.
timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.
metadata (collections.abc.Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.
- async get_job(job_id, project_id, region, retry=DEFAULT, timeout=None, metadata=())[source]¶
Get the resource representation for a job in a project.
- Parameters
job_id (str) – Dataproc job ID.
project_id (str) – Google Cloud project ID that the cluster belongs to.
region (str) – Cloud Dataproc region to handle the request.
retry (google.api_core.retry_async.AsyncRetry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.
timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.
metadata (collections.abc.Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.
- async submit_job(job, project_id, region, request_id=None, retry=DEFAULT, timeout=None, metadata=())[source]¶
Submit a job to a cluster.
- Parameters
job (dict | google.cloud.dataproc_v1.Job) – The job resource. If a dict is provided, it must be of the same form as the protobuf message Job.
project_id (str) – Google Cloud project ID that the cluster belongs to.
region (str) – Cloud Dataproc region to handle the request.
request_id (str | None) – A tag that prevents multiple concurrent workflow instances with the same tag from running. This mitigates risk of concurrent instances started due to retries.
retry (google.api_core.retry_async.AsyncRetry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.
timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.
metadata (collections.abc.Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.
- async cancel_job(job_id, project_id, region=None, retry=DEFAULT, timeout=None, metadata=())[source]¶
Start a job cancellation request.
- Parameters
project_id (str) – Google Cloud project ID that the cluster belongs to.
region (str | None) – Cloud Dataproc region to handle the request.
job_id (str) – The job ID.
retry (google.api_core.retry_async.AsyncRetry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.
timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.
metadata (collections.abc.Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.
- async create_batch(region, project_id, batch, batch_id=None, request_id=None, retry=DEFAULT, timeout=None, metadata=())[source]¶
Create a batch workload.
- Parameters
project_id (str) – Google Cloud project ID that the cluster belongs to.
region (str) – Cloud Dataproc region to handle the request.
batch (dict | google.cloud.dataproc_v1.Batch) – The batch to create.
batch_id (str | None) – The ID to use for the batch, which will become the final component of the batch’s resource name. This value must be of 4-63 characters. Valid characters are
[a-z][0-9]-
.request_id (str | None) – A unique id used to identify the request. If the server receives two CreateBatchRequest requests with the same ID, the second request will be ignored, and an operation created for the first one and stored in the backend is returned.
retry (google.api_core.retry_async.AsyncRetry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.
timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.
metadata (collections.abc.Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.
- async delete_batch(batch_id, region, project_id, retry=DEFAULT, timeout=None, metadata=())[source]¶
Delete the batch workload resource.
- Parameters
batch_id (str) – The batch ID.
project_id (str) – Google Cloud project ID that the cluster belongs to.
region (str) – Cloud Dataproc region to handle the request.
retry (google.api_core.retry_async.AsyncRetry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.
timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.
metadata (collections.abc.Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.
- async get_batch(batch_id, region, project_id, retry=DEFAULT, timeout=None, metadata=())[source]¶
Get the batch workload resource representation.
- Parameters
batch_id (str) – The batch ID.
project_id (str) – Google Cloud project ID that the cluster belongs to.
region (str) – Cloud Dataproc region to handle the request.
retry (google.api_core.retry_async.AsyncRetry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.
timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.
metadata (collections.abc.Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.
- async list_batches(region, project_id, page_size=None, page_token=None, retry=DEFAULT, timeout=None, metadata=(), filter=None, order_by=None)[source]¶
List batch workloads.
- Parameters
project_id (str) – Google Cloud project ID that the cluster belongs to.
region (str) – Cloud Dataproc region to handle the request.
page_size (int | None) – The maximum number of batches to return in each response. The service may return fewer than this value. The default page size is 20; the maximum page size is 1000.
page_token (str | None) – A page token received from a previous
ListBatches
call. Provide this token to retrieve the subsequent page.retry (google.api_core.retry_async.AsyncRetry | google.api_core.gapic_v1.method._MethodDefault) – A retry object used to retry requests. If None, requests will not be retried.
timeout (float | None) – The amount of time, in seconds, to wait for the request to complete. If retry is specified, the timeout applies to each individual attempt.
metadata (collections.abc.Sequence[tuple[str, str]]) – Additional metadata that is provided to the method.
filter (str | None) – Result filters as specified in ListBatchesRequest
order_by (str | None) – How to order results as specified in ListBatchesRequest