airflow.providers.google.cloud.transfers.gcs_to_gcs

This module contains a Google Cloud Storage operator.

Module Contents

Classes

GCSToGCSOperator

Copies objects from a bucket to another, with renaming if requested.

Attributes

WILDCARD

airflow.providers.google.cloud.transfers.gcs_to_gcs.WILDCARD = '*'[source]
class airflow.providers.google.cloud.transfers.gcs_to_gcs.GCSToGCSOperator(*, source_bucket, source_object=None, source_objects=None, destination_bucket=None, destination_object=None, delimiter=None, move_object=False, replace=True, gcp_conn_id='google_cloud_default', last_modified_time=None, maximum_modified_time=None, is_older_than=None, impersonation_chain=None, source_object_required=False, exact_match=False, match_glob=None, **kwargs)[source]

Bases: airflow.models.BaseOperator

Copies objects from a bucket to another, with renaming if requested.

See also

For more information on how to use this operator, take a look at the guide: GCSToGCSOperator

Parameters
  • source_bucket – The source Google Cloud Storage bucket where the object is. (templated)

  • source_object – The source name of the object to copy in the Google cloud storage bucket. (templated) You can use only one wildcard for objects (filenames) within your bucket. The wildcard can appear inside the object name or at the end of the object name. Appending a wildcard to the bucket name is unsupported.

  • source_objects – A list of source name of the objects to copy in the Google cloud storage bucket. (templated)

  • destination_bucket – The destination Google Cloud Storage bucket where the object should be. If the destination_bucket is None, it defaults to source_bucket. (templated)

  • destination_object – The destination name of the object in the destination Google Cloud Storage bucket. (templated) If a wildcard is supplied in the source_object argument, this is the prefix that will be prepended to the final destination objects’ paths. Note that the source path’s part before the wildcard will be removed; if it needs to be retained it should be appended to destination_object. For example, with prefix foo/* and destination_object blah/, the file foo/baz will be copied to blah/baz; to retain the prefix write the destination_object as e.g. blah/foo, in which case the copied file will be named blah/foo/baz. The same thing applies to source objects inside source_objects.

  • move_object – When move object is True, the object is moved instead of copied to the new location. This is the equivalent of a mv command as opposed to a cp command.

  • replace – Whether you want to replace existing destination files or not.

  • delimiter – (Deprecated) This is used to restrict the result to only the ‘files’ in a given ‘folder’. If source_objects = [‘foo/bah/’] and delimiter = ‘.avro’, then only the ‘files’ in the folder ‘foo/bah/’ with ‘.avro’ delimiter will be copied to the destination object.

  • gcp_conn_id – (Optional) The connection ID used to connect to Google Cloud.

  • last_modified_time – When specified, the objects will be copied or moved, only if they were modified after last_modified_time. If tzinfo has not been set, UTC will be assumed.

  • maximum_modified_time – When specified, the objects will be copied or moved, only if they were modified before maximum_modified_time. If tzinfo has not been set, UTC will be assumed.

  • is_older_than – When specified, the objects will be copied if they are older than the specified time in seconds.

  • impersonation_chain (str | collections.abc.Sequence[str] | None) – Optional service account to impersonate using short-term credentials, or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated).

  • source_object_required – Whether you want to raise an exception when the source object doesn’t exist. It doesn’t have any effect when the source objects are folders or patterns.

  • exact_match – When specified, only exact match of the source object (filename) will be copied.

  • match_glob (str | None) – (Optional) filters objects based on the glob pattern given by the string ( e.g, '**/*/.json')

Example

The following Operator would copy a single file named sales/sales-2017/january.avro in the data bucket to the file named copied_sales/2017/january-backup.avro in the data_backup bucket

copy_single_file = GCSToGCSOperator(
    task_id="copy_single_file",
    source_bucket="data",
    source_objects=["sales/sales-2017/january.avro"],
    destination_bucket="data_backup",
    destination_object="copied_sales/2017/january-backup.avro",
    exact_match=True,
    gcp_conn_id=google_cloud_conn_id,
)

The following Operator would copy all the Avro files from sales/sales-2017 folder (i.e. all files with names starting with that prefix) in data bucket to the copied_sales/2017 folder in the data_backup bucket.

copy_files = GCSToGCSOperator(
    task_id='copy_files',
    source_bucket='data',
    source_objects=['sales/sales-2017'],
    destination_bucket='data_backup',
    destination_object='copied_sales/2017/',
    match_glob='**/*.avro'
    gcp_conn_id=google_cloud_conn_id
)

Or ::

copy_files = GCSToGCSOperator(
    task_id='copy_files',
    source_bucket='data',
    source_object='sales/sales-2017/*.avro',
    destination_bucket='data_backup',
    destination_object='copied_sales/2017/',
    gcp_conn_id=google_cloud_conn_id
)

The following Operator would move all the Avro files from sales/sales-2017 folder (i.e. all files with names starting with that prefix) in data bucket to the same folder in the data_backup bucket, deleting the original files in the process.

move_files = GCSToGCSOperator(
    task_id="move_files",
    source_bucket="data",
    source_object="sales/sales-2017/*.avro",
    destination_bucket="data_backup",
    move_object=True,
    gcp_conn_id=google_cloud_conn_id,
)
The following Operator would move all the Avro files from sales/sales-2019

and sales/sales-2020 folder in data bucket to the same folder in the data_backup bucket, deleting the original files in the process.

move_files = GCSToGCSOperator(
    task_id="move_files",
    source_bucket="data",
    source_objects=["sales/sales-2019/*.avro", "sales/sales-2020"],
    destination_bucket="data_backup",
    delimiter=".avro",
    move_object=True,
    gcp_conn_id=google_cloud_conn_id,
)
template_fields: collections.abc.Sequence[str] = ('source_bucket', 'source_object', 'source_objects', 'destination_bucket', 'destination_object',...[source]
ui_color = '#f0eee4'[source]
execute(context)[source]

Derive when creating an operator.

Context is the same dictionary used as when rendering jinja templates.

Refer to get_template_context for more context.

get_openlineage_facets_on_complete(task_instance)[source]

Implement _on_complete because execute method does preprocessing on internals.

This means we won’t have to normalize self.source_object and self.source_objects, destination bucket and so on.

Was this entry helpful?