airflow.providers.google.cloud.openlineage.utils¶
Attributes¶
Classes¶
- BigQueryJobRunFacet: Facet that represents relevant statistics of a BigQuery run.
Functions¶
- merge_column_lineage_facets(facets): Merge multiple column lineage facets into a single consolidated facet.
- extract_ds_name_from_gcs_path(path): Extract and process the dataset name from a given path.
- get_facets_from_bq_table(table): Get facets from BigQuery table object.
- get_namespace_name_from_source_uris(source_uris)
- get_identity_column_lineage_facet(dest_field_names, input_datasets): Get column lineage facet for identity transformations.
- get_from_nullable_chain(source, chain): Get object from nested structure of objects, where it's not guaranteed that all keys in the nested structure exist.
- inject_openlineage_properties_into_dataproc_job(job, context, inject_parent_job_info, inject_transport_info): Inject OpenLineage properties into Spark job definition.
- inject_openlineage_properties_into_dataproc_batch(batch, context, inject_parent_job_info, inject_transport_info): Inject OpenLineage properties into Dataproc batch definition.
- inject_openlineage_properties_into_dataproc_workflow_template(template, context, inject_parent_job_info, inject_transport_info): Inject OpenLineage properties into all Spark jobs within Workflow Template.
Module Contents¶
- airflow.providers.google.cloud.openlineage.utils.merge_column_lineage_facets(facets)[source]¶
Merge multiple column lineage facets into a single consolidated facet.
Specifically, it aggregates input fields and transformations for each field across all provided facets.
- Args:
facets: Column Lineage Facets to be merged.
- Returns:
A new Column Lineage Facet containing all fields, their respective input fields and transformations.
- Notes:
Input fields are uniquely identified by their (namespace, name, field) tuple.
If multiple facets contain the same field with the same input field, those input fields are merged without duplication.
Transformations associated with input fields are also merged. If transformations are not supported by the version of the InputField class, they will be omitted.
Transformation merging relies on a composite key of the field name and input field tuple to track and consolidate transformations.
- Examples:
Case 1: Two facets with the same input field
>>> facet1 = ColumnLineageDatasetFacet(
...     fields={"columnA": Fields(inputFields=[InputField("namespace1", "dataset1", "field1")])}
... )
>>> facet2 = ColumnLineageDatasetFacet(
...     fields={"columnA": Fields(inputFields=[InputField("namespace1", "dataset1", "field1")])}
... )
>>> merged = merge_column_lineage_facets([facet1, facet2])
>>> merged.fields["columnA"].inputFields
[InputField("namespace1", "dataset1", "field1")]
Case 2: Two facets with different transformations for the same input field
>>> facet1 = ColumnLineageDatasetFacet(
...     fields={
...         "columnA": Fields(
...             inputFields=[InputField("namespace1", "dataset1", "field1", transformations=["t1"])]
...         )
...     }
... )
>>> facet2 = ColumnLineageDatasetFacet(
...     fields={
...         "columnA": Fields(
...             inputFields=[InputField("namespace1", "dataset1", "field1", transformations=["t2"])]
...         )
...     }
... )
>>> merged = merge_column_lineage_facets([facet1, facet2])
>>> merged.fields["columnA"].inputFields[0].transformations
["t1", "t2"]
- airflow.providers.google.cloud.openlineage.utils.extract_ds_name_from_gcs_path(path)[source]¶
Extract and process the dataset name from a given path.
- Args:
path: The path to process, e.g. the path of a GCS file.
- Returns:
The processed dataset name.
- Examples:
>>> extract_ds_name_from_gcs_path("/dir/file.*")
'dir'
>>> extract_ds_name_from_gcs_path("/dir/pre_")
'dir'
>>> extract_ds_name_from_gcs_path("/dir/file.txt")
'dir/file.txt'
>>> extract_ds_name_from_gcs_path("/dir/file.")
'dir'
>>> extract_ds_name_from_gcs_path("/dir/")
'dir'
>>> extract_ds_name_from_gcs_path("")
'/'
>>> extract_ds_name_from_gcs_path("/")
'/'
>>> extract_ds_name_from_gcs_path(".")
'/'
- airflow.providers.google.cloud.openlineage.utils.get_facets_from_bq_table(table)[source]¶
Get facets from BigQuery table object.
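A minimal usage sketch, assuming a configured google-cloud-bigquery client and an existing table; the project, dataset and table identifiers are placeholders, and the exact keys of the returned facet dictionary may vary by provider version.

from google.cloud import bigquery

from airflow.providers.google.cloud.openlineage.utils import get_facets_from_bq_table

client = bigquery.Client(project="my-project")  # hypothetical project
table = client.get_table("my-project.my_dataset.my_table")  # hypothetical table
facets = get_facets_from_bq_table(table)
# The returned dictionary can be attached to an OpenLineage Dataset describing the table,
# e.g. Dataset(namespace="bigquery", name="my-project.my_dataset.my_table", facets=facets).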
- airflow.providers.google.cloud.openlineage.utils.get_namespace_name_from_source_uris(source_uris)[source]¶
- airflow.providers.google.cloud.openlineage.utils.get_identity_column_lineage_facet(dest_field_names, input_datasets)[source]¶
Get column lineage facet for identity transformations.
This function generates a simple column lineage facet, where each destination column consists of source columns of the same name from all input datasets that have that column. The lineage assumes there are no transformations applied, meaning the columns retain their identity between the source and destination datasets.
- Args:
dest_field_names: A list of destination column names for which lineage should be determined.
input_datasets: A list of input datasets with schema facets.
- Returns:
- A dictionary containing a single key, columnLineage, mapped to a ColumnLineageDatasetFacet.
If no column lineage can be determined, an empty dictionary is returned - see Notes below.
- Notes:
If any input dataset lacks a schema facet, the function immediately returns an empty dictionary.
If any field in the source dataset’s schema is not present in the destination table, the function returns an empty dictionary. The destination table can contain extra fields, but all source columns should be present in the destination table.
If none of the destination columns can be matched to input dataset columns, an empty dictionary is returned.
Extra columns in the destination table that do not exist in the input datasets are ignored and skipped in the lineage facet, as they cannot be traced back to a source column.
The function assumes there are no transformations applied, meaning the columns retain their identity between the source and destination datasets.
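A minimal sketch of the identity-lineage case, assuming the facet classes re-exported by airflow.providers.common.compat.openlineage.facet; the namespace, dataset name and column types are placeholders.

from airflow.providers.common.compat.openlineage.facet import (
    Dataset,
    SchemaDatasetFacet,
    SchemaDatasetFacetFields,
)
from airflow.providers.google.cloud.openlineage.utils import get_identity_column_lineage_facet

# Hypothetical input dataset carrying a schema facet.
schema = SchemaDatasetFacet(
    fields=[
        SchemaDatasetFacetFields(name="id", type="INTEGER"),
        SchemaDatasetFacetFields(name="name", type="STRING"),
    ]
)
source = Dataset(namespace="gs://my-bucket", name="dir/file.csv", facets={"schema": schema})

facets = get_identity_column_lineage_facet(dest_field_names=["id", "name"], input_datasets=[source])
# facets == {"columnLineage": ColumnLineageDatasetFacet(...)} with "id" and "name" each
# mapped to the column of the same name in the source dataset, with no transformations.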
- class airflow.providers.google.cloud.openlineage.utils.BigQueryJobRunFacet[source]¶
Bases:
airflow.providers.common.compat.openlineage.facet.RunFacet
Facet that represents relevant statistics of a BigQuery run.
This facet is used to provide statistics about a BigQuery run.
- Parameters:
cached – BigQuery caches query results. The rest of the statistics will not be provided for cached queries.
billedBytes – How many bytes BigQuery bills for.
properties – Full property tree of the BigQuery run.
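An illustrative construction of the facet; normally the provider builds it from the BigQuery job statistics, the values below are made up, and the assumption that properties holds a JSON-serialized property tree is ours.

from airflow.providers.google.cloud.openlineage.utils import BigQueryJobRunFacet

facet = BigQueryJobRunFacet(
    cached=False,
    billedBytes=10485760,  # hypothetical value taken from the job statistics
    properties='{"statistics": {"query": {"totalBytesBilled": "10485760"}}}',  # assumed JSON string
)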
- airflow.providers.google.cloud.openlineage.utils.get_from_nullable_chain(source, chain)[source]¶
Get object from nested structure of objects, where it’s not guaranteed that all keys in the nested structure exist.
Intended to replace a chain of dict.get() calls.
Example usage:
if (
    not job._properties.get("statistics")
    or not job._properties.get("statistics").get("query")
    or not job._properties.get("statistics").get("query").get("referencedTables")
):
    return None
result = job._properties.get("statistics").get("query").get("referencedTables")
becomes:
result = get_from_nullable_chain(job._properties, ["statistics", "query", "referencedTables"])
if not result:
    return None
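A self-contained usage sketch with a plain dictionary; the nested keys are illustrative.

from airflow.providers.google.cloud.openlineage.utils import get_from_nullable_chain

job_properties = {"statistics": {"query": {"totalBytesBilled": "1024"}}}

get_from_nullable_chain(job_properties, ["statistics", "query", "totalBytesBilled"])
# returns "1024"
get_from_nullable_chain(job_properties, ["statistics", "load", "outputRows"])
# returns None instead of raising, even though the "load" key does not exist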
- airflow.providers.google.cloud.openlineage.utils.inject_openlineage_properties_into_dataproc_job(job, context, inject_parent_job_info, inject_transport_info)[source]¶
Inject OpenLineage properties into Spark job definition.
- This function does not remove existing configurations or modify the job definition in any way, except to add the required OpenLineage properties if they are not already present.
- The entire properties injection process will be skipped if any of the following conditions is met:
The OpenLineage provider is not accessible.
The job type is unsupported.
Both inject_parent_job_info and inject_transport_info are set to False.
Additionally, specific information will not be injected if relevant OpenLineage properties already exist.
- Parent job information will not be injected if:
Any property prefixed with spark.openlineage.parent exists.
inject_parent_job_info is False.
- Transport information will not be injected if:
Any property prefixed with spark.openlineage.transport exists.
inject_transport_info is False.
- Args:
job: The original Dataproc job definition.
context: The Airflow context in which the job is running.
inject_parent_job_info: Flag indicating whether to inject parent job information.
inject_transport_info: Flag indicating whether to inject transport information.
- Returns:
The modified job definition with OpenLineage properties injected, if applicable.
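A minimal sketch of calling the helper before submitting a Dataproc job, e.g. from inside an operator's execute(); the cluster name, bucket and file paths are placeholders, and the job dictionary shape follows the Dataproc Job resource.

from airflow.providers.google.cloud.openlineage.utils import (
    inject_openlineage_properties_into_dataproc_job,
)


def prepare_dataproc_job(job: dict, context) -> dict:
    # Hypothetical helper: "context" is the Airflow task context, e.g. the
    # argument passed to an operator's execute() method.
    return inject_openlineage_properties_into_dataproc_job(
        job, context, inject_parent_job_info=True, inject_transport_info=True
    )


# Example job definition; all identifiers are placeholders. Existing "properties"
# entries are preserved; spark.openlineage.parent* and spark.openlineage.transport*
# keys are added only when not already present.
example_job = {
    "placement": {"cluster_name": "my-cluster"},
    "pyspark_job": {
        "main_python_file_uri": "gs://my-bucket/app.py",
        "properties": {"spark.executor.memory": "2g"},
    },
}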
- airflow.providers.google.cloud.openlineage.utils.inject_openlineage_properties_into_dataproc_batch(batch, context, inject_parent_job_info, inject_transport_info)[source]¶
Inject OpenLineage properties into Dataproc batch definition.
- This function does not remove existing configurations or modify the batch definition in any way, except to add the required OpenLineage properties if they are not already present.
- The entire properties injection process will be skipped if any of the following conditions is met:
The OpenLineage provider is not accessible.
The batch type is unsupported.
Both inject_parent_job_info and inject_transport_info are set to False.
Additionally, specific information will not be injected if relevant OpenLineage properties already exist.
- Parent job information will not be injected if:
Any property prefixed with spark.openlineage.parent exists.
inject_parent_job_info is False.
- Transport information will not be injected if:
Any property prefixed with spark.openlineage.transport exists.
inject_transport_info is False.
- Args:
batch: The original Dataproc batch definition.
context: The Airflow context in which the job is running.
inject_parent_job_info: Flag indicating whether to inject parent job information.
inject_transport_info: Flag indicating whether to inject transport information.
- Returns:
The modified batch definition with OpenLineage properties injected, if applicable.
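A similar sketch for a Dataproc batch; the batch dictionary shape follows the Dataproc Batch resource, where Spark properties are assumed to live under runtime_config, and all identifiers are placeholders.

from airflow.providers.google.cloud.openlineage.utils import (
    inject_openlineage_properties_into_dataproc_batch,
)


def prepare_dataproc_batch(batch: dict, context) -> dict:
    # Hypothetical helper: enrich a Dataproc batch definition before submission.
    return inject_openlineage_properties_into_dataproc_batch(
        batch, context, inject_parent_job_info=True, inject_transport_info=True
    )


# Example batch definition; bucket and file are placeholders.
example_batch = {
    "pyspark_batch": {"main_python_file_uri": "gs://my-bucket/app.py"},
    "runtime_config": {"properties": {"spark.executor.memory": "2g"}},
}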
- airflow.providers.google.cloud.openlineage.utils.inject_openlineage_properties_into_dataproc_workflow_template(template, context, inject_parent_job_info, inject_transport_info)[source]¶
Inject OpenLineage properties into all Spark jobs within Workflow Template.
- This function does not remove existing configurations or modify the job definitions in any way, except to add the required OpenLineage properties if they are not already present.
- The entire properties injection process for each job will be skipped if any of the following conditions is met:
The OpenLineage provider is not accessible.
The job type is unsupported.
Both inject_parent_job_info and inject_transport_info are set to False.
Additionally, specific information will not be injected if relevant OpenLineage properties already exist.
- Parent job information will not be injected if:
Any property prefixed with spark.openlineage.parent exists.
inject_parent_job_info is False.
- Transport information will not be injected if:
Any property prefixed with spark.openlineage.transport exists.
inject_transport_info is False.
- Args:
template: The original Dataproc Workflow Template definition.
context: The Airflow context in which the job is running.
inject_parent_job_info: Flag indicating whether to inject parent job information.
inject_transport_info: Flag indicating whether to inject transport information.
- Returns:
The modified Workflow Template definition with OpenLineage properties injected, if applicable.
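A minimal sketch for a Workflow Template, assuming the mapping form of a template with inline jobs; all names and URIs are placeholders, and only supported Spark-based jobs within the template are touched.

from airflow.providers.google.cloud.openlineage.utils import (
    inject_openlineage_properties_into_dataproc_workflow_template,
)


def prepare_workflow_template(template: dict, context) -> dict:
    # Hypothetical helper: enrich every Spark job in the template before instantiating it.
    return inject_openlineage_properties_into_dataproc_workflow_template(
        template, context, inject_parent_job_info=True, inject_transport_info=False
    )


# Example template with two inline jobs; identifiers are placeholders.
example_template = {
    "id": "my-workflow",
    "placement": {"managed_cluster": {"cluster_name": "my-cluster"}},
    "jobs": [
        {"step_id": "step-1", "pyspark_job": {"main_python_file_uri": "gs://my-bucket/a.py"}},
        {"step_id": "step-2", "spark_job": {"main_class": "com.example.App"}},
    ],
}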