Usage Guide
The Informatica provider enables automatic lineage tracking for Airflow tasks that define inlets and outlets.
How It Works
The Informatica plugin automatically detects tasks that define inlets and outlets and sends that lineage information to Informatica EDC when the tasks succeed. Once the EDC connection is configured, no additional setup is required beyond defining inlets and outlets in your tasks.
Key Features
- Automatic Lineage Detection: Plugin automatically detects tasks with lineage support
- EDC Integration: Native REST API integration with Informatica Enterprise Data Catalog
- Transparent Operation: No code changes required beyond inlet/outlet definitions
- Error Handling: Robust error handling for API failures and invalid objects
- Configurable: Extensive configuration options for different environments
Architecture
The provider consists of several key components:
- Hooks
  InformaticaEDCHook provides low-level EDC API access for authentication, object retrieval, and lineage creation.
- Extractors
  InformaticaLineageExtractor handles lineage data extraction and conversion to Airflow-compatible formats.
- Plugins
  InformaticaProviderPlugin registers listeners that monitor task lifecycle events and trigger lineage operations.
- Listeners
  Event-driven listeners that respond to task success/failure events and process lineage information.
Requirements
- Apache Airflow 3.0+
- Access to an Informatica Enterprise Data Catalog (EDC) instance
- Valid EDC credentials with API access permissions
Quick Start
1. Install the provider:
   pip install apache-airflow-providers-informatica
2. Configure a connection:
   Create an HTTP connection in the Airflow UI with the EDC server details, and set the security domain in the connection extras.
3. Add lineage to tasks:
   Define inlets and outlets in your tasks using EDC object URIs.
4. Run your DAG:
   The provider automatically handles lineage extraction when tasks succeed.
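As an alternative to the Airflow UI for the connection step, Airflow can also read a connection from an AIRFLOW_CONN_<CONN_ID> environment variable in JSON form. The sketch below only builds that variable's name and value. The host, port, and credentials are placeholders, and the "security_domain" extras key is an assumption for illustration, since this guide does not name the exact key the provider expects.

```python
import json

# Connection ID used elsewhere in this guide's hook example.
conn_id = "my_connection"

# Placeholder values; replace them with your EDC server details.
# Assumption: the extras key is called "security_domain" (not confirmed by this guide).
connection = {
    "conn_type": "http",
    "host": "edc.example.com",
    "port": 443,
    "login": "edc_user",
    "password": "change-me",
    "extra": {"security_domain": "Native"},
}

# Airflow looks up AIRFLOW_CONN_<CONN_ID>, with the connection ID upper-cased.
env_name = f"AIRFLOW_CONN_{conn_id.upper()}"
env_value = json.dumps(connection)

# Print a shell line you could adapt for the environment Airflow runs in.
print(f"export {env_name}='{env_value}'")
```

Set the variable in the environment of the Airflow scheduler and workers, not only in an interactive shell, so that tasks can see the connection.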
Example DAG
from airflow import DAG
from airflow.providers.standard.operators.python import PythonOperator
from datetime import datetime


def my_python_task(**kwargs):
    print("Hello Informatica Lineage!")


with DAG(
    dag_id="example_informatica_lineage_dag",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    python_task = PythonOperator(
        task_id="my_python_task",
        python_callable=my_python_task,
        inlets=[{"dataset_uri": "edc://object/source_table_abc123"}],
        outlets=[{"dataset_uri": "edc://object/target_table_xyz789"}],
    )
When this task succeeds, the provider automatically creates a lineage link between the source and target objects in EDC.
Hooks
InformaticaEDCHook
The hook provides low-level access to the Informatica EDC API.
from airflow.providers.informatica.hooks.edc import InformaticaEDCHook

hook = InformaticaEDCHook(informatica_edc_conn_id="my_connection")
# Retrieve an EDC object by its URI.
object_data = hook.get_object("edc://object/table_123")
# Create a lineage link from a source object ID to a target object ID.
result = hook.create_lineage_link("source_id", "target_id")
Plugins and Listeners
The InformaticaProviderPlugin automatically registers listeners that:
- Monitor task success events
- Extract inlet/outlet information from tasks
- Resolve object IDs using the EDC API
- Create lineage links between resolved objects
No manual intervention is required. The plugin works transparently with any task that defines inlets and outlets.
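The listener steps above can be sketched roughly as follows. This is a minimal illustration and not the plugin's actual implementation: FakeEDCHook is a hypothetical in-memory stand-in for InformaticaEDCHook, the "id" key in the object payload is an assumption, and linking every inlet to every outlet is an assumed pairing rule.

```python
from itertools import product


class FakeEDCHook:
    """Hypothetical in-memory stand-in for InformaticaEDCHook (no network calls)."""

    def __init__(self, objects):
        self.objects = objects  # URI -> object payload, as get_object might return it
        self.links = []  # (source_id, target_id) pairs recorded by create_lineage_link

    def get_object(self, uri):
        return self.objects[uri]

    def create_lineage_link(self, source_id, target_id):
        self.links.append((source_id, target_id))
        return {"source": source_id, "target": target_id}


def uri_of(entry):
    """Accept either a plain string URI or a {"dataset_uri": ...} dictionary."""
    return entry if isinstance(entry, str) else entry["dataset_uri"]


def handle_task_success(hook, inlets, outlets):
    """Resolve inlets and outlets to object IDs, then link each source to each target."""
    # Assumption: the object payload exposes its identifier under an "id" key.
    source_ids = [hook.get_object(uri_of(entry))["id"] for entry in inlets]
    target_ids = [hook.get_object(uri_of(entry))["id"] for entry in outlets]
    pairs = list(product(source_ids, target_ids))
    for source_id, target_id in pairs:
        hook.create_lineage_link(source_id, target_id)
    return pairs


hook = FakeEDCHook(
    {
        "edc://object/source_table_abc123": {"id": "source-1"},
        "edc://object/target_table_xyz789": {"id": "target-1"},
    }
)
created = handle_task_success(
    hook,
    inlets=[{"dataset_uri": "edc://object/source_table_abc123"}],
    outlets=["edc://object/target_table_xyz789"],
)
print(created)
```

Passing the hook in as an argument keeps the flow testable without an EDC server; in a real deployment the plugin's listener performs these steps for you.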
Supported Inlet/Outlet Formats
Inlets and outlets can be defined as:
String URIs:
"edc://object/table_name"
Dictionary with dataset_uri:
{"dataset_uri": "edc://object/table_name"}
The plugin automatically handles both formats and resolves them to EDC object IDs.
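To make the two formats concrete, here is a small sketch of how either one could be reduced to a plain URI string. The helper name to_edc_uri is hypothetical and is not part of the provider; rejecting any other input with a ValueError is also an assumption, not documented plugin behavior.

```python
from typing import Any


def to_edc_uri(entry: Any) -> str:
    """Return the EDC URI for an inlet/outlet given as a string or a dictionary."""
    if isinstance(entry, str):
        return entry
    if isinstance(entry, dict) and isinstance(entry.get("dataset_uri"), str):
        return entry["dataset_uri"]
    # Assumption: anything else is treated as an invalid lineage entry.
    raise ValueError(f"Unsupported inlet/outlet format: {entry!r}")


entries = ["edc://object/table_name", {"dataset_uri": "edc://object/table_name"}]
uris = [to_edc_uri(entry) for entry in entries]
print(uris)
```

Both entries reduce to the same URI string, which is why the two formats can be mixed freely within a task's inlets and outlets.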
Support
- Documentation: See the guides section for detailed usage and configuration
- Issues: Report bugs on the Apache Airflow GitHub repository
- Community: Join the Airflow community for discussions and support
License
Licensed under the Apache License, Version 2.0. See the LICENSE file for details.