airflow.providers.amazon.aws.operators.athena
Module Contents
Classes
AthenaOperator – An operator that submits a Trino/Presto query to Amazon Athena.
- class airflow.providers.amazon.aws.operators.athena.AthenaOperator(*, query, database, output_location=None, client_request_token=None, workgroup='primary', query_execution_context=None, result_configuration=None, sleep_time=30, max_polling_attempts=None, log_query=True, deferrable=conf.getboolean('operators', 'default_deferrable', fallback=False), catalog='AwsDataCatalog', **kwargs)
Bases: airflow.providers.amazon.aws.operators.base_aws.AwsBaseOperator[airflow.providers.amazon.aws.hooks.athena.AthenaHook]

An operator that submits a Trino/Presto query to Amazon Athena.
Note
If the task is killed while it runs, it will cancel the Athena query that was launched, except when running in deferrable mode.
See also
For more information on how to use this operator, take a look at the guide: Run a query in Amazon Athena
- Parameters
query (str) – Trino/Presto query to be run on Amazon Athena. (templated)
database (str) – Database to select. (templated)
catalog (str) – Catalog to select. (templated)
output_location (str | None) – S3 path to write the query results into. (templated) To run the query, you must specify the query results location in one of two ways: either per query with this setting (client-side), or in the workgroup, using WorkGroupConfiguration. If neither is set, Athena raises an error that no output location is provided.
client_request_token (str | None) – Unique token created by the user to avoid multiple executions of the same query
workgroup (str) – Athena workgroup in which the query will be run. (templated)
query_execution_context (dict[str, str] | None) – Context in which the query needs to be run
result_configuration (dict[str, Any] | None) – Dict with the path to store results in and encryption-related configuration
sleep_time (int) – Time (in seconds) to wait between two consecutive calls to check query status on Athena
max_polling_attempts (int | None) – Number of times to poll for query state before the function exits. To limit task execution time, use execution_timeout.
log_query (bool) – Whether to log the Athena query and other execution parameters when it is executed. Defaults to True.
aws_conn_id – The Airflow connection used for AWS credentials. If this is None or empty, then the default boto3 behaviour is used. If running Airflow in a distributed manner and aws_conn_id is None or empty, then the default boto3 configuration would be used (and must be maintained on each worker node).
region_name – AWS region_name. If not specified, then the default boto3 behaviour is used.
verify – Whether or not to verify SSL certificates. See: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/core/session.html
botocore_config – Configuration dictionary (key-values) for botocore client. See: https://botocore.amazonaws.com/v1/documentation/api/latest/reference/config.html
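A minimal usage sketch follows; the DAG id, database, workgroup, S3 bucket and connection id are illustrative assumptions, not values mandated by the operator.

from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.athena import AthenaOperator

# Illustrative names only: substitute your own database, bucket and connection.
with DAG(
    dag_id="example_athena_query",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    run_query = AthenaOperator(
        task_id="run_athena_query",
        query="SELECT * FROM my_table LIMIT 10",
        database="my_database",
        output_location="s3://my-athena-results-bucket/queries/",
        workgroup="primary",
        sleep_time=30,
        # In deferrable mode a killed task will NOT cancel the running query.
        deferrable=False,
        aws_conn_id="aws_default",
    )

Because no result_configuration is passed here, the result location comes from output_location or from the workgroup's WorkGroupConfiguration.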
- template_fields: collections.abc.Sequence[str]
- template_ext: collections.abc.Sequence[str] = ('.sql',)
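Because '.sql' is listed in template_ext and query is a templated field, the query can also be supplied as a path to a .sql file, which is rendered with Jinja before execution. A short sketch, assuming a hypothetical sql/top_rows.sql file inside the DAG's template search path:

from airflow.providers.amazon.aws.operators.athena import AthenaOperator

# sql/top_rows.sql (hypothetical) might contain:
#   SELECT * FROM my_table WHERE ds = '{{ ds }}' LIMIT 10
run_templated_query = AthenaOperator(
    task_id="run_templated_query",
    query="sql/top_rows.sql",  # resolved and rendered because of template_ext=('.sql',)
    database="my_database",  # database is templated as well
    output_location="s3://my-athena-results-bucket/queries/",
)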
- get_openlineage_facets_on_complete(_)
Retrieve OpenLineage data by parsing SQL queries and enriching them with Athena API.
In addition to the CTAS query, query and calculation results are stored in an S3 location. For that reason, an additional output is attached with this location. Instead of using the complete path where the results are saved (the user's prefix plus a UUID), we create a dataset with the user-provided path only. This should make it easier to match this dataset across different processes.