airflow.providers.amazon.aws.operators.athena

Module Contents

Classes

AthenaOperator

An operator that submits a Trino/Presto query to Amazon Athena.

class airflow.providers.amazon.aws.operators.athena.AthenaOperator(*, query, database, output_location=None, client_request_token=None, workgroup='primary', query_execution_context=None, result_configuration=None, sleep_time=30, max_polling_attempts=None, log_query=True, deferrable=conf.getboolean('operators', 'default_deferrable', fallback=False), catalog='AwsDataCatalog', **kwargs)[source]

Bases: airflow.providers.amazon.aws.operators.base_aws.AwsBaseOperator[airflow.providers.amazon.aws.hooks.athena.AthenaHook]

An operator that submits a Trino/Presto query to Amazon Athena.

Note

if the task is killed while it runs, it’ll cancel the athena query that was launched, EXCEPT if running in deferrable mode.

See also

For more information on how to use this operator, take a look at the guide: Run a query in Amazon Athena

Parameters
  • query (str) – Trino/Presto query to be run on Amazon Athena. (templated)

  • database (str) – Database to select. (templated)

  • catalog (str) – Catalog to select. (templated)

  • output_location (str | None) – s3 path to write the query results into. (templated) To run the query, you must specify the query results location using one of the ways: either for individual queries using either this setting (client-side), or in the workgroup, using WorkGroupConfiguration. If none of them is set, Athena issues an error that no output location is provided

  • client_request_token (str | None) – Unique token created by user to avoid multiple executions of same query

  • workgroup (str) – Athena workgroup in which query will be run. (templated)

  • query_execution_context (dict[str, str] | None) – Context in which query need to be run

  • result_configuration (dict[str, Any] | None) – Dict with path to store results in and config related to encryption

  • sleep_time (int) – Time (in seconds) to wait between two consecutive calls to check query status on Athena

  • max_polling_attempts (int | None) – Number of times to poll for query state before function exits To limit task execution time, use execution_timeout.

  • log_query (bool) – Whether to log athena query and other execution params when it’s executed. Defaults to True.

  • aws_conn_id – The Airflow connection used for AWS credentials. If this is None or empty then the default boto3 behaviour is used. If running Airflow in a distributed manner and aws_conn_id is None or empty, then default boto3 configuration would be used (and must be maintained on each worker node).

  • region_name – AWS region_name. If not specified then the default boto3 behaviour is used.

  • verify – Whether or not to verify SSL certificates. See: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/core/session.html

  • botocore_config – Configuration dictionary (key-values) for botocore client. See: https://botocore.amazonaws.com/v1/documentation/api/latest/reference/config.html

aws_hook_class[source]
ui_color = '#44b5e2'[source]
template_fields: collections.abc.Sequence[str][source]
template_ext: collections.abc.Sequence[str] = ('.sql',)[source]
template_fields_renderers[source]
execute(context)[source]

Run Trino/Presto Query on Amazon Athena.

execute_complete(context, event=None)[source]
on_kill()[source]

Cancel the submitted Amazon Athena query.

get_openlineage_facets_on_complete(_)[source]

Retrieve OpenLineage data by parsing SQL queries and enriching them with Athena API.

In addition to CTAS query, query and calculation results are stored in S3 location. For that reason additional output is attached with this location. Instead of using the complete path where the results are saved (user’s prefix + some UUID), we are creating a dataset with the user-provided path only. This should make it easier to match this dataset across different processes.

get_openlineage_dataset(database, table)[source]

Was this entry helpful?