airflow.providers.amazon.aws.hooks.glue

Module Contents

Classes

GlueJobHook

Interact with AWS Glue.

GlueDataQualityHook

Interact with AWS Glue Data Quality.

Attributes

DEFAULT_LOG_SUFFIX

ERROR_LOG_SUFFIX

airflow.providers.amazon.aws.hooks.glue.DEFAULT_LOG_SUFFIX = 'output'[source]
airflow.providers.amazon.aws.hooks.glue.ERROR_LOG_SUFFIX = 'error'[source]
class airflow.providers.amazon.aws.hooks.glue.GlueJobHook(s3_bucket=None, job_name=None, desc=None, concurrent_run_limit=1, script_location=None, retry_limit=0, num_of_dpus=None, iam_role_name=None, iam_role_arn=None, create_job_kwargs=None, update_config=False, job_poll_interval=6, *args, **kwargs)[source]

Bases: airflow.providers.amazon.aws.hooks.base_aws.AwsBaseHook

Interact with AWS Glue.

Provides a thick wrapper around boto3.client("glue").

Parameters
  • s3_bucket (str | None) – S3 bucket where logs and local etl script will be uploaded

  • job_name (str | None) – unique job name per AWS account

  • desc (str | None) – job description

  • concurrent_run_limit (int) – The maximum number of concurrent runs allowed for a job

  • script_location (str | None) – path to etl script on s3

  • retry_limit (int) – Maximum number of times to retry this job if it fails

  • num_of_dpus (int | float | None) – Number of AWS Glue DPUs to allocate to this Job

  • region_name – aws region name (example: us-east-1)

  • iam_role_name (str | None) – AWS IAM Role for Glue Job Execution. If set, iam_role_arn must be None.

  • iam_role_arn (str | None) – AWS IAM Role ARN for Glue Job Execution. If set, iam_role_name must be None.

  • create_job_kwargs (dict | None) – Extra arguments for Glue Job Creation

  • update_config (bool) – Update job configuration on Glue (default: False)

Additional arguments (such as aws_conn_id) may be specified and are passed down to the underlying AwsBaseHook.
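The mutual exclusion between iam_role_name and iam_role_arn described above can be sketched in plain Python. This is illustrative only: resolve_glue_role and the account id are hypothetical, not part of the hook's API.

```python
def resolve_glue_role(iam_role_name=None, iam_role_arn=None):
    """Mirror the documented rule: at most one of the two role arguments may be set."""
    if iam_role_name and iam_role_arn:
        raise ValueError("Set only one of iam_role_name or iam_role_arn")
    if iam_role_arn:
        return iam_role_arn
    # Hypothetical account id, used only to illustrate what a resolved ARN looks like.
    return f"arn:aws:iam::123456789012:role/{iam_role_name}" if iam_role_name else None

print(resolve_glue_role(iam_role_name="GlueExecutionRole"))
```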

class LogContinuationTokens[source]

Used to hold the continuation tokens when reading logs from both streams Glue Jobs write to.

create_glue_job_config()[source]
list_jobs()[source]

Get list of Jobs.

get_iam_execution_role()[source]
initialize_job(script_arguments=None, run_kwargs=None)[source]

Initialize connection with AWS Glue to run job.

get_job_state(job_name, run_id)[source]

Get state of the Glue job; the job state can be running, finished, failed, stopped or timeout.

Parameters
  • job_name (str) – unique job name per AWS account

  • run_id (str) – The job-run ID of the predecessor job run

Returns

State of the Glue job

Return type

str

async async_get_job_state(job_name, run_id)[source]

Get state of the Glue job; the job state can be running, finished, failed, stopped or timeout.

The async version of get_job_state.

logs_hook()[source]

Returns an AwsLogsHook instantiated with the parameters of the GlueJobHook.

print_job_logs(job_name, run_id, continuation_tokens)[source]

Print the latest job logs to the Airflow task log and update the continuation tokens.

Parameters

continuation_tokens (LogContinuationTokens) – the tokens indicating where to resume reading logs. This method updates the object with the new tokens.
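The continuation-token pattern can be sketched with a fake, paginated log source. The real hook reads from the two CloudWatch streams a Glue job writes to; read_stream and its integer tokens below are stand-ins for that paging, not the actual API.

```python
class LogContinuationTokens:
    """Holds one token per stream, as the hook's inner class does."""
    def __init__(self):
        self.output_stream_continuation = None
        self.error_stream_continuation = None

def read_stream(events, token):
    """Return events after `token` plus the new token (a stand-in for CloudWatch paging)."""
    start = token or 0
    batch = events[start:start + 2]  # pretend the page size is 2
    return batch, start + len(batch)

tokens = LogContinuationTokens()
events = ["line1", "line2", "line3"]
seen = []
while True:
    # Resume from the stored token; the token is updated on every call,
    # which is exactly what print_job_logs does between polls.
    batch, tokens.output_stream_continuation = read_stream(
        events, tokens.output_stream_continuation
    )
    if not batch:
        break
    seen.extend(batch)

print(seen)
```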

job_completion(job_name, run_id, verbose=False, sleep_before_return=0)[source]

Wait until the Glue job with job_name finishes; return the final state if it finishes, or raise an AirflowException.

Parameters
  • job_name (str) – unique job name per AWS account

  • run_id (str) – The job-run ID of the predecessor job run

  • verbose (bool) – If True, more Glue Job Run logs show in the Airflow Task Logs. (default: False)

  • sleep_before_return (int) – time in seconds to wait before returning final status.

Returns

Dict of JobRunState and JobRunId

Return type

dict[str, str]
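What job_completion does can be sketched as a polling loop over get_job_state. The terminal-state set and the fake state sequence below are assumptions for illustration; the exact state strings come from Glue, and the real method also prints logs when verbose is True.

```python
import itertools
import time

# Assumed terminal set for this sketch; see Glue's job-run states for the real list.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"}

def wait_for_job(get_state, job_name, run_id, poll_interval=0.01):
    """Poll until a terminal state, mimicking job_completion's loop."""
    while True:
        state = get_state(job_name, run_id)
        if state in TERMINAL_STATES:
            if state != "SUCCEEDED":
                raise RuntimeError(f"Glue job run ended in state {state}")
            return {"JobRunState": state, "JobRunId": run_id}
        time.sleep(poll_interval)

# A fake state sequence standing in for get_job_state calls against AWS.
states = itertools.chain(["RUNNING", "RUNNING"], itertools.repeat("SUCCEEDED"))
result = wait_for_job(lambda name, run_id: next(states), "etl-job", "jr_123")
print(result)
```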

async async_job_completion(job_name, run_id, verbose=False)[source]

Wait until the Glue job with job_name finishes; return the final state if it finishes, or raise an AirflowException.

Parameters
  • job_name (str) – unique job name per AWS account

  • run_id (str) – The job-run ID of the predecessor job run

  • verbose (bool) – If True, more Glue Job Run logs show in the Airflow Task Logs. (default: False)

Returns

Dict of JobRunState and JobRunId

Return type

dict[str, str]

has_job(job_name)[source]

Check if the job already exists.

Parameters

job_name – unique job name per AWS account

Returns

Returns True if the job already exists and False if not.

Return type

bool

update_job(**job_kwargs)[source]

Update job configurations.

Parameters

job_kwargs – Keyword args that define the configurations used for the job

Returns

True if the job was updated and False otherwise

Return type

bool

get_or_create_glue_job()[source]

Get (or create) the job and return its name.

Returns

Name of the Job
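The get-or-create behavior can be sketched against any object with a has_job-style check. FakeHook below is illustrative; the real hook creates the job from its configured parameters, and create_or_update_glue_job additionally applies update_config.

```python
class FakeHook:
    """Stand-in exposing only the subset of hook behavior this sketch needs."""
    def __init__(self, existing):
        self.jobs = set(existing)
        self.created = []

    def has_job(self, job_name):
        return job_name in self.jobs

    def create_job(self, job_name):
        self.jobs.add(job_name)
        self.created.append(job_name)

def get_or_create(hook, job_name):
    """Create the job only when it does not already exist, then return its name."""
    if not hook.has_job(job_name):
        hook.create_job(job_name)
    return job_name

hook = FakeHook(existing=["nightly-etl"])
print(get_or_create(hook, "nightly-etl"))   # already exists: no create call
print(get_or_create(hook, "backfill-etl"))  # absent: created first
```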

create_or_update_glue_job()[source]

Create (or update) the job and return its name.

Returns

Name of the Job

class airflow.providers.amazon.aws.hooks.glue.GlueDataQualityHook(*args, **kwargs)[source]

Bases: airflow.providers.amazon.aws.hooks.base_aws.AwsBaseHook

Interact with AWS Glue Data Quality.

Provides a thick wrapper around boto3.client("glue").

Additional arguments (such as aws_conn_id) may be specified and are passed down to the underlying AwsBaseHook.

has_data_quality_ruleset(name)[source]
get_evaluation_run_results(run_id)[source]
validate_evaluation_run_results(evaluation_run_id, show_results=True, verify_result_status=True)[source]
log_recommendation_results(run_id)[source]

Print the outcome of a recommendation run. A recommendation run generates multiple rules against a data source (a Glue table) in Data Quality Definition Language (DQDL) format, for example:

Rules = [ IsComplete "NAME", ColumnLength "EMP_ID" between 1 and 12, IsUnique "EMP_ID", ColumnValues "INCOME" > 50000 ]
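A hedged sketch of what checking evaluation-run results might look like, assuming each rule result carries a Name and a PASS/FAIL Result. The exact response shape comes from the Glue API, and check_rule_results is a hypothetical stand-in for validate_evaluation_run_results, not its implementation.

```python
def check_rule_results(rule_results, verify_result_status=True):
    """Collect failed rules; optionally raise, as verify_result_status suggests."""
    failed = [r["Name"] for r in rule_results if r.get("Result") != "PASS"]
    if verify_result_status and failed:
        raise RuntimeError(f"Data quality rules failed: {failed}")
    return failed

results = [
    {"Name": 'IsComplete "NAME"', "Result": "PASS"},
    {"Name": 'IsUnique "EMP_ID"', "Result": "PASS"},
]
print(check_rule_results(results))  # → []
```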
