airflow.providers.amazon.aws.hooks.glue¶

Module Contents¶

Classes¶

GlueJobHook — Interact with AWS Glue.
GlueDataQualityHook — Interact with AWS Glue Data Quality.
- class airflow.providers.amazon.aws.hooks.glue.GlueJobHook(s3_bucket=None, job_name=None, desc=None, concurrent_run_limit=1, script_location=None, retry_limit=0, num_of_dpus=None, iam_role_name=None, iam_role_arn=None, create_job_kwargs=None, update_config=False, job_poll_interval=6, *args, **kwargs)[source]¶
Bases: airflow.providers.amazon.aws.hooks.base_aws.AwsBaseHook

Interact with AWS Glue.

Provide a thick wrapper around boto3.client("glue").

- Parameters
s3_bucket (str | None) – S3 bucket where logs and the local ETL script will be uploaded
job_name (str | None) – unique job name per AWS account
desc (str | None) – job description
concurrent_run_limit (int) – the maximum number of concurrent runs allowed for the job
script_location (str | None) – path to the ETL script on S3
retry_limit (int) – maximum number of times to retry this job if it fails
num_of_dpus (int | float | None) – number of AWS Glue DPUs to allocate to this job
region_name – AWS region name (example: us-east-1)
iam_role_name (str | None) – AWS IAM role name for Glue job execution. If set, iam_role_arn must be None.
iam_role_arn (str | None) – AWS IAM role ARN for Glue job execution. If set, iam_role_name must be None.
create_job_kwargs (dict | None) – extra arguments for Glue job creation
update_config (bool) – update the job configuration on Glue (default: False)
Additional arguments (such as aws_conn_id) may be specified and are passed down to the underlying AwsBaseHook.

- class LogContinuationTokens[source]¶
Used to hold the continuation tokens when reading logs from both streams Glue Jobs write to.
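The token-threading idea can be sketched without AWS access. Below is a minimal stand-in, assuming one continuation field per log stream (the field names and the paginated-read helper are illustrative, not the hook's actual implementation):

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Stand-in for GlueJobHook.LogContinuationTokens: Glue jobs write to two
# CloudWatch log streams (regular output and error/progress), so one token
# is tracked per stream. Field names here are assumptions for illustration.
@dataclass
class ContinuationTokens:
    output_stream_continuation: Optional[str] = None
    error_stream_continuation: Optional[str] = None

def read_page(pages: List[List[str]], token: Optional[str]) -> Tuple[List[str], Optional[str]]:
    """Simulate one paginated log read: return (events, next_token)."""
    index = int(token) if token is not None else 0
    if index >= len(pages):
        return [], token  # no new events; token unchanged
    return pages[index], str(index + 1)

def drain_output_stream(pages: List[List[str]], tokens: ContinuationTokens) -> List[str]:
    """Read all currently available pages, updating the token in place so a
    later call resumes where this one stopped."""
    events: List[str] = []
    while True:
        page, nxt = read_page(pages, tokens.output_stream_continuation)
        if not page:
            return events
        events.extend(page)
        tokens.output_stream_continuation = nxt
```

Because the tokens object is mutated in place, the same instance can be passed to repeated calls while a job is running, and each call only prints logs that arrived since the previous one.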
- initialize_job(script_arguments=None, run_kwargs=None)[source]¶
Initialize a connection with AWS Glue and run the job.
- get_job_state(job_name, run_id)[source]¶
Get the state of the Glue job; the job state can be running, finished, failed, stopped, or timeout.
- async async_get_job_state(job_name, run_id)[source]¶
Get the state of the Glue job; the job state can be running, finished, failed, stopped, or timeout.
The async version of get_job_state.
- print_job_logs(job_name, run_id, continuation_tokens)[source]¶
Print the latest job logs to the Airflow task log and update the continuation tokens.
- Parameters
continuation_tokens (LogContinuationTokens) – the tokens where to resume from when reading logs. The object gets updated with the new tokens by this method.
- job_completion(job_name, run_id, verbose=False, sleep_before_return=0)[source]¶
Wait until the Glue job with job_name finishes; return the final state if it finishes, or raise AirflowException.
- Returns
Dict of JobRunState and JobRunId
- Return type
dict[str, str]
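The waiting behavior can be sketched as a polling loop against a stubbed state lookup. This is an illustration under stated assumptions, not the hook's actual implementation: the terminal-state set mirrors the states named in get_job_state above, and a RuntimeError stands in for AirflowException.

```python
import time

# Terminal Glue job-run states, per the get_job_state description above.
FINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"}

def wait_for_completion(get_state, job_name, run_id, poll_interval=0):
    """Poll get_state(job_name, run_id) until a terminal state is reached.

    Returns a dict of JobRunState and JobRunId, mirroring the documented
    return shape; raises RuntimeError on a non-success terminal state
    (the real hook raises AirflowException).
    """
    while True:
        state = get_state(job_name, run_id)
        if state in FINAL_STATES:
            if state != "SUCCEEDED":
                raise RuntimeError(f"Glue job run {run_id} exited with state {state}")
            return {"JobRunState": state, "JobRunId": run_id}
        time.sleep(poll_interval)

# A stubbed state sequence standing in for repeated Glue lookups:
states = iter(["STARTING", "RUNNING", "RUNNING", "SUCCEEDED"])
result = wait_for_completion(lambda name, rid: next(states), "my_job", "jr_123")
```

The optional sleep_before_return parameter on the real method would add one extra pause after the terminal state is observed; it is omitted here for brevity.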
- async async_job_completion(job_name, run_id, verbose=False)[source]¶
Wait until the Glue job with job_name finishes; return the final state if it finishes, or raise AirflowException.
- has_job(job_name)[source]¶
Check if the job already exists.
- Parameters
job_name – unique job name per AWS account
- Returns
True if the job already exists and False if not.
- Return type
bool
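A sketch of this existence check against a fake client, assuming the lookup raises a not-found error for missing jobs. The exception class and client below are stand-ins for the boto3 objects, not the hook's real code:

```python
class EntityNotFoundException(Exception):
    """Stand-in for the not-found error a Glue job lookup raises."""

class FakeGlueClient:
    """Minimal stub mimicking boto3's glue.get_job lookup behavior."""
    def __init__(self, jobs):
        self._jobs = set(jobs)

    def get_job(self, JobName):
        if JobName not in self._jobs:
            raise EntityNotFoundException(JobName)
        return {"Job": {"Name": JobName}}

def has_job(client, job_name):
    """Return True if the Glue job exists, False if the lookup raises not-found."""
    try:
        client.get_job(JobName=job_name)
        return True
    except EntityNotFoundException:
        return False

client = FakeGlueClient(jobs={"etl_job"})
```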
- update_job(**job_kwargs)[source]¶
Update job configurations.
- Parameters
job_kwargs – Keyword args that define the configurations used for the job
- Returns
True if the job was updated and False otherwise
- Return type
bool
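The update-only-when-changed behavior implied by the return value can be sketched against a fake client. The client is a stand-in for boto3's get_job/update_job calls, and the field-by-field comparison is an assumption about how such a check could work, not the hook's exact implementation:

```python
class FakeGlueClient:
    """Minimal stub of the two Glue calls an update flow would touch."""
    def __init__(self, config):
        self.config = dict(config)
        self.update_calls = 0

    def get_job(self, JobName):
        return {"Job": dict(self.config, Name=JobName)}

    def update_job(self, JobName, JobUpdate):
        self.config.update(JobUpdate)
        self.update_calls += 1

def update_job_if_changed(client, job_name, **job_kwargs):
    """Issue an update only when a desired setting differs from the current
    config; return True if an update was issued, False otherwise."""
    current = client.get_job(JobName=job_name)["Job"]
    if all(current.get(k) == v for k, v in job_kwargs.items()):
        return False
    client.update_job(JobName=job_name, JobUpdate=job_kwargs)
    return True

client = FakeGlueClient({"MaxRetries": 0})
```

Skipping the remote call when nothing changed keeps repeated DAG runs from issuing no-op Glue API requests.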
- class airflow.providers.amazon.aws.hooks.glue.GlueDataQualityHook(*args, **kwargs)[source]¶
Bases: airflow.providers.amazon.aws.hooks.base_aws.AwsBaseHook

Interact with AWS Glue Data Quality.

Provide a thick wrapper around boto3.client("glue").

Additional arguments (such as aws_conn_id) may be specified and are passed down to the underlying AwsBaseHook.

- validate_evaluation_run_results(evaluation_run_id, show_results=True, verify_result_status=True)[source]¶
- log_recommendation_results(run_id)[source]¶
Print the outcome of a recommendation run. A recommendation run generates multiple rules against a data source (Glue table) in Data Quality Definition Language (DQDL) format.
Rules = [ IsComplete "NAME", ColumnLength "EMP_ID" between 1 and 12, IsUnique "EMP_ID", ColumnValues "INCOME" > 50000 ]
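For illustration, a DQDL ruleset like the one above is held as a plain string on the Python side; Glue itself does the parsing. The naive splitter below only demonstrates the shape of the rule list and is not a real DQDL parser:

```python
# The DQDL ruleset shown above, as a plain Python string.
RULESET = (
    'Rules = [ IsComplete "NAME", ColumnLength "EMP_ID" between 1 and 12, '
    'IsUnique "EMP_ID", ColumnValues "INCOME" > 50000 ]'
)

def count_rules(ruleset: str) -> int:
    """Count comma-separated rules inside the Rules = [ ... ] block.

    Illustrative only: assumes no rule contains a comma or nested brackets,
    which real DQDL rules may; Glue performs the authoritative parsing.
    """
    body = ruleset.split("[", 1)[1].rsplit("]", 1)[0]
    return len([r for r in body.split(",") if r.strip()])
```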