airflow.providers.amazon.aws.operators.glue¶
Module Contents¶
Classes¶
- GlueJobOperator – Create an AWS Glue Job.
- GlueDataQualityOperator – Creates a data quality ruleset with DQDL rules applied to a specified Glue table.
- GlueDataQualityRuleSetEvaluationRunOperator – Evaluate a ruleset against a data source (Glue table).
- GlueDataQualityRuleRecommendationRunOperator – Starts a recommendation run used to generate rules; Glue Data Quality analyzes the data and recommends a potential ruleset.
- class airflow.providers.amazon.aws.operators.glue.GlueJobOperator(*, job_name='aws_glue_default_job', job_desc='AWS Glue Job with Airflow', script_location=None, concurrent_run_limit=None, script_args=None, retry_limit=0, num_of_dpus=None, aws_conn_id='aws_default', region_name=None, s3_bucket=None, iam_role_name=None, iam_role_arn=None, create_job_kwargs=None, run_job_kwargs=None, wait_for_completion=True, deferrable=conf.getboolean('operators', 'default_deferrable', fallback=False), verbose=False, replace_script_file=False, update_config=False, job_poll_interval=6, stop_job_run_on_kill=False, sleep_before_return=0, **kwargs)[source]¶
Bases: airflow.models.BaseOperator
Create an AWS Glue Job.
AWS Glue is a serverless Spark ETL service for running Spark Jobs on the AWS cloud. Language support: Python and Scala.
See also
For more information on how to use this operator, take a look at the guide: Submit an AWS Glue job
- Parameters
job_name (str) – unique job name per AWS Account
script_location (str | None) – location of ETL script. Must be a local or S3 path
job_desc (str) – job description details
concurrent_run_limit (int | None) – The maximum number of concurrent runs allowed for a job
script_args (dict | None) – etl script arguments and AWS Glue arguments (templated)
retry_limit (int) – The maximum number of times to retry this job if it fails
num_of_dpus (int | float | None) – Number of AWS Glue DPUs to allocate to this Job.
region_name (str | None) – aws region name (example: us-east-1)
s3_bucket (str | None) – S3 bucket where logs and local etl script will be uploaded
iam_role_name (str | None) – AWS IAM Role for Glue Job Execution. If set iam_role_arn must equal None.
iam_role_arn (str | None) – AWS IAM ARN for Glue Job Execution. If set iam_role_name must equal None.
create_job_kwargs (dict | None) – Extra arguments for Glue Job Creation
run_job_kwargs (dict | None) – Extra arguments for Glue Job Run
wait_for_completion (bool) – Whether to wait for job run completion. (default: True)
deferrable (bool) – If True, the operator will wait asynchronously for the job to complete. This implies waiting for completion. This mode requires aiobotocore module to be installed. (default: False)
verbose (bool) – If True, Glue Job Run logs show in the Airflow Task Logs. (default: False)
update_config (bool) – If True, Operator will update job configuration. (default: False)
replace_script_file (bool) – If True, the script file will be replaced in S3. (default: False)
stop_job_run_on_kill (bool) – If True, Operator will stop the job run when task is killed.
sleep_before_return (int) – Time in seconds to wait before returning the final status. This matters when limiting concurrency: Glue needs 5-10 seconds to clean up resources after a run, so returning the status immediately can result in more than one concurrent run. Setting this parameter to 10 is recommended when using concurrency=1. For more information see: https://repost.aws/questions/QUaKgpLBMPSGWO0iq2Fob_bw/glue-run-concurrent-jobs#ANFpCL2fRnQRqgDFuIU_rpvA
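As a sketch of how the parameters above fit together, the following hypothetical DAG submits a Glue job from a script uploaded to S3. The bucket, script path, job name, and IAM role are placeholder assumptions, not values from this reference:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator

with DAG(
    dag_id="example_glue_job",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    # Placeholder S3 locations and IAM role name; substitute your own resources.
    submit_glue_job = GlueJobOperator(
        task_id="submit_glue_job",
        job_name="my_etl_job",
        script_location="s3://my-bucket/scripts/etl.py",
        s3_bucket="my-bucket",
        iam_role_name="MyGlueExecutionRole",
        # Templated arguments passed through to the ETL script.
        script_args={"--input_path": "s3://my-bucket/input/"},
        # Extra arguments forwarded to the Glue CreateJob API call.
        create_job_kwargs={
            "GlueVersion": "4.0",
            "NumberOfWorkers": 2,
            "WorkerType": "G.1X",
        },
        wait_for_completion=True,
        # Recommended when limiting concurrency to 1 (see sleep_before_return).
        sleep_before_return=10,
    )
```

Note that iam_role_name and iam_role_arn are mutually exclusive; set only one of them.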
- class airflow.providers.amazon.aws.operators.glue.GlueDataQualityOperator(*, name, ruleset, description='AWS Glue Data Quality Rule Set With Airflow', update_rule_set=False, data_quality_ruleset_kwargs=None, aws_conn_id='aws_default', **kwargs)[source]¶
Bases: airflow.providers.amazon.aws.operators.base_aws.AwsBaseOperator[airflow.providers.amazon.aws.hooks.glue.GlueDataQualityHook]
Creates a data quality ruleset with DQDL rules applied to a specified Glue table.
See also
For more information on how to use this operator, take a look at the guide: Create an AWS Glue Data Quality Ruleset
- Parameters
name (str) – A unique name for the data quality ruleset.
ruleset (str) – A Data Quality Definition Language (DQDL) ruleset. For more information, see the Glue developer guide.
description (str) – A description of the data quality ruleset.
update_rule_set (bool) – Set to True to update an existing ruleset. (default: False)
data_quality_ruleset_kwargs (dict | None) – Extra arguments for RuleSet.
aws_conn_id (str | None) – The Airflow connection used for AWS credentials. If this is None or empty, the default boto3 behaviour is used. If running Airflow in a distributed manner and aws_conn_id is None or empty, the default boto3 configuration is used (and must be maintained on each worker node).
region_name – AWS region_name. If not specified, the default boto3 behaviour is used.
verify – Whether or not to verify SSL certificates. See: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/core/session.html
botocore_config – Configuration dictionary (key-values) for the botocore client. See: https://botocore.amazonaws.com/v1/documentation/api/latest/reference/config.html
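A minimal sketch of creating a ruleset with this operator. The DQDL rules, table, and database names are illustrative assumptions; consult the Glue developer guide for the full DQDL grammar:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.glue import GlueDataQualityOperator

# Hypothetical DQDL ruleset checking row count, completeness, and uniqueness.
SAMPLE_RULESET = """Rules = [
    RowCount > 0,
    IsComplete "order_id",
    Uniqueness "order_id" > 0.95
]"""

with DAG(
    dag_id="example_glue_data_quality",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    create_ruleset = GlueDataQualityOperator(
        task_id="create_ruleset",
        name="orders_ruleset",
        ruleset=SAMPLE_RULESET,
        description="Basic completeness and uniqueness checks",
        # Extra arguments forwarded to the CreateDataQualityRuleset API call,
        # here associating the ruleset with a Glue table.
        data_quality_ruleset_kwargs={
            "TargetTable": {"TableName": "orders", "DatabaseName": "sales_db"}
        },
    )
```

Setting update_rule_set=True would instead update the ruleset if one with the same name already exists.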
- class airflow.providers.amazon.aws.operators.glue.GlueDataQualityRuleSetEvaluationRunOperator(*, datasource, role, rule_set_names, number_of_workers=5, timeout=2880, verify_result_status=True, show_results=True, rule_set_evaluation_run_kwargs=None, wait_for_completion=True, waiter_delay=60, waiter_max_attempts=20, deferrable=conf.getboolean('operators', 'default_deferrable', fallback=False), aws_conn_id='aws_default', **kwargs)[source]¶
Bases: airflow.providers.amazon.aws.operators.base_aws.AwsBaseOperator[airflow.providers.amazon.aws.hooks.glue.GlueDataQualityHook]
Evaluate a ruleset against a data source (Glue table).
See also
For more information on how to use this operator, take a look at the guide: Start an AWS Glue Data Quality Evaluation Run
- Parameters
datasource (dict) – The data source (Glue table) associated with this run. (templated)
role (str) – IAM role supplied for job execution. (templated)
rule_set_names (list[str]) – A list of ruleset names for evaluation. (templated)
number_of_workers (int) – The number of G.1X workers to be used in the run. (default: 5)
timeout (int) – The timeout for a run in minutes. This is the maximum time that a run can consume resources before it is terminated and enters TIMEOUT status. (default: 2,880)
verify_result_status (bool) – Validate all rule results of the evaluation run; if any rule has status Fail or Error, an exception is raised. (default: True)
show_results (bool) – Displays all the ruleset rules evaluation run results. (default: True)
rule_set_evaluation_run_kwargs (dict[str, Any] | None) – Extra arguments for evaluation run. (templated)
wait_for_completion (bool) – Whether to wait for job to stop. (default: True)
waiter_delay (int) – Time in seconds to wait between status checks. (default: 60)
waiter_max_attempts (int) – Maximum number of attempts to check for job completion. (default: 20)
deferrable (bool) – If True, the operator will wait asynchronously for the job to stop. This implies waiting for completion. This mode requires aiobotocore module to be installed. (default: False)
aws_conn_id (str | None) – The Airflow connection used for AWS credentials. If this is None or empty, the default boto3 behaviour is used. If running Airflow in a distributed manner and aws_conn_id is None or empty, the default boto3 configuration is used (and must be maintained on each worker node).
region_name – AWS region_name. If not specified, the default boto3 behaviour is used.
verify – Whether or not to verify SSL certificates. See: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/core/session.html
botocore_config – Configuration dictionary (key-values) for the botocore client. See: https://botocore.amazonaws.com/v1/documentation/api/latest/reference/config.html
- template_fields: Sequence[str] = ('datasource', 'role', 'rule_set_names', 'rule_set_evaluation_run_kwargs')[source]¶
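As a hedged sketch, the following hypothetical task evaluates an existing ruleset against a Glue table. The database, table, ruleset, and role names are assumptions; the datasource dict follows the boto3 DataSource shape (a GlueTable entry):

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.glue import (
    GlueDataQualityRuleSetEvaluationRunOperator,
)

with DAG(
    dag_id="example_glue_dq_evaluation",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    evaluate_ruleset = GlueDataQualityRuleSetEvaluationRunOperator(
        task_id="evaluate_ruleset",
        # Glue table to evaluate, in the boto3 DataSource format.
        datasource={
            "GlueTable": {"DatabaseName": "sales_db", "TableName": "orders"}
        },
        role="MyGlueDataQualityRole",  # placeholder IAM role
        rule_set_names=["orders_ruleset"],
        verify_result_status=True,  # raise if any rule is Fail or Error
        show_results=True,          # log the per-rule results
        wait_for_completion=True,
    )
```

With deferrable=True the task would instead release its worker slot while polling, at the cost of requiring aiobotocore.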
- class airflow.providers.amazon.aws.operators.glue.GlueDataQualityRuleRecommendationRunOperator(*, datasource, role, number_of_workers=5, timeout=2880, show_results=True, recommendation_run_kwargs=None, wait_for_completion=True, waiter_delay=60, waiter_max_attempts=20, deferrable=conf.getboolean('operators', 'default_deferrable', fallback=False), aws_conn_id='aws_default', **kwargs)[source]¶
Bases: airflow.providers.amazon.aws.operators.base_aws.AwsBaseOperator[airflow.providers.amazon.aws.hooks.glue.GlueDataQualityHook]
Starts a recommendation run used to generate rules; Glue Data Quality analyzes the data and recommends a potential ruleset.
Recommendation runs are automatically deleted after 90 days.
See also
For more information on how to use this operator, take a look at the guide: Start an AWS Glue Data Quality Recommendation Run
- Parameters
datasource (dict) – The data source (Glue table) associated with this run. (templated)
role (str) – IAM role supplied for job execution. (templated)
number_of_workers (int) – The number of G.1X workers to be used in the run. (default: 5)
timeout (int) – The timeout for a run in minutes. This is the maximum time that a run can consume resources before it is terminated and enters TIMEOUT status. (default: 2,880)
show_results (bool) – Display the recommended ruleset (a set of rules) when the recommendation run completes. (default: True)
recommendation_run_kwargs (dict[str, Any] | None) – Extra arguments for recommendation run. (templated)
wait_for_completion (bool) – Whether to wait for job to stop. (default: True)
waiter_delay (int) – Time in seconds to wait between status checks. (default: 60)
waiter_max_attempts (int) – Maximum number of attempts to check for job completion. (default: 20)
deferrable (bool) – If True, the operator will wait asynchronously for the job to stop. This implies waiting for completion. This mode requires aiobotocore module to be installed. (default: False)
aws_conn_id (str | None) – The Airflow connection used for AWS credentials. If this is None or empty, the default boto3 behaviour is used. If running Airflow in a distributed manner and aws_conn_id is None or empty, the default boto3 configuration is used (and must be maintained on each worker node).
region_name – AWS region_name. If not specified, the default boto3 behaviour is used.
verify – Whether or not to verify SSL certificates. See: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/core/session.html
botocore_config – Configuration dictionary (key-values) for the botocore client. See: https://botocore.amazonaws.com/v1/documentation/api/latest/reference/config.html
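A minimal sketch of starting a recommendation run against a Glue table; as above, the database, table, and role names are placeholder assumptions:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.glue import (
    GlueDataQualityRuleRecommendationRunOperator,
)

with DAG(
    dag_id="example_glue_dq_recommendation",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    recommend_rules = GlueDataQualityRuleRecommendationRunOperator(
        task_id="recommend_rules",
        # Glue table to analyze, in the boto3 DataSource format.
        datasource={
            "GlueTable": {"DatabaseName": "sales_db", "TableName": "orders"}
        },
        role="MyGlueDataQualityRole",  # placeholder IAM role
        show_results=True,       # log the recommended ruleset on completion
        wait_for_completion=True,
    )
```

The recommended rules can then be reviewed and, if suitable, registered as a ruleset with GlueDataQualityOperator. Remember that recommendation runs are deleted after 90 days.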