airflow.providers.amazon.aws.operators.glue

Module Contents

Classes

GlueJobOperator

Create an AWS Glue Job.

GlueDataQualityOperator

Creates a data quality ruleset with DQDL rules applied to a specified Glue table.

GlueDataQualityRuleSetEvaluationRunOperator

Evaluate a ruleset against a data source (Glue table).

GlueDataQualityRuleRecommendationRunOperator

Starts a recommendation run that is used to generate rules, Glue Data Quality analyzes the data and comes up with recommendations for a potential ruleset.

class airflow.providers.amazon.aws.operators.glue.GlueJobOperator(*, job_name='aws_glue_default_job', job_desc='AWS Glue Job with Airflow', script_location=None, concurrent_run_limit=None, script_args=None, retry_limit=0, num_of_dpus=None, aws_conn_id='aws_default', region_name=None, s3_bucket=None, iam_role_name=None, iam_role_arn=None, create_job_kwargs=None, run_job_kwargs=None, wait_for_completion=True, deferrable=conf.getboolean('operators', 'default_deferrable', fallback=False), verbose=False, replace_script_file=False, update_config=False, job_poll_interval=6, stop_job_run_on_kill=False, sleep_before_return=0, **kwargs)[source]

Bases: airflow.models.BaseOperator

Create an AWS Glue Job.

AWS Glue is a serverless Spark ETL service for running Spark Jobs on the AWS cloud. Language support: Python and Scala.

See also

For more information on how to use this operator, take a look at the guide: Submit an AWS Glue job

Parameters
  • job_name (str) – unique job name per AWS Account

  • script_location (str | None) – location of ETL script. Must be a local or S3 path

  • job_desc (str) – job description details

  • concurrent_run_limit (int | None) – The maximum number of concurrent runs allowed for a job

  • script_args (dict | None) – etl script arguments and AWS Glue arguments (templated)

  • retry_limit (int) – The maximum number of times to retry this job if it fails

  • num_of_dpus (int | float | None) – Number of AWS Glue DPUs to allocate to this Job.

  • region_name (str | None) – aws region name (example: us-east-1)

  • s3_bucket (str | None) – S3 bucket where logs and local etl script will be uploaded

  • iam_role_name (str | None) – AWS IAM Role for Glue Job Execution. If set iam_role_arn must equal None.

  • iam_role_arn (str | None) – AWS IAM ARN for Glue Job Execution. If set iam_role_name must equal None.

  • create_job_kwargs (dict | None) – Extra arguments for Glue Job Creation

  • run_job_kwargs (dict | None) – Extra arguments for Glue Job Run

  • wait_for_completion (bool) – Whether to wait for job run completion. (default: True)

  • deferrable (bool) – If True, the operator will wait asynchronously for the job to complete. This implies waiting for completion. This mode requires aiobotocore module to be installed. (default: False)

  • verbose (bool) – If True, Glue Job Run logs show in the Airflow Task Logs. (default: False)

  • update_config (bool) – If True, Operator will update job configuration. (default: False)

  • replace_script_file (bool) – If True, the script file will be replaced in S3. (default: False)

  • stop_job_run_on_kill (bool) – If True, Operator will stop the job run when task is killed.

  • sleep_before_return (int) – time in seconds to wait before returning final status. This is meaningful in case of limiting concurrency, Glue needs 5-10 seconds to clean up resources. Thus if status is returned immediately it might end up in case of more than 1 concurrent run. It is recommended to set this parameter to 10 when you are using concurrency=1. For more information see: https://repost.aws/questions/QUaKgpLBMPSGWO0iq2Fob_bw/glue-run-concurrent-jobs#ANFpCL2fRnQRqgDFuIU_rpvA

template_fields: Sequence[str] = ('job_name', 'script_location', 'script_args', 'create_job_kwargs', 's3_bucket',...[source]
template_ext: Sequence[str] = ()[source]
template_fields_renderers[source]
ui_color = '#ededed'[source]
glue_job_hook()[source]
execute(context)[source]

Execute AWS Glue Job from Airflow.

Returns

the current Glue job ID.

execute_complete(context, event=None)[source]
on_kill()[source]

Cancel the running AWS Glue Job.

class airflow.providers.amazon.aws.operators.glue.GlueDataQualityOperator(*, name, ruleset, description='AWS Glue Data Quality Rule Set With Airflow', update_rule_set=False, data_quality_ruleset_kwargs=None, aws_conn_id='aws_default', **kwargs)[source]

Bases: airflow.providers.amazon.aws.operators.base_aws.AwsBaseOperator[airflow.providers.amazon.aws.hooks.glue.GlueDataQualityHook]

Creates a data quality ruleset with DQDL rules applied to a specified Glue table.

See also

For more information on how to use this operator, take a look at the guide: Create an AWS Glue Data Quality

Parameters
  • name (str) – A unique name for the data quality ruleset.

  • ruleset (str) – A Data Quality Definition Language (DQDL) ruleset. For more information, see the Glue developer guide.

  • description (str) – A description of the data quality ruleset.

  • update_rule_set (bool) – To update existing ruleset, Set this flag to True. (default: False)

  • data_quality_ruleset_kwargs (dict | None) – Extra arguments for RuleSet.

  • aws_conn_id (str | None) – The Airflow connection used for AWS credentials. If this is None or empty then the default boto3 behaviour is used. If running Airflow in a distributed manner and aws_conn_id is None or empty, then default boto3 configuration would be used (and must be maintained on each worker node).

  • region_name – AWS region_name. If not specified then the default boto3 behaviour is used.

  • verify – Whether or not to verify SSL certificates. See: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/core/session.html

  • botocore_config – Configuration dictionary (key-values) for botocore client. See: https://botocore.amazonaws.com/v1/documentation/api/latest/reference/config.html

aws_hook_class[source]
template_fields: Sequence[str] = ('name', 'ruleset', 'description', 'data_quality_ruleset_kwargs')[source]
template_fields_renderers[source]
ui_color = '#ededed'[source]
validate_inputs()[source]
execute(context)[source]

Derive when creating an operator.

Context is the same dictionary used as when rendering jinja templates.

Refer to get_template_context for more context.

class airflow.providers.amazon.aws.operators.glue.GlueDataQualityRuleSetEvaluationRunOperator(*, datasource, role, rule_set_names, number_of_workers=5, timeout=2880, verify_result_status=True, show_results=True, rule_set_evaluation_run_kwargs=None, wait_for_completion=True, waiter_delay=60, waiter_max_attempts=20, deferrable=conf.getboolean('operators', 'default_deferrable', fallback=False), aws_conn_id='aws_default', **kwargs)[source]

Bases: airflow.providers.amazon.aws.operators.base_aws.AwsBaseOperator[airflow.providers.amazon.aws.hooks.glue.GlueDataQualityHook]

Evaluate a ruleset against a data source (Glue table).

See also

For more information on how to use this operator, take a look at the guide: Start a AWS Glue Data Quality Evaluation Run

Parameters
  • datasource (dict) – The data source (Glue table) associated with this run. (templated)

  • role (str) – IAM role supplied for job execution. (templated)

  • rule_set_names (list[str]) – A list of ruleset names for evaluation. (templated)

  • number_of_workers (int) – The number of G.1X workers to be used in the run. (default: 5)

  • timeout (int) – The timeout for a run in minutes. This is the maximum time that a run can consume resources before it is terminated and enters TIMEOUT status. (default: 2,880)

  • verify_result_status (bool) – Validate all the ruleset rules evaluation run results, If any of the rule status is Fail or Error then an exception is thrown. (default: True)

  • show_results (bool) – Displays all the ruleset rules evaluation run results. (default: True)

  • rule_set_evaluation_run_kwargs (dict[str, Any] | None) – Extra arguments for evaluation run. (templated)

  • wait_for_completion (bool) – Whether to wait for job to stop. (default: True)

  • waiter_delay (int) – Time in seconds to wait between status checks. (default: 60)

  • waiter_max_attempts (int) – Maximum number of attempts to check for job completion. (default: 20)

  • deferrable (bool) – If True, the operator will wait asynchronously for the job to stop. This implies waiting for completion. This mode requires aiobotocore module to be installed. (default: False)

  • aws_conn_id (str | None) – The Airflow connection used for AWS credentials. If this is None or empty then the default boto3 behaviour is used. If running Airflow in a distributed manner and aws_conn_id is None or empty, then default boto3 configuration would be used (and must be maintained on each worker node).

  • region_name – AWS region_name. If not specified then the default boto3 behaviour is used.

  • verify – Whether or not to verify SSL certificates. See: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/core/session.html

  • botocore_config – Configuration dictionary (key-values) for botocore client. See: https://botocore.amazonaws.com/v1/documentation/api/latest/reference/config.html

aws_hook_class[source]
template_fields: Sequence[str] = ('datasource', 'role', 'rule_set_names', 'rule_set_evaluation_run_kwargs')[source]
template_fields_renderers[source]
ui_color = '#ededed'[source]
validate_inputs()[source]
execute(context)[source]

Derive when creating an operator.

Context is the same dictionary used as when rendering jinja templates.

Refer to get_template_context for more context.

execute_complete(context, event=None)[source]
class airflow.providers.amazon.aws.operators.glue.GlueDataQualityRuleRecommendationRunOperator(*, datasource, role, number_of_workers=5, timeout=2880, show_results=True, recommendation_run_kwargs=None, wait_for_completion=True, waiter_delay=60, waiter_max_attempts=20, deferrable=conf.getboolean('operators', 'default_deferrable', fallback=False), aws_conn_id='aws_default', **kwargs)[source]

Bases: airflow.providers.amazon.aws.operators.base_aws.AwsBaseOperator[airflow.providers.amazon.aws.hooks.glue.GlueDataQualityHook]

Starts a recommendation run that is used to generate rules, Glue Data Quality analyzes the data and comes up with recommendations for a potential ruleset.

Recommendation runs are automatically deleted after 90 days.

See also

For more information on how to use this operator, take a look at the guide: Start a AWS Glue Data Quality Recommendation Run

Parameters
  • datasource (dict) – The data source (Glue table) associated with this run. (templated)

  • role (str) – IAM role supplied for job execution. (templated)

  • number_of_workers (int) – The number of G.1X workers to be used in the run. (default: 5)

  • timeout (int) – The timeout for a run in minutes. This is the maximum time that a run can consume resources before it is terminated and enters TIMEOUT status. (default: 2,880)

  • show_results (bool) – Displays the recommended ruleset (a set of rules), when recommendation run completes. (default: True)

  • recommendation_run_kwargs (dict[str, Any] | None) – Extra arguments for recommendation run. (templated)

  • wait_for_completion (bool) – Whether to wait for job to stop. (default: True)

  • waiter_delay (int) – Time in seconds to wait between status checks. (default: 60)

  • waiter_max_attempts (int) – Maximum number of attempts to check for job completion. (default: 20)

  • deferrable (bool) – If True, the operator will wait asynchronously for the job to stop. This implies waiting for completion. This mode requires aiobotocore module to be installed. (default: False)

  • aws_conn_id (str | None) – The Airflow connection used for AWS credentials. If this is None or empty then the default boto3 behaviour is used. If running Airflow in a distributed manner and aws_conn_id is None or empty, then default boto3 configuration would be used (and must be maintained on each worker node).

  • region_name – AWS region_name. If not specified then the default boto3 behaviour is used.

  • verify – Whether or not to verify SSL certificates. See: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/core/session.html

  • botocore_config – Configuration dictionary (key-values) for botocore client. See: https://botocore.amazonaws.com/v1/documentation/api/latest/reference/config.html

aws_hook_class[source]
template_fields: Sequence[str] = ('datasource', 'role', 'recommendation_run_kwargs')[source]
template_fields_renderers[source]
ui_color = '#ededed'[source]
execute(context)[source]

Derive when creating an operator.

Context is the same dictionary used as when rendering jinja templates.

Refer to get_template_context for more context.

execute_complete(context, event=None)[source]

Was this entry helpful?