airflow.providers.amazon.aws.operators.comprehend

Module Contents

Classes

ComprehendBaseOperator

Base operator for Amazon Comprehend service operators (not intended to be used directly in DAGs).

ComprehendStartPiiEntitiesDetectionJobOperator

Create an Amazon Comprehend PII entities detection job for a collection of documents.

ComprehendCreateDocumentClassifierOperator

Create an Amazon Comprehend document classifier that can categorize documents.

class airflow.providers.amazon.aws.operators.comprehend.ComprehendBaseOperator(input_data_config, output_data_config, data_access_role_arn, language_code, **kwargs)[source]

Bases: airflow.providers.amazon.aws.operators.base_aws.AwsBaseOperator[airflow.providers.amazon.aws.hooks.comprehend.ComprehendHook]

Base operator for Amazon Comprehend service operators (not intended to be used directly in DAGs).

Parameters
  • input_data_config (dict) – The input properties for a PII entities detection job. (templated)

  • output_data_config (dict) – Provides configuration parameters for the output of PII entity detection jobs. (templated)

  • data_access_role_arn (str) – The Amazon Resource Name (ARN) of the IAM role that grants Amazon Comprehend read access to your input data. (templated)

  • language_code (str) – The language of the input documents. (templated)

aws_hook_class[source]
template_fields: collections.abc.Sequence[str][source]
template_fields_renderers: ClassVar[dict][source]
client()[source]

Create and return the Comprehend client.

abstract execute(context)[source]

Must be overridden in child classes.

class airflow.providers.amazon.aws.operators.comprehend.ComprehendStartPiiEntitiesDetectionJobOperator(input_data_config, output_data_config, mode, data_access_role_arn, language_code, start_pii_entities_kwargs=None, wait_for_completion=True, waiter_delay=60, waiter_max_attempts=20, deferrable=conf.getboolean('operators', 'default_deferrable', fallback=False), **kwargs)[source]

Bases: ComprehendBaseOperator

Create an Amazon Comprehend PII entities detection job for a collection of documents.

See also

For more information on how to use this operator, take a look at the guide: Create an Amazon Comprehend Start PII Entities Detection Job

Parameters
  • input_data_config (dict) – The input properties for a PII entities detection job. (templated)

  • output_data_config (dict) – Provides configuration parameters for the output of PII entity detection jobs. (templated)

  • mode (str) – Specifies whether the output provides the locations (offsets) of PII entities or a file in which PII entities are redacted. If you set mode to ONLY_REDACTION, you must provide a RedactionConfig in start_pii_entities_kwargs.

  • data_access_role_arn (str) – The Amazon Resource Name (ARN) of the IAM role that grants Amazon Comprehend read access to your input data. (templated)

  • language_code (str) – The language of the input documents. (templated)

  • start_pii_entities_kwargs (dict[str, Any] | None) – Any optional parameters to pass to the job. If JobName is not provided in start_pii_entities_kwargs, the operator will generate one.

  • wait_for_completion (bool) – Whether to wait for the job to complete. (default: True)

  • waiter_delay (int) – Time in seconds to wait between status checks. (default: 60)

  • waiter_max_attempts (int) – Maximum number of attempts to check for job completion. (default: 20)

  • deferrable (bool) – If True, the operator will wait asynchronously for the job to stop. This implies waiting for completion. This mode requires aiobotocore module to be installed. (default: False)

  • aws_conn_id – The Airflow connection used for AWS credentials. If this is None or empty then the default boto3 behaviour is used. If running Airflow in a distributed manner and aws_conn_id is None or empty, then default boto3 configuration would be used (and must be maintained on each worker node).

  • region_name – AWS region_name. If not specified then the default boto3 behaviour is used.

  • verify – Whether to verify SSL certificates. See: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/core/session.html

  • botocore_config – Configuration dictionary (key-values) for botocore client. See: https://botocore.amazonaws.com/v1/documentation/api/latest/reference/config.html

execute(context)[source]

Must be overridden in child classes.

execute_complete(context, event=None)[source]
class airflow.providers.amazon.aws.operators.comprehend.ComprehendCreateDocumentClassifierOperator(document_classifier_name, input_data_config, mode, data_access_role_arn, language_code, fail_on_warnings=False, output_data_config=None, document_classifier_kwargs=None, wait_for_completion=True, waiter_delay=60, waiter_max_attempts=20, deferrable=conf.getboolean('operators', 'default_deferrable', fallback=False), aws_conn_id='aws_default', **kwargs)[source]

Bases: airflow.providers.amazon.aws.operators.base_aws.AwsBaseOperator[airflow.providers.amazon.aws.hooks.comprehend.ComprehendHook]

Create an Amazon Comprehend document classifier that can categorize documents.

Provide a set of training documents that are labeled with the categories.

See also

For more information on how to use this operator, take a look at the guide: Create an Amazon Comprehend Document Classifier

Parameters
  • document_classifier_name (str) – The name of the document classifier. (templated)

  • input_data_config (dict[str, Any]) – Specifies the format and location of the input data for the job. (templated)

  • mode (str) – Indicates the mode in which the classifier will be trained. (templated)

  • data_access_role_arn (str) – The Amazon Resource Name (ARN) of the IAM role that grants Amazon Comprehend read access to your input data. (templated)

  • language_code (str) – The language of the input documents. You can specify any of the languages supported by Amazon Comprehend. All documents must be in the same language. (templated)

  • fail_on_warnings (bool) – If set to True, the document classifier training job will throw an error when the status is TRAINED_WITH_WARNING. (default False)

  • output_data_config (dict[str, Any] | None) – Specifies the location for the output files from a custom classifier job. This parameter is required for a request that creates a native document model. (templated)

  • document_classifier_kwargs (dict[str, Any] | None) – Any optional parameters to pass to the document classifier. (templated)

  • wait_for_completion (bool) – Whether to wait for the job to complete. (default: True)

  • waiter_delay (int) – Time in seconds to wait between status checks. (default: 60)

  • waiter_max_attempts (int) – Maximum number of attempts to check for job completion. (default: 20)

  • deferrable (bool) – If True, the operator will wait asynchronously for the job to stop. This implies waiting for completion. This mode requires aiobotocore module to be installed. (default: False)

  • aws_conn_id (str | None) – The Airflow connection used for AWS credentials. If this is None or empty then the default boto3 behaviour is used. If running Airflow in a distributed manner and aws_conn_id is None or empty, then default boto3 configuration would be used (and must be maintained on each worker node).

  • region_name – AWS region_name. If not specified then the default boto3 behaviour is used.

  • verify – Whether to verify SSL certificates. See: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/core/session.html

  • botocore_config – Configuration dictionary (key-values) for botocore client. See: https://botocore.amazonaws.com/v1/documentation/api/latest/reference/config.html

aws_hook_class[source]
template_fields: collections.abc.Sequence[str][source]
template_fields_renderers: ClassVar[dict][source]
execute(context)[source]

Derive when creating an operator.

Context is the same dictionary used as when rendering jinja templates.

Refer to get_template_context for more context.

execute_complete(context, event=None)[source]