airflow.providers.common.ai.operators.document_loader

Attributes

FilePathT

Classes

DocumentLoaderOperator

Parse files into list[dict(text, metadata)] for downstream embedding.

Module Contents

airflow.providers.common.ai.operators.document_loader.FilePathT[source]
class airflow.providers.common.ai.operators.document_loader.DocumentLoaderOperator(*, source_path=None, source_conn_id=None, source_bytes=None, file_type=None, parser='auto', file_extensions=None, metadata_fields=None, encoding='utf-8', encoding_errors='strict', json_text_field=None, **kwargs)[source]

Bases: airflow.providers.common.compat.sdk.BaseOperator

Parse files into list[dict(text, metadata)] for downstream embedding.

Bridges Airflow’s connectivity layer (hooks that produce bytes or local files) and the AI embedding layer (operators that need structured text with metadata). Framework-agnostic: no LlamaIndex, LangChain, or other AI framework dependency.

Built-in parsers handle .txt, .md, .csv, and .json with zero extra dependencies. PDF and DOCX support requires optional packages, installable via extras:

pip install apache-airflow-providers-common-ai[pdf]    # pypdf
pip install apache-airflow-providers-common-ai[docx]   # python-docx

Provide exactly one of source_path or source_bytes. When using source_bytes, file_type is required so the operator knows which parser to use.
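
The input rule above can be sketched as a small standalone check. This is an illustration only: the helper name check_source_args and the use of ValueError are assumptions, and the operator's own validation may be written differently or raise a different exception.

```python
from typing import Optional


def check_source_args(
    source_path: Optional[str] = None,
    source_bytes: Optional[bytes] = None,
    file_type: Optional[str] = None,
) -> str:
    """Return the input mode ("path" or "bytes"), or raise on an invalid mix.

    Hypothetical helper that mirrors the documented rule; it is not part of
    the provider package.
    """
    has_path = source_path is not None
    has_bytes = source_bytes is not None
    # Both set or both missing violates "exactly one".
    if has_path == has_bytes:
        raise ValueError("Provide exactly one of source_path or source_bytes.")
    # Raw bytes carry no file name, so the parser needs an explicit hint.
    if has_bytes and not file_type:
        raise ValueError("file_type is required when using source_bytes.")
    return "bytes" if has_bytes else "path"


print(check_source_args(source_path="docs/**/*.md"))
print(check_source_args(source_bytes=b"hello", file_type=".txt"))
```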

The operator is intentionally a loader: it does not split documents into fixed-size chunks. Pass the output to a downstream text-splitter or embedding operator if you need chunking.
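
As a rough idea of what a downstream chunking step could look like, here is a minimal fixed-size splitter in plain Python. It assumes each loaded document is a dict with "text" and "metadata" keys, as the list[dict(text, metadata)] summary suggests. The function name split_documents and the chunk_index metadata key are hypothetical and not part of the provider.

```python
from typing import Any


def split_documents(
    documents: list[dict[str, Any]],
    chunk_size: int = 200,
) -> list[dict[str, Any]]:
    """Split each document's text into pieces of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks: list[dict[str, Any]] = []
    for doc in documents:
        text = doc["text"]
        # Documents with empty text produce no chunks.
        for index, start in enumerate(range(0, len(text), chunk_size)):
            chunks.append(
                {
                    "text": text[start : start + chunk_size],
                    # Copy the metadata so chunks do not share one dict.
                    "metadata": {**doc["metadata"], "chunk_index": index},
                }
            )
    return chunks


loaded = [{"text": "abcdefghij", "metadata": {"file_name": "a.txt"}}]
for chunk in split_documents(loaded, chunk_size=4):
    print(chunk)
```

A real pipeline would more likely split on sentence or token boundaries; character slicing is used here only to keep the sketch short.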

Parameters:
  • source_path (str | None) – A local path, glob pattern, or storage URI (s3://, gs://, azure://, file://, …). Cloud URIs go through ObjectStoragePath / fsspec. ** enables recursive matching for local globs. Cloud URIs accept a single file or a directory; cross-directory globs in a cloud URI are not supported in this version.

  • source_conn_id (str | None) – Airflow connection ID used by ObjectStoragePath for cloud URIs (aws_default, google_cloud_default, …). Ignored for local paths.

  • source_bytes (bytes | None) – Raw file bytes, typically from XCom.

  • file_type (str | None) – File extension hint when using source_bytes (e.g. ".pdf"). Also accepted with source_path to override auto-detection.

  • parser (str) – Parsing backend selection. "auto" (default) picks the backend from the file extension.

  • file_extensions (list[str] | None) – When source_path is a directory or glob, only process files whose extension is in this list. When omitted, the operator processes only files whose extension the built-in dispatch recognizes; files with other extensions are skipped with a warning, and files whose name starts with a dot are ignored silently.

  • metadata_fields (dict[str, Any] | None) – Extra key-value pairs merged into every document’s metadata dict. Auto-extracted fields such as file_name, file_path, row_index, item_index, and page_number take precedence over keys with the same name.

  • encoding (str) – Text encoding used for .txt/.md/.csv/.json and for the bytes path. Defaults to "utf-8".

  • encoding_errors (str) – How decode errors are handled. Defaults to "strict"; set to "replace" or "ignore" to tolerate mixed-encoding inputs at the cost of some character loss.

  • json_text_field (str | None) – When parsing JSON, treat this key as the embedding text and put every other key into metadata. Applies to each item when the top-level JSON is a list, or to the object when it is a single dict. When None (default), the operator flattens dicts into "k: v, k: v" text (same shape as the CSV parser).
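
The json_text_field and metadata_fields descriptions can be made concrete with a small pure-Python sketch. It is written from the parameter descriptions above, not from the operator's source. The function name json_to_documents is hypothetical. Attaching item_index only for list inputs is an assumption, as is letting the remaining JSON keys override metadata_fields. It also assumes every item is a JSON object.

```python
import json
from typing import Any, Optional


def json_to_documents(
    raw: str,
    json_text_field: Optional[str] = None,
    metadata_fields: Optional[dict[str, Any]] = None,
) -> list[dict[str, Any]]:
    """Turn a JSON string into documents, following the parameter descriptions."""
    data = json.loads(raw)
    is_list = isinstance(data, list)
    items = data if is_list else [data]
    documents = []
    for index, item in enumerate(items):
        if json_text_field is not None:
            # The chosen key becomes the text; every other key goes to metadata.
            text = str(item[json_text_field])
            extracted = {k: v for k, v in item.items() if k != json_text_field}
        else:
            # Default: flatten the dict into "k: v, k: v" text.
            text = ", ".join(f"{k}: {v}" for k, v in item.items())
            extracted = {}
        if is_list:
            extracted["item_index"] = index  # assumed to apply to list inputs only
        # User-supplied fields go first so extracted fields with the same name win.
        metadata = {**(metadata_fields or {}), **extracted}
        documents.append({"text": text, "metadata": metadata})
    return documents


raw = '[{"title": "Intro", "body": "Hello"}, {"title": "Next", "body": "World"}]'
extra = {"source": "demo", "item_index": 99}
for doc in json_to_documents(raw, json_text_field="body", metadata_fields=extra):
    print(doc)
```

In this sketch the user-supplied item_index of 99 is overwritten by the extracted index, which matches the documented precedence of auto-extracted fields over metadata_fields.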

template_fields: collections.abc.Sequence[str] = ('source_path', 'source_conn_id', 'file_type', 'file_extensions', 'parser', 'metadata_fields')[source]
EXTENSION_BACKEND_MAP: dict[str, str][source]
source_path = None[source]
source_conn_id = None[source]
source_bytes = None[source]
file_type = None[source]
parser = 'auto'[source]
file_extensions = None[source]
metadata_fields = None[source]
encoding = 'utf-8'[source]
encoding_errors = 'strict'[source]
json_text_field = None[source]
execute(context)[source]

Override this method when deriving an operator.

The main method to execute the task. context is the same dictionary that is used when rendering Jinja templates.

Refer to get_template_context for more context.
