airflow.providers.common.ai.operators.document_loader
Classes
DocumentLoaderOperator: Parse files into list[dict(text, metadata)] for downstream embedding.
Module Contents
- class airflow.providers.common.ai.operators.document_loader.DocumentLoaderOperator(*, source_path=None, source_conn_id=None, source_bytes=None, file_type=None, parser='auto', file_extensions=None, metadata_fields=None, encoding='utf-8', encoding_errors='strict', json_text_field=None, **kwargs)
Bases: airflow.providers.common.compat.sdk.BaseOperator

Parse files into list[dict(text, metadata)] for downstream embedding.

Bridges Airflow’s connectivity layer (hooks that produce bytes or local files) and the AI embedding layer (operators that need structured text with metadata). Framework-agnostic: no LlamaIndex, LangChain, or other AI framework dependency.
Built-in parsers handle .txt, .md, .csv, and .json with zero extra dependencies. PDF and DOCX support requires optional packages, installable via extras:

    pip install apache-airflow-providers-common-ai[pdf]   # pypdf
    pip install apache-airflow-providers-common-ai[docx]  # python-docx
Provide exactly one of source_path or source_bytes. When using source_bytes, file_type is required so the operator knows which parser to use.

The operator is intentionally a loader: it does not split documents into fixed-size chunks. Pass the output to a downstream text-splitter or embedding operator if you need chunking.
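The argument rules above can be sketched in plain Python. This is a minimal illustration, not the provider's code: check_source_args is a hypothetical helper, and raising ValueError is an assumption, since the documentation does not say which exception the operator raises.

```python
from __future__ import annotations


def check_source_args(
    source_path: str | None = None,
    source_bytes: bytes | None = None,
    file_type: str | None = None,
) -> None:
    """Hypothetical helper mirroring the documented argument rules."""
    # Exactly one of the two sources must be given: reject both and neither.
    if (source_path is None) == (source_bytes is None):
        raise ValueError("Provide exactly one of source_path or source_bytes")
    # Raw bytes carry no file name, so the extension hint is mandatory.
    if source_bytes is not None and file_type is None:
        raise ValueError("file_type is required when using source_bytes")


# Accepted: a path on its own (file_type stays optional here).
check_source_args(source_path="docs/*.md")
# Accepted: bytes together with an extension hint.
check_source_args(source_bytes=b"%PDF-", file_type=".pdf")
```

With source_path, file_type stays optional because the extension can be read from the file name; it only acts as an override.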
- Parameters:
  - source_path (str | None) – A local path, glob pattern, or storage URI (s3://, gs://, azure://, file://, …). Cloud URIs go through ObjectStoragePath / fsspec. ** enables recursive matching for local globs. Cloud URIs accept a single file or a directory; cross-directory globs in a cloud URI are not supported in this version.
  - source_conn_id (str | None) – Airflow connection ID used by ObjectStoragePath for cloud URIs (aws_default, google_cloud_default, …). Ignored for local paths.
  - source_bytes (bytes | None) – Raw file bytes, typically from XCom.
  - file_type (str | None) – File extension hint when using source_bytes (e.g. ".pdf"). Also accepted with source_path to override auto-detection.
  - parser (str) – Parsing backend selection. "auto" (default) picks the backend from the file extension.
  - file_extensions (list[str] | None) – When source_path is a directory or glob, only process files whose extension is in this list. When omitted, the operator processes only files whose extension is known to the built-in dispatch (others are skipped with a warning) and silently ignores files whose name starts with a dot.
  - metadata_fields (dict[str, Any] | None) – Extra key-value pairs merged into every document’s metadata dict. Auto-extracted fields such as file_name, file_path, row_index, item_index, and page_number take precedence over keys with the same name.
  - encoding (str) – Text encoding used for .txt / .md / .csv / .json and for the bytes path. Defaults to "utf-8".
  - encoding_errors (str) – How decode errors are handled. Defaults to "strict"; set to "replace" or "ignore" to tolerate mixed-encoding inputs at the cost of some character loss.
  - json_text_field (str | None) – When parsing JSON, treat this key as the embedding text and put every other key into metadata. Applies to each item when the top-level JSON is a list, or to the object when it is a single dict. When None (default), the operator flattens dicts into "k: v, k: v" text (same shape as the CSV parser).
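To make the json_text_field, metadata_fields, and encoding parameters concrete, here is a rough standalone sketch of the documented JSON behaviour. It is not the operator's implementation: parse_json_sketch and flatten are hypothetical names. It also assumes that every JSON item is a dict, that item_index is only set for top-level lists, and that the remaining JSON keys override user-supplied metadata_fields.

```python
from __future__ import annotations

import json
from typing import Any


def flatten(record: dict[str, Any]) -> str:
    """Render a dict as "k: v, k: v" text."""
    return ", ".join(f"{key}: {value}" for key, value in record.items())


def parse_json_sketch(
    raw: bytes,
    *,
    encoding: str = "utf-8",
    encoding_errors: str = "strict",
    json_text_field: str | None = None,
    metadata_fields: dict[str, Any] | None = None,
) -> list[dict[str, Any]]:
    """Hypothetical stand-in for the JSON parser described above."""
    # encoding / encoding_errors map directly onto bytes.decode().
    data = json.loads(raw.decode(encoding, errors=encoding_errors))
    is_list = isinstance(data, list)
    items = data if is_list else [data]

    docs = []
    for index, item in enumerate(items):
        if json_text_field is not None:
            # The chosen key becomes the text; every other key is metadata.
            text = str(item[json_text_field])
            extracted = {k: v for k, v in item.items() if k != json_text_field}
        else:
            # Default: flatten the whole dict into "k: v, k: v" text.
            text = flatten(item)
            extracted = {}
        if is_list:
            extracted["item_index"] = index  # assumption: lists only
        # User-supplied fields go first so extracted fields win on conflicts.
        metadata = {**(metadata_fields or {}), **extracted}
        docs.append({"text": text, "metadata": metadata})
    return docs


raw = b'[{"body": "hello", "lang": "en"}, {"body": "bye", "lang": "en"}]'
docs = parse_json_sketch(
    raw,
    json_text_field="body",
    metadata_fields={"source": "demo", "item_index": -1},
)
print(docs[0])
```

In this sketch the user-supplied item_index of -1 is overwritten by the extracted index, which mirrors the precedence rule stated for metadata_fields.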
- template_fields: collections.abc.Sequence[str] = ('source_path', 'source_conn_id', 'file_type', 'file_extensions', 'parser', 'metadata_fields')