airflow.providers.common.ai.operators.llamaindex_embedding
Operator for document chunking and embedding via LlamaIndex.
Classes
LlamaIndexEmbeddingOperator | Chunk documents and produce embedding vectors using LlamaIndex.
Module Contents
- class airflow.providers.common.ai.operators.llamaindex_embedding.LlamaIndexEmbeddingOperator(*, documents, embed_model=None, llm_conn_id=None, embed_conn_id=None, chunk_size=512, chunk_overlap=50, persist_dir=None, persist_conn_id=None, **kwargs)
Bases: airflow.providers.common.compat.sdk.BaseOperator

Chunk documents and produce embedding vectors using LlamaIndex.

Bridges document loading (e.g. DocumentLoaderOperator output) and vector storage (pgvector, Pinecone, Weaviate, …). Input is list[dict] with text and metadata keys; output includes the embedding vectors, ready for downstream storage ingest.

The operator passes the embedding model directly to VectorStoreIndex(..., embed_model=...) – it does not mutate LlamaIndex’s global Settings singleton, so concurrent tasks in the same worker don’t race on shared state.

- Parameters:
  - documents (list[dict[str, Any]]) – List of dicts with text and metadata keys, typically from DocumentLoaderOperator or a @task. Templated, so binding via my_loader.output (XCom direct) resolves to the native list[dict] before execute runs.
  - embed_model (str | llama_index.core.base.embeddings.base.BaseEmbedding | None) – Either:
    - a string model name (e.g. "text-embedding-3-small") – the operator constructs a LlamaIndexHook-backed OpenAIEmbedding from llm_conn_id / embed_conn_id, or
    - a pre-built BaseEmbedding instance – bypass the hook entirely for non-OpenAI vendors (e.g. CohereEmbedding(...), BedrockEmbedding(...)).
    Templated, so it works with both literal strings and @task output that builds a custom embedder.
  - llm_conn_id (str | None) – Airflow connection ID for the embedding API. Falls back to LlamaIndexHook.default_conn_name when None.
  - embed_conn_id (str | None) – Optional separate Airflow connection ID for the embedding provider. Falls back to llm_conn_id when None.
  - chunk_size (int) – Chunk size for the sentence splitter.
  - chunk_overlap (int) – Overlap between chunks.
  - persist_dir (str | None) – Optional path to persist the index. Accepts local paths and storage URIs (s3://, gs://, …) resolved via ObjectStoragePath.
  - persist_conn_id (str | None) – Airflow connection ID for cloud-storage credentials when persist_dir is a URI.
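
Two of the defaults described in the parameter list are easy to misread: the connection-ID fallback chain and the way consecutive chunks overlap. The following is a minimal sketch in plain Python, not the operator's implementation. DEFAULT_CONN_NAME is a placeholder for LlamaIndexHook.default_conn_name (its real value is not stated here), and window_chunks is a hypothetical word-based sliding window; the operator itself delegates chunking to LlamaIndex's sentence splitter, which behaves differently on real text.

```python
from __future__ import annotations

# Placeholder for LlamaIndexHook.default_conn_name; the real value is not
# stated in this reference, so this string is an assumption.
DEFAULT_CONN_NAME = "llamaindex_default"


def resolve_embed_conn_id(llm_conn_id: str | None, embed_conn_id: str | None) -> str:
    """Documented fallback order: embed_conn_id -> llm_conn_id -> hook default."""
    if embed_conn_id is not None:
        return embed_conn_id
    if llm_conn_id is not None:
        return llm_conn_id
    return DEFAULT_CONN_NAME


def window_chunks(
    words: list[str], chunk_size: int = 512, chunk_overlap: int = 50
) -> list[list[str]]:
    """Hypothetical fixed-size sliding window over words.

    This only illustrates how chunk_size and chunk_overlap relate; it is
    not the sentence splitter the operator uses.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each new chunk starts this far after the last
    chunks: list[list[str]] = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # the current chunk already reaches the end of the input
    return chunks
```

With the default values, each chunk in this sketch starts 462 words after the previous one, so the last 50 words of one chunk reappear as the first 50 of the next.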
- template_fields: collections.abc.Sequence[str] = ('documents', 'embed_model', 'llm_conn_id', 'embed_conn_id', 'persist_dir', 'persist_conn_id')
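
A usage sketch, assuming the provider is installed and the constructor accepts the keyword arguments documented above. The task ID, connection IDs, and bucket path are hypothetical. The import is deferred into a function so the snippet can be loaded without Airflow present.

```python
from __future__ import annotations

from typing import Any

# Input shape from the parameter docs: a list of dicts with "text" and "metadata" keys.
documents: list[dict[str, Any]] = [
    {"text": "Airflow schedules workflows as DAGs.", "metadata": {"source": "intro.md"}},
    {"text": "LlamaIndex builds indexes over documents.", "metadata": {"source": "rag.md"}},
]

# Keyword names come from the documented signature; the values are illustrative.
operator_kwargs: dict[str, Any] = {
    "task_id": "embed_documents",                  # hypothetical task ID
    "documents": documents,
    "embed_model": "text-embedding-3-small",       # string form: hook-backed embedder
    "llm_conn_id": "my_llm_conn",                  # hypothetical connection ID
    "chunk_size": 512,
    "chunk_overlap": 50,
    "persist_dir": "s3://my-bucket/indexes/demo",  # hypothetical storage URI
    "persist_conn_id": "my_aws_conn",              # hypothetical connection ID
}


def make_embedding_operator():
    """Instantiate the operator; requires the common.ai provider to be installed."""
    from airflow.providers.common.ai.operators.llamaindex_embedding import (
        LlamaIndexEmbeddingOperator,
    )

    return LlamaIndexEmbeddingOperator(**operator_kwargs)
```

In a real pipeline, make_embedding_operator() would be called inside a DAG definition. Per template_fields, every argument shown here except task_id, chunk_size, and chunk_overlap is templated, so documents could instead be bound to an upstream task's output.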