airflow.providers.common.ai.operators.llamaindex_retrieval
Operator for semantic retrieval via a persisted LlamaIndex index.
Classes
LlamaIndexRetrievalOperator | Retrieve relevant document chunks from a persisted LlamaIndex index.
Module Contents
- class airflow.providers.common.ai.operators.llamaindex_retrieval.LlamaIndexRetrievalOperator(*, query, index_persist_dir, persist_conn_id=None, embed_model=None, llm_conn_id=None, embed_conn_id=None, top_k=5, **kwargs)
Bases: airflow.providers.common.compat.sdk.BaseOperator
Retrieve relevant document chunks from a persisted LlamaIndex index.
Loads a previously persisted vector store index (from LlamaIndexEmbeddingOperator(persist_dir=...)) and performs a similarity search against the provided query. The output is a list of chunks, each with text, score, metadata, and node id, ready for downstream synthesis via LLMOperator.
Passes the embedding model directly to load_index_from_storage(..., embed_model=...) rather than mutating the LlamaIndex Settings object, so concurrent tasks in the same worker don't race on shared state.
- Parameters:
query (str) – The query string. Supports Jinja templating.
index_persist_dir (str) – Local path or storage URI (s3://, gs://, …) pointing at the persisted LlamaIndex index. Resolved via ObjectStoragePath when a URI scheme is present.
persist_conn_id (str | None) – Airflow connection ID for cloud-storage credentials when index_persist_dir is a URI.
embed_model (str | llama_index.core.base.embeddings.base.BaseEmbedding | None) – Either:
- a string model name (e.g. "text-embedding-3-small"), in which case the operator constructs a LlamaIndexHook-backed OpenAIEmbedding from llm_conn_id / embed_conn_id, or
- a pre-built BaseEmbedding instance, which bypasses the hook for non-OpenAI vendors. It must match the embedding model used when the index was originally built.
Templated, so it works with both literal strings and @task output that builds a custom embedder.
llm_conn_id (str | None) – Airflow connection ID for the embedding API. Falls back to LlamaIndexHook.default_conn_name when None. Used only when embed_model is a string (or omitted entirely).
embed_conn_id (str | None) – Optional separate Airflow connection ID for the embedding provider. Falls back to llm_conn_id when None.
top_k (int) – Number of top results to retrieve.
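The fallback rules described for the connection and path parameters can be sketched in plain Python. This is only an illustration of the documented behaviour, not the operator's actual code: the helper names are hypothetical, the default connection name is a placeholder (this page does not state the value of LlamaIndexHook.default_conn_name), and the "://" check is an assumed stand-in for however the operator detects a URI scheme.

```python
from __future__ import annotations

# Placeholder only: the real value is LlamaIndexHook.default_conn_name,
# which this page does not state.
ASSUMED_DEFAULT_CONN_NAME = "llamaindex_default"


def resolve_embed_conn_id(llm_conn_id: str | None, embed_conn_id: str | None) -> str:
    """Hypothetical helper: pick the embedding connection using the documented fallbacks."""
    # llm_conn_id falls back to the hook's default connection name when None.
    effective_llm = llm_conn_id if llm_conn_id is not None else ASSUMED_DEFAULT_CONN_NAME
    # embed_conn_id falls back to llm_conn_id when None.
    return embed_conn_id if embed_conn_id is not None else effective_llm


def looks_like_storage_uri(index_persist_dir: str) -> bool:
    """Hypothetical helper: True when the value carries a URI scheme such as s3:// or gs://."""
    # Assumed detection rule; local paths like /opt/airflow/index have no "://".
    return "://" in index_persist_dir
```

Under these assumptions, passing only llm_conn_id reuses that connection for embeddings, while persist_conn_id matters only when looks_like_storage_uri(index_persist_dir) would be true.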
- template_fields: collections.abc.Sequence[str] = ('query', 'index_persist_dir', 'persist_conn_id', 'embed_model', 'llm_conn_id', 'embed_conn_id')
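A minimal usage sketch, assuming the signature and template fields listed above. The keyword names come from this page; the values (bucket URI, connection ID, params.question) are illustrative. The downstream helper chunks_to_context is hypothetical, and it assumes each returned chunk is a dict with "text" and "score" keys, which this page implies but does not spell out. The operator import is deferred into a function so the snippet loads without Airflow installed.

```python
from __future__ import annotations

from typing import Any

# Keyword arguments taken from the documented signature; values are illustrative.
RETRIEVAL_KWARGS: dict[str, Any] = {
    "task_id": "retrieve_chunks",  # standard BaseOperator argument
    "query": "{{ params.question }}",  # rendered by Jinja: 'query' is a template field
    "index_persist_dir": "s3://example-bucket/indexes/docs",  # hypothetical URI
    "persist_conn_id": "aws_default",  # illustrative connection ID
    "embed_model": "text-embedding-3-small",  # must match the model used to build the index
    "top_k": 5,
}


def build_retrieval_task(**overrides: Any):
    """Create the operator; call this inside a DAG definition."""
    # Deferred import so this module can be loaded without the provider installed.
    from airflow.providers.common.ai.operators.llamaindex_retrieval import (
        LlamaIndexRetrievalOperator,
    )

    return LlamaIndexRetrievalOperator(**{**RETRIEVAL_KWARGS, **overrides})


def chunks_to_context(chunks: list[dict[str, Any]], min_score: float = 0.0) -> str:
    """Hypothetical downstream step: join retrieved chunk texts, best score first.

    Assumes each chunk has "text" and a numeric "score"; the exact key
    names are not confirmed by this page.
    """
    kept = [c for c in chunks if c["score"] >= min_score]
    kept.sort(key=lambda c: c["score"], reverse=True)
    return "\n\n".join(c["text"] for c in kept)
```

Because embed_model and the connection IDs are also template fields, the same pattern should apply to them, for example by passing a Jinja expression instead of a literal model name.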