LlamaIndexHook

Use LlamaIndexHook to bridge an Airflow connection to LlamaIndex chat and embedding models. The hook reads credentials (API key, optional base URL) from a connection of type llamaindex and returns native LlamaIndex objects ready to pass to VectorStoreIndex(..., embed_model=...), load_index_from_storage(..., embed_model=...), or index.as_retriever(..., llm=...).

The hook deliberately does not mutate LlamaIndex’s global Settings singleton. Operators pass the resolved model directly to LlamaIndex constructors, so concurrent tasks in the same worker don’t race on shared state.
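The reasoning above can be illustrated with a small, library-free sketch. `FakeEmbedder` and `embed_documents` are hypothetical stand-ins, not LlamaIndex or Airflow APIs. The sketch only shows that when each task receives its model as an argument, two tasks running concurrently in one process cannot overwrite each other's configuration.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeEmbedder:
    """Hypothetical stand-in for a resolved embedding model."""

    model_name: str

    def embed(self, text: str) -> str:
        # A real model returns a vector; a tagged string keeps the sketch checkable.
        return f"{self.model_name}:{text}"


def embed_documents(texts, embed_model):
    # The model arrives as an argument, so nothing here reads or writes shared state.
    return [embed_model.embed(text) for text in texts]


def run_two_tasks():
    # Two "tasks" with different models, executed concurrently in one process.
    jobs = [
        (["a", "b"], FakeEmbedder("small")),
        (["c"], FakeEmbedder("large")),
    ]
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(embed_documents, texts, model) for texts, model in jobs]
        return [future.result() for future in futures]
```

Had both tasks instead assigned their model to one module-level settings object before embedding, the result would depend on thread scheduling.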

OpenAI by default, bring your own for other vendors

LlamaIndex does not ship a universal init_chat_model / init_embedding_model equivalent (each vendor is a separate package under llama-index-llms-* / llama-index-embeddings-* with its own constructor kwargs). The hook therefore covers the OpenAI-compatible surface that matches LlamaIndex’s own resolve_embed_model("default") behaviour:

  • hook.get_embedding_model() returns an OpenAIEmbedding configured from the connection.

  • hook.get_llm() returns an OpenAI LLM configured from the connection.
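A minimal sketch of the two calls above, passing the resolved models straight to LlamaIndex. The import path `airflow.providers.common.ai.hooks.llamaindex` and the `llm_conn_id` constructor keyword are assumptions inferred from the provider layout and the parameter list on this page, and `build_index_and_query_engine` is a hypothetical helper name. Imports sit inside the function, as in the provider's example DAGs, so the module can be parsed without the optional dependencies installed.

```python
def build_index_and_query_engine(documents):
    """Resolve models from the hook and pass them explicitly to LlamaIndex."""
    # Assumed import path; check the provider's API reference for the real one.
    from airflow.providers.common.ai.hooks.llamaindex import LlamaIndexHook
    from llama_index.core import VectorStoreIndex

    hook = LlamaIndexHook(llm_conn_id="llamaindex_default")
    embed_model = hook.get_embedding_model()  # OpenAIEmbedding built from the connection
    llm = hook.get_llm()  # OpenAI LLM; needs llm_model on the hook or in the connection extra

    # Models are handed over as keyword arguments; the global Settings object is untouched.
    index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)
    return index, index.as_query_engine(llm=llm)
```

The function expects a list of LlamaIndex `Document` objects and would typically be called from inside a `@task`.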

For other vendors (Cohere, Bedrock, Vertex AI, HuggingFace, …), instantiate the LlamaIndex class directly in a @task and pass it to the operator’s embed_model= / llm= parameter. Both LlamaIndexEmbeddingOperator and LlamaIndexRetrievalOperator accept a pre-built BaseEmbedding / LLM instance and bypass the hook:

airflow/providers/common/ai/example_dags/example_llamaindex_hook.py

@dag(schedule=None, tags=["example"])
def example_llamaindex_byo_embed_model():
    """Use a non-OpenAI embedding by instantiating the LlamaIndex class directly.

    LlamaIndex doesn't ship a universal init helper, so the operator accepts
    a pre-built ``BaseEmbedding`` instance and bypasses the hook entirely.
    Install the matching integration package:
    ``pip install llama-index-embeddings-cohere``.
    """

    @task
    def build_cohere_embedder():
        from llama_index.embeddings.cohere import CohereEmbedding

        from airflow.providers.common.compat.sdk import BaseHook

        conn = BaseHook.get_connection("cohere_default")
        return CohereEmbedding(model_name="embed-english-v3.0", cohere_api_key=conn.password)

    @task
    def demo_doc_list() -> list[dict]:
        # One small document so the example has something to embed.
        return [{"text": "Cohere demo content", "metadata": {}}]

    embed = LlamaIndexEmbeddingOperator(
        task_id="embed",
        documents=demo_doc_list(),
        embed_model=build_cohere_embedder(),
        persist_dir="/opt/airflow/data/cohere_index",
    )

    embed


Install the per-vendor LlamaIndex integration package separately with pip install, for example llama-index-embeddings-cohere, llama-index-embeddings-bedrock, llama-index-embeddings-huggingface, or llama-index-llms-anthropic.

Connection Configuration

The hook reads credentials from the Airflow connection of type llamaindex:

  • password – API key (passed as api_key to OpenAIEmbedding / OpenAI).

  • host – Optional base URL (passed as api_base; useful for custom OpenAI-compatible endpoints such as Ollama or vLLM).

  • extra JSON – {"embed_model": "text-embedding-3-small", "llm_model": "gpt-4o"} – default model identifiers stored on the connection.
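The field mapping above can be sketched without any Airflow or LlamaIndex imports. `connection_json` builds a payload in Airflow's JSON connection format, which could be exported as `AIRFLOW_CONN_LLAMAINDEX_DEFAULT`. `openai_kwargs` is a hypothetical mirror of the mapping, not the hook's actual code, and the `model` keyword is an assumption about how the model identifier is forwarded.

```python
import json
from typing import Optional


def connection_json(api_key: str, base_url: Optional[str] = None, **extra: str) -> str:
    """Serialize a llamaindex connection in Airflow's JSON connection format."""
    payload = {"conn_type": "llamaindex", "password": api_key, "extra": extra}
    if base_url:
        payload["host"] = base_url  # optional; omitted when the default endpoint is used
    return json.dumps(payload)


def openai_kwargs(conn: dict, model_key: str) -> dict:
    """Hypothetical mirror of the mapping: password -> api_key, host -> api_base."""
    kwargs = {
        "api_key": conn["password"],
        # Assumption: the model identifier is forwarded under the name "model".
        "model": conn.get("extra", {})[model_key],
    }
    if conn.get("host"):
        kwargs["api_base"] = conn["host"]
    return kwargs


# Example: a local OpenAI-compatible endpoint (URL and key are placeholders).
conn = json.loads(
    connection_json(
        "sk-placeholder",
        "http://localhost:11434/v1",
        embed_model="text-embedding-3-small",
        llm_model="gpt-4o",
    )
)
embedding_kwargs = openai_kwargs(conn, "embed_model")
```

Storing the model identifiers in extra keeps DAG code free of vendor-specific names; a different model then only requires editing the connection.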

Parameters

Parameter      Default                                    Description
-------------  -----------------------------------------  -------------------------------------------------------------------
llm_conn_id    llamaindex_default                         Airflow connection ID for the LLM/embedding provider.
embed_conn_id  None (falls back to llm_conn_id)           Optional separate Airflow connection ID for the embedding provider.
embed_model    None (falls back to extra["embed_model"])  Embedding model name, e.g. text-embedding-3-small.
llm_model      None (falls back to extra["llm_model"])    LLM model name, e.g. gpt-4o. Required when calling get_llm().
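The fallback rules in the parameters above can be sketched as two small functions. Both names are hypothetical, and raising `ValueError` for a missing LLM model is an assumption; the hook may signal this differently.

```python
from typing import Optional


def resolve_conn_id(
    llm_conn_id: str = "llamaindex_default",
    embed_conn_id: Optional[str] = None,
) -> str:
    """Connection used for embeddings: embed_conn_id if set, else llm_conn_id."""
    return embed_conn_id or llm_conn_id


def resolve_model(
    explicit: Optional[str],
    extra: dict,
    key: str,
    required: bool = False,
) -> Optional[str]:
    """Model name: the explicit argument wins, then the connection extra."""
    name = explicit or extra.get(key)
    if required and not name:
        # Assumed error type for the "required when calling get_llm()" rule.
        raise ValueError(f"No {key} set on the hook or in the connection extra")
    return name
```

For example, `resolve_model(None, {"llm_model": "gpt-4o"}, "llm_model", required=True)` yields `"gpt-4o"`, while an empty extra with no explicit value raises.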

Dependencies

Install the llamaindex extra:

pip install apache-airflow-providers-common-ai[llamaindex]

That extra installs llama-index-core, llama-index-embeddings-openai, and llama-index-llms-openai – enough to back the hook’s default OpenAI return values. For other vendors, install the matching LlamaIndex integration package separately.
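As a quick sanity check, the following sketch reports which of the three distributions named above are absent from the current environment. It uses only the standard library; `missing_distributions` is a hypothetical helper, not part of the provider.

```python
from importlib.metadata import PackageNotFoundError, version

# Distributions the llamaindex extra is described as installing.
REQUIRED = (
    "llama-index-core",
    "llama-index-embeddings-openai",
    "llama-index-llms-openai",
)


def missing_distributions(names=REQUIRED):
    """Return the distribution names that are not installed."""
    missing = []
    for name in names:
        try:
            version(name)  # raises PackageNotFoundError when the distribution is absent
        except PackageNotFoundError:
            missing.append(name)
    return missing
```

An empty list means the hook's default OpenAI classes should be importable; vendor packages such as llama-index-embeddings-cohere can be checked the same way by passing their names.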
