Building a Generic RAG System: Integrating LLMs into a Knowledge-Aware Agent
How I built a provider-agnostic Retrieval-Augmented Generation system using LangChain, LangGraph, and ChromaDB - and why LLM integration decisions matter more than the retrieval itself.
Background
Most RAG tutorials stop at the retrieval part - chunk your docs, embed them, do a similarity search, stuff the context into a prompt. That works. But when you’re building something meant to last beyond a hackathon, the questions start piling up fast:
- What happens when you want to swap GPT-4o for Gemini?
- What if the user’s question can’t be answered from the knowledge base at all?
- How do you give the LLM access to tools without letting it run wild?
These are the questions that shaped generic-rag: a full-stack RAG system built around a provider-agnostic, agent-based architecture. The goal was to build something genuinely usable - not just a proof of concept.
The Core Idea: RAG as a Pipeline, Not a Prompt
Before writing any code, I had to decide how to structure the interaction between the user, the knowledge base, and the LLM. A naive approach would be:
user question → similarity search → stuff into prompt → LLM → answer
That works for demos. But it breaks when:
- The retrieved chunks aren’t relevant enough
- The question requires combining knowledge base results with external information
- You need to trace exactly where the answer came from
Instead, I structured the system around two execution modes: a RAG chain for direct retrieval-to-answer, and a ReAct agent for more complex, multi-step reasoning. Both are built on LangGraph.
Architecture Overview
User Input
│
▼
FastAPI Endpoint
│
├─── /knowledge → Ingestion Service (chunking + embedding + ChromaDB)
│
└─── /chat → ReAct Agent (LangGraph)
│
├── search_knowledge (ChromaDB)
├── search_web (Tavily)
└── microservice (internal APIs)
The backend is a FastAPI app. ChromaDB handles local persistent vector storage. LangGraph orchestrates the agent loop. LangChain provides the glue between embeddings, LLMs, and tools.
Provider Abstraction: The Part That Matters Most
The first real engineering decision was how to handle LLM and embedding providers. The naive approach is to hardcode the model. The right approach is to treat the LLM as a dependency - something you inject, not something you embed in logic.
Here’s how llm.py looks:
def get_llm() -> BaseChatModel:
    if settings.llm_provider == "openai":
        from langchain_openai import ChatOpenAI
        return ChatOpenAI(model="gpt-4o", api_key=settings.openai_api_key)
    elif settings.llm_provider == "gemini":
        from langchain_google_genai import ChatGoogleGenerativeAI
        return ChatGoogleGenerativeAI(
            model="gemini-2.5-flash",
            google_api_key=settings.google_api_key,
            include_thoughts=True,
            stream_usage=True,
        )
    raise ValueError(f"Unsupported LLM provider: {settings.llm_provider}")
Same pattern for embeddings - get_embeddings() returns a Google, OpenAI, or LM Studio embedding instance based on config. The rest of the codebase never touches provider-specific code.
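For reference, a get_embeddings() in the same spirit might look like the sketch below. The provider classes are real LangChain integrations, but the exact model names and the lmstudio_base_url setting are illustrative assumptions, not the repo's actual code:

from langchain_core.embeddings import Embeddings

def get_embeddings() -> Embeddings:
    # Lazy imports again: only the active provider's package needs to be installed.
    if settings.embedding_provider == "google":
        from langchain_google_genai import GoogleGenerativeAIEmbeddings
        return GoogleGenerativeAIEmbeddings(
            model="models/gemini-embedding-001",  # assumed model id
            google_api_key=settings.google_api_key,
        )
    elif settings.embedding_provider == "openai":
        from langchain_openai import OpenAIEmbeddings
        return OpenAIEmbeddings(api_key=settings.openai_api_key)
    elif settings.embedding_provider == "lmstudio":
        # LM Studio serves an OpenAI-compatible API, so the OpenAI class can
        # point at the local server (base URL setting name is an assumption).
        from langchain_openai import OpenAIEmbeddings
        return OpenAIEmbeddings(
            base_url=settings.lmstudio_base_url,
            api_key="lm-studio",
        )
    raise ValueError(f"Unsupported embedding provider: {settings.embedding_provider}")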
Why lazy imports inside the function? Because loading langchain_openai and langchain_google_genai at module level means both packages must be installed. With lazy imports, you only pay the import cost for the provider you’re actually using.
This is what makes the system genuinely provider-agnostic - and it also made the reindex_all() endpoint possible: when you switch embedding providers, you can re-embed your entire knowledge base without touching any document data.
Document Ingestion: From Bytes to Vector Embeddings
The ingestion pipeline is in services/ingestion.py. The flow is:
- Accept text or file (PDF / TXT / MD)
- Extract raw text via pypdf for PDFs, UTF-8 decode for others
- Split into chunks using RecursiveCharacterTextSplitter (500 chars, 50 overlap)
- Enrich metadata: doc_id, title, source_type, created_at, chunk_index
- Store in ChromaDB via get_vectorstore()
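Condensed into a sketch, that flow looks roughly like this (the function name and field values are illustrative; the real ingestion.py differs in detail):

from datetime import datetime, timezone
from uuid import uuid4

from langchain_text_splitters import RecursiveCharacterTextSplitter

def ingest_text(text: str, title: str, source_type: str) -> str:
    """Chunk raw text, enrich metadata, and store the chunks in ChromaDB."""
    doc_id = str(uuid4())

    splitter = RecursiveCharacterTextSplitter(
        chunk_size=settings.chunk_size,        # 500 by default
        chunk_overlap=settings.chunk_overlap,  # 50 by default
    )
    chunks = splitter.split_text(text)

    # Every chunk carries enough metadata to be cited on its own.
    metadatas = [
        {
            "doc_id": doc_id,
            "title": title,
            "source_type": source_type,
            "created_at": datetime.now(timezone.utc).isoformat(),
            "chunk_index": i,
        }
        for i in range(len(chunks))
    ]

    get_vectorstore().add_texts(texts=chunks, metadatas=metadatas)
    return doc_id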
The metadata enrichment step is easy to skip in a prototype, but critical in production. When the system builds a source citation, the title comes straight from doc.metadata["title"] - if that field is empty, citations become useless. Every chunk carries enough context to be self-contained.
The reindex_all() function is particularly careful: it writes to a temporary collection first, then swaps it with the real one only after successful completion. If embedding fails halfway through, the original data remains intact. This is a small but important detail when you’re operating over someone’s accumulated knowledge base.
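To make the swap concrete, here is roughly what it could look like against the ChromaDB client directly - a sketch, assuming chroma_path and collection_name settings that the real code may name differently:

import chromadb

def reindex_all() -> None:
    """Re-embed every chunk into a temp collection, then swap it into place."""
    client = chromadb.PersistentClient(path=settings.chroma_path)  # assumed setting
    source = client.get_collection(settings.collection_name)       # assumed setting
    temp_name = f"{settings.collection_name}_reindex_tmp"

    # 1. Pull the raw chunks and metadata out of the existing collection.
    existing = source.get(include=["documents", "metadatas"])

    # 2. Re-embed into a temporary collection with the *current* provider.
    temp = client.get_or_create_collection(temp_name)
    temp.add(
        ids=existing["ids"],
        documents=existing["documents"],
        metadatas=existing["metadatas"],
        embeddings=get_embeddings().embed_documents(existing["documents"]),
    )

    # 3. Only after every chunk embedded successfully: drop the old collection
    #    and rename the temp one into its place. A failure above leaves the
    #    original data untouched.
    client.delete_collection(settings.collection_name)
    temp.modify(name=settings.collection_name)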
The RAG Chain: LangGraph State Machine
For straightforward retrieval-and-answer scenarios, I built a simple two-node LangGraph graph in core/rag_chain.py.
RAGState
│
▼
[retrieve] → similarity_search_with_relevance_scores()
filter by similarity_threshold (default: 0.6)
│
▼
[generate] → build context from docs + conversation history
call LLM with strict or flexible system prompt
extract and deduplicate sources
│
▼
END
RAGState is a TypedDict that flows through the graph:
class RAGState(TypedDict):
    question: str
    history: list[dict]
    documents: list[Document]
    answer: str
    sources: list[SourceInfo]
    strict: bool
The strict flag changes the system prompt. In strict mode, the LLM is instructed to only use the provided context - useful when you want controlled, citation-backed answers. In flexible mode, it can supplement with general knowledge.
The similarity threshold (0.6 by default) acts as a quality gate. Documents below the threshold are dropped before they reach the generate node - better to say “I don’t know” than to hallucinate from low-relevance chunks.
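Wiring the two nodes together takes only a few lines of LangGraph. The sketch below is simplified - the prompt constants are placeholders and source extraction is omitted - but it shows the shape of the graph:

from langgraph.graph import StateGraph, END

def retrieve(state: RAGState) -> dict:
    # Fetch candidates and keep only those above the relevance threshold.
    results = get_vectorstore().similarity_search_with_relevance_scores(
        state["question"], k=settings.top_k_results
    )
    docs = [doc for doc, score in results if score >= settings.similarity_threshold]
    return {"documents": docs}

def generate(state: RAGState) -> dict:
    # Build the prompt from retrieved chunks plus history and call the LLM.
    context = "\n\n".join(doc.page_content for doc in state["documents"])
    system = STRICT_PROMPT if state["strict"] else FLEXIBLE_PROMPT  # placeholder constants
    response = get_llm().invoke([
        {"role": "system", "content": system.format(context=context)},
        *state["history"],
        {"role": "user", "content": state["question"]},
    ])
    return {"answer": response.content}

graph = StateGraph(RAGState)
graph.add_node("retrieve", retrieve)
graph.add_node("generate", generate)
graph.set_entry_point("retrieve")
graph.add_edge("retrieve", "generate")
graph.add_edge("generate", END)
rag_chain = graph.compile()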
The ReAct Agent: When RAG Isn’t Enough
The RAG chain works well for direct knowledge questions. But real users ask things like: “Based on what I uploaded, how does this compare to what’s happening in the market right now?”
That question requires both internal knowledge base retrieval and external web search. A static chain can’t do that. An agent can.
core/react_agent.py implements a LangGraph ReAct agent with three tools:
| Tool | Purpose |
|---|---|
| search_knowledge | Query ChromaDB, filter by similarity threshold, return chunks with scores |
| search_web | Tavily-powered web search for real-time information |
| microservice | Call internal APIs (extendable for domain-specific backends) |
The agent’s system prompt enforces a priority: always try search_knowledge first. This prevents the LLM from skipping the knowledge base entirely and going straight to web search.
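Conceptually, the setup boils down to LangGraph's prebuilt ReAct constructor. This is a sketch rather than the repo's actual code - the real react_agent.py adds the approval gate described below, and the keyword names vary slightly across langgraph versions:

from langgraph.prebuilt import create_react_agent
from langgraph.checkpoint.memory import MemorySaver

AGENT_SYSTEM_PROMPT = (
    "Answer using the user's knowledge base. Always call search_knowledge "
    "first; fall back to search_web or microservice only when the knowledge "
    "base has nothing relevant."
)  # illustrative wording

agent = create_react_agent(
    model=get_llm(),
    tools=[search_knowledge, search_web, microservice],  # the three tools above
    prompt=AGENT_SYSTEM_PROMPT,
    checkpointer=MemorySaver(),  # dev-only persistence, keyed by thread ID
)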
Tool Approval Gate
One design decision I’m proud of: non-knowledge tools require explicit user approval before execution.
When the agent decides to call search_web or microservice, it hits an interrupt(). The graph pauses, returns the pending tool call to the frontend, and waits. The user can approve, reject, or modify the tool arguments. Only then does execution resume.
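A minimal sketch of that gate, assuming a hypothetical run_tool dispatcher and a frontend that resumes the graph with an approval payload:

from langgraph.types import interrupt

APPROVAL_REQUIRED = {"search_web", "microservice"}

def guarded_tool_call(tool_name: str, tool_args: dict) -> str:
    """Pause the graph and wait for the user's verdict before running a tool."""
    if tool_name in APPROVAL_REQUIRED:
        # interrupt() suspends the run and surfaces this payload to the client;
        # execution resumes with whatever payload the client sends back.
        decision = interrupt({"tool": tool_name, "args": tool_args})
        if not decision.get("approved"):
            return "Tool call rejected by the user."
        tool_args = decision.get("args", tool_args)  # user may edit the arguments
    return run_tool(tool_name, tool_args)  # hypothetical dispatcher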
This is important for trust. Users should know when the system is reaching outside their knowledge base. It’s also useful for debugging - you can see exactly what the agent was about to do.
State Persistence and Checkpointing
Multi-turn conversations require state. LangGraph’s MemorySaver keeps the agent’s message history in memory, keyed by thread ID. Each user session gets its own thread.
The code notes that MemorySaver is development-only - for production multi-instance deployments, you’d replace it with Redis-based persistence. The abstraction LangGraph uses makes this swap straightforward.
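Invoking the agent is then just a matter of passing a thread ID in the run config (session_id and question are placeholders here):

# Each user session maps to its own LangGraph thread; the checkpointer
# restores prior messages automatically on every turn.
config = {"configurable": {"thread_id": session_id}}
result = agent.invoke(
    {"messages": [{"role": "user", "content": question}]},
    config=config,
)
answer = result["messages"][-1].content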
The search_knowledge Tool Under the Hood
This is the bridge between the agent and ChromaDB:
async def _execute(self, query: str) -> str:
    vectorstore = get_vectorstore()
    results = vectorstore.similarity_search_with_relevance_scores(
        query, k=settings.top_k_results
    )
    relevant = [
        (doc, score) for doc, score in results
        if score >= settings.similarity_threshold
    ]
    if not relevant:
        return "No relevant information found in the knowledge base."
    parts = []
    for doc, score in relevant:
        meta = doc.metadata
        parts.append(
            f"[source: {meta['title']} | doc_id: {meta['doc_id']} | score: {score:.2f}]\n"
            f"{doc.page_content}"
        )
    return "\n\n---\n\n".join(parts)
The output format matters: [source: ... | doc_id: ... | score: ...] gives the LLM structured metadata it can use when generating citations. The agent learns to parse this format from the tool description - no special parsing code needed on the LLM side.
Configuration: Everything Through .env
The entire system is configured through environment variables, read into Pydantic Settings:
class Settings(BaseSettings):
    llm_provider: Literal["openai", "gemini"] = "gemini"
    embedding_provider: Literal["google", "openai", "lmstudio"] = "google"
    chunk_size: int = 500
    chunk_overlap: int = 50
    top_k_results: int = 5
    similarity_threshold: float = 0.6
    max_history_turns: int = 5
Pydantic validates types at startup, so misconfiguration fails fast - not silently at request time. Switching from Gemini to GPT-4o is a one-line .env change.
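As an illustration, the switch can be as small as this in .env (variable names map onto the Settings fields above):

# was: LLM_PROVIDER=gemini
LLM_PROVIDER=openai
OPENAI_API_KEY=sk-...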
What I Learned
The LLM integration is the hard part, not the retrieval. Chunking and embedding documents is well-documented and largely solved. The interesting engineering happens around: how does the LLM decide when it has enough context? How do you stop it from confidently answering from irrelevant chunks? How do you give it tools without losing control?
Strict vs. flexible mode is underrated. Most RAG systems commit to a single answering mode - either always grounded or always generative. Having both, switchable per request, turned out to be genuinely useful in practice.
Tool approval gates make agents trustworthy. An agent that can silently call external APIs is unsettling. Making non-knowledge tool calls visible and interruptible changes the user’s relationship with the system from “black box” to “collaborator.”
Provider abstraction pays off early. I switched from an earlier embedding model to gemini-embedding-001 mid-project. Because everything routes through get_embeddings(), the change was a single line in config plus a POST /api/v1/knowledge/reindex.
What’s Next
The current system uses in-memory checkpointing. The obvious next step is persistent checkpointing (Redis or a database) to support multi-user, multi-session deployments without losing conversation history on restart.
The microservice tool is also intentionally generic - the goal is to let the agent call domain-specific internal APIs without rebuilding the agent layer. That’s the path toward turning this from a “chat with your documents” demo into something that can actually do work in a real production environment.
The repo is open: github.com/KevinNathanaelTaufiek/generic-rag