Escaping the Cloud Token Trap: Building a Multi-Tenant Graph RAG System Locally


We’ve all been there. You want to build a Retrieval-Augmented Generation (RAG) system, so you do the “standard” thing: you chunk your entire database, throw it into a vector database, and wire it up to a massive cloud LLM.

Then the bill arrives.

Worse, you realize that while the cloud model is brilliant at reasoning, it suffers from “lost in the middle” syndrome when fed thousands of tokens of irrelevant context. You are paying a premium to confuse the smartest models on the market.

The solution isn’t to stop using cloud models—it’s to stop using them for the dirty work. By shifting the indexing, embedding, and initial retrieval phases to a local environment using tools like Ollama and LightRAG, you can build a highly precise, multi-tenant knowledge graph that only sends the most critical context to the expensive models.

Here is how we designed a local, Hub-and-Spoke Graph RAG system using PostgreSQL, and the deep architectural lessons we learned along the way.


🏗️ The Architecture: Hub-and-Spoke Graph RAG

Traditional RAG relies heavily on Vector Databases (Dense Retrieval). While great for finding semantically similar sentences, vectors are terrible at understanding relationships. If a customer asks, “How does feature X affect my billing?”, a pure vector search might pull up a document about feature X and a separate document about billing, but completely miss the bridge between them.

This is where LightRAG steps in. It builds a Knowledge Graph (GraphRAG), extracting entities (nodes) and their relationships (edges).

Our architecture uses a central Python Gateway to orchestrate this data flow from our existing PostgreSQL database into isolated LightRAG workspaces.

The Core Stack

  • Source of Truth: PostgreSQL (Housing raw app data, customer logs, and markdown docs).
  • The Engine: Ollama running llama3.1 (for local reasoning/extraction) and nomic-embed-text (for embeddings).
  • The Brain: LightRAG, running in a Docker container.
  • The Gateway: A FastAPI layer enforcing strict “Workspace” routing.

By running a daily ETL (Extract, Transform, Load) script, we pull rows from Postgres, format them into highly structured markdown blocks, and push them to specific LightRAG workspaces (e.g., customer_facing, internal_dev, product_codex).
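The transform step of that ETL can be sketched in a few lines. This is a minimal illustration, not the production script: the table names (`support_tickets`, `api_docs`, `product_specs`) and the row schema are assumed for the example; only the workspace names come from the post.

```python
# Hypothetical sketch of the ETL "transform" step: turning a Postgres row
# into a structured markdown block before pushing it to a LightRAG workspace.
# Table and column names are illustrative, not the post's actual schema.

def row_to_markdown(table: str, row: dict) -> str:
    """Render one database row as a markdown block that a local LLM
    can parse into entities and relationships."""
    lines = [f"## {table}: {row.get('title', row.get('id', 'record'))}"]
    for key, value in row.items():
        if key == "title":
            continue  # already used as the heading
        lines.append(f"- **{key}**: {value}")
    return "\n".join(lines)

def route_to_workspace(table: str) -> str:
    """Map each source table to a LightRAG workspace (illustrative mapping)."""
    mapping = {
        "support_tickets": "customer_facing",
        "api_docs": "internal_dev",
        "product_specs": "product_codex",
    }
    return mapping.get(table, "customer_facing")
```

The structured markdown matters: headings and key/value bullets give the local extraction model clear entity boundaries, which produces a much cleaner graph than raw row dumps.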


🧠 Deep Thoughts: Why This Approach Wins

1. The Power of “Data Partitioning” in RAG

One of the biggest mistakes in enterprise AI is creating a single, monolithic vector database. If you dump your API documentation, product roadmaps, and English learning resources into the same index, the LLM’s latent space gets muddy.

By utilizing Multi-Workspace Isolation, we apply a principle as old as database design: separation of concerns. The Gateway ensures that a customer asking a grammar question only queries the customer_facing graph, while a developer debugging an endpoint queries the internal_dev graph. This isolation drastically reduces hallucination and prevents unauthorized data access.
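The gateway-side check is simple but essential: it runs server-side, before any retrieval happens. A minimal sketch, assuming a role-to-workspace table (the roles here are illustrative; the workspace names are the ones above):

```python
# Minimal sketch of the Gateway's workspace guard. The role-to-workspace
# table is an assumption for illustration; the key point is that the
# allow-list is enforced in the Gateway, not trusted from the client.

ALLOWED_WORKSPACES = {
    "customer": {"customer_facing"},
    "developer": {"internal_dev", "product_codex"},
}

def resolve_workspace(role: str, requested: str) -> str:
    """Return the workspace only if this role is allowed to query it."""
    allowed = ALLOWED_WORKSPACES.get(role, set())
    if requested not in allowed:
        raise PermissionError(f"role {role!r} may not query {requested!r}")
    return requested
```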

2. Graph Retrieval > Vector Retrieval for Complex Codebases

Code and product documentation are inherently relational. A function calls another function; a product feature requires a specific database schema.

When LightRAG processes a document, it doesn’t just store the text. It uses the local Ollama model to actively read the text and extract a graph. The initial ingestion is slow—local LLMs churn hard to build these relationships—but the querying is lightning fast. When you query the system, you perform a Hybrid Retrieval: BM25 (exact keyword) + Vector (semantic) + Graph (relational).
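LightRAG handles this fusion internally, but the idea is easy to show. A toy sketch of combining the three signals into one ranking (the weights and score dictionaries are assumptions, not LightRAG's actual internals):

```python
# Illustrative score fusion for hybrid retrieval: keyword (BM25), vector
# (semantic), and graph (relational) scores are combined into one ranking.
# The weights are arbitrary for the example.

def fuse_scores(bm25: dict, vector: dict, graph: dict,
                weights=(0.3, 0.4, 0.3)) -> list:
    """Return document ids ranked by a weighted sum of the three scores."""
    docs = set(bm25) | set(vector) | set(graph)
    w_b, w_v, w_g = weights
    scored = {
        d: w_b * bm25.get(d, 0.0) + w_v * vector.get(d, 0.0) + w_g * graph.get(d, 0.0)
        for d in docs
    }
    return sorted(scored, key=scored.get, reverse=True)
```

The graph term is what rescues the “feature X affects billing” class of questions: a document with no keyword or embedding overlap can still rank highly because it sits on the edge connecting the two entities.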

3. Asymmetric Compute Costs

We leverage a cheap, local LLM to do the heavy lifting of reading and mapping the data (the Graph extraction phase). We only use expensive cloud models (like Gemini) when the user asks a highly complex, reasoning-heavy question. The local system retrieves the perfect 800-token context block from the graph, and we forward only that precise block to the cloud model. We turned a $100/month prompt habit into pennies.
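The hand-off itself can be sketched as two small functions. Both the complexity heuristic and the word-count token estimate are crude stand-ins for illustration; the 800-token budget is the figure from the paragraph above:

```python
# Hedged sketch of the asymmetric-compute hand-off: the local stack
# retrieves a tight context block, and only reasoning-heavy questions
# are forwarded to a cloud model. The trigger list and the 1-word-per-
# token estimate are assumptions for the example.

CONTEXT_BUDGET = 800  # tokens forwarded to the cloud model

def needs_cloud_model(question: str) -> bool:
    """Crude heuristic: multi-step or comparative questions go to the cloud."""
    triggers = ("why", "compare", "trade-off", "explain")
    return any(t in question.lower() for t in triggers)

def trim_context(chunks: list, budget: int = CONTEXT_BUDGET) -> str:
    """Greedily pack retrieved chunks into the token budget."""
    out, used = [], 0
    for chunk in chunks:
        n = len(chunk.split())  # rough token estimate
        if used + n > budget:
            break
        out.append(chunk)
        used += n
    return "\n\n".join(out)
```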


🚀 Further Improvements: Scaling the System

While this local Docker Compose stack is incredibly powerful, preparing it for massive scale requires a few architectural evolutions.

1. Semantic Caching (Redis/Momento)

Right now, if 100 users ask the chatbot, “How do I reset my password?”, the system performs the full graph-retrieval and generation process 100 times.

The Fix: Implement a Semantic Cache layer in front of the API Gateway. If the embedded intent of a new query has a 95% similarity match to a recently answered query, return the cached response instantly. This drops latency from seconds to milliseconds.
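A minimal in-process sketch of that cache, assuming queries arrive already embedded. The 0.95 cosine threshold is the figure above; everything else (linear scan, in-memory storage) is simplified for illustration, where a real deployment would use Redis with a vector index:

```python
# Minimal semantic cache sketch: store (embedding, answer) pairs and
# return a cached answer when a new query embedding is close enough.
# Linear scan and in-memory storage are simplifications for the example.
import math

class SemanticCache:
    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self.entries = []  # list of (embedding, answer) pairs

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    def get(self, embedding):
        """Return a cached answer if a stored query is similar enough."""
        for cached_emb, answer in self.entries:
            if self._cosine(embedding, cached_emb) >= self.threshold:
                return answer
        return None

    def put(self, embedding, answer):
        self.entries.append((embedding, answer))
```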

2. Agentic Routing (LLM as a Router)

Currently, our routing is deterministic (e.g., if user_type == 'dev'). As the system grows, we can deploy a micro-model (like an 8B parameter model) at the Gateway level to act purely as a “Traffic Cop.”

The Fix: The Traffic Cop reads the prompt and decides which workspace to query. If a Project Manager asks, “Did the new API endpoint cause customer complaints?”, the routing agent can intelligently query both the internal_dev workspace and the customer_facing workspace, synthesizing a cross-functional answer.
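In production the Traffic Cop would be a small local LLM; here a keyword stub stands in so the fan-out logic itself is visible. The workspace names come from the post; the keyword lists are assumptions:

```python
# Sketch of the "Traffic Cop" router. A real deployment would replace the
# keyword matching with a small classification model; the interesting part
# is that one prompt can fan out to several workspaces.

def route_query(prompt: str) -> list:
    """Return every workspace the prompt should be run against."""
    p = prompt.lower()
    targets = []
    if any(k in p for k in ("endpoint", "api", "schema", "deploy")):
        targets.append("internal_dev")
    if any(k in p for k in ("customer", "complaint", "billing", "grammar")):
        targets.append("customer_facing")
    return targets or ["customer_facing"]  # safe default workspace
```

Whatever answers come back from each workspace are then synthesized by the downstream model into one cross-functional response.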

3. Event-Driven Ingestion (Webhooks)

Batch syncing via cron jobs leaves the knowledge base out of date for up to 24 hours.

The Fix: Move from an ETL script to an Event-Driven architecture using PostgreSQL triggers or a message broker (like RabbitMQ/Kafka). When a developer merges a pull request or updates a PRD, a webhook instantly fires a payload to LightRAG, keeping the Knowledge Graph updated in near real-time.
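The consumer side of that pipeline is a thin handler. The JSON payload shape and the injected `ingest` callback are assumptions for illustration; in practice `ingest` would be whatever client call pushes a document into the target LightRAG workspace:

```python
# Sketch of the consumer side of event-driven ingestion: a Postgres
# trigger (or message broker) emits a JSON change event, and this handler
# turns it into an upsert against the right workspace. The payload shape
# is an assumption for the example.
import json

def handle_change_event(payload: str, ingest) -> str:
    """Parse a change event and forward the document to its workspace.

    `ingest(workspace, doc_id, text)` is injected so the handler stays
    testable and independent of any particular LightRAG client.
    """
    event = json.loads(payload)
    workspace = event["workspace"]          # e.g. "internal_dev"
    doc_id = f"{event['table']}:{event['id']}"
    ingest(workspace, doc_id, event["text"])
    return doc_id
```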


The Takeaway

Building AI systems isn’t just about calling the smartest API; it’s about systems engineering. By treating LLMs as modular components within a traditional software architecture—using local models for data processing and graph generation, and strictly controlling data flow via workspaces—you build systems that are not only cheaper to run, but dramatically more accurate.

The future of RAG isn’t just bigger context windows; it’s smarter, structured, local retrieval.

This post is licensed under CC BY 4.0 by the author.

© Joey. Some rights reserved.
