How to Build a RAG Pipeline: A Practical Guide for 2026
6 min read · Updated Mar 30, 2026
RAG (Retrieval-Augmented Generation) is the workaround for the one thing LLMs are still bad at: knowing your specific data. This guide walks through a working pipeline with code you can copy, the parts everyone gets wrong on the first try, and the moments where I have watched RAG projects either ship or collapse. By the end you will have a chunker, an embedder, a retriever, and a prompt template that produce grounded answers — plus an honest opinion on when RAG is the wrong tool.
Key takeaways
- RAG = retrieval (find relevant chunks) + generation (ask the LLM with those chunks as context). Two systems, one pipeline.
- Chunking is where most pipelines fail. Start at 300–500 tokens with 50-token overlap; tune from there.
- pgvector beats Pinecone on price for under one million vectors, and is one fewer service to babysit.
- Cosine similarity is not the same as relevance — a re-ranker on top of vector search adds 15–30% accuracy in our tests.
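- Anthropic’s 2024 Contextual Retrieval study reported up to 49% fewer failed retrievals when contextual embeddings are combined with contextual BM25. The technique adds one preprocessing step: prepending a short, chunk-specific description of the surrounding document to each chunk before indexing.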
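The contextual preprocessing step from the last takeaway can be sketched roughly as follows. This is a minimal illustration, not Anthropic's implementation: `contextualize_chunks` and `describe` are hypothetical names, and the stub stands in for an LLM call that would normally see the whole document and write one or two situating sentences.

```python
from typing import Callable


def contextualize_chunks(
    doc_title: str,
    chunks: list[str],
    describe: Callable[[str, str], str],
) -> list[str]:
    """Prepend a short situating sentence to each chunk before embedding.

    `describe(doc_title, chunk)` is a hypothetical hook. In a real pipeline it
    would call an LLM; it is injected here so the sketch runs offline.
    """
    return [f"{describe(doc_title, chunk)}\n\n{chunk}" for chunk in chunks]


def stub_describe(doc_title: str, chunk: str) -> str:
    # Stand-in for the LLM call: it only names the source document.
    return f"This excerpt is from '{doc_title}'."


contextual = contextualize_chunks(
    "Employee Handbook",
    ["Refunds are issued within 14 days.", "Support tiers are Basic and Pro."],
    stub_describe,
)
print(contextual[0])
```

The contextualized strings, not the raw chunks, are what would be embedded and indexed; the original chunk text can still be stored separately for display.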
What RAG actually is (and what it is not)
RAG is a retrieval system bolted onto a generation system. Given a user question, you find the most semantically relevant chunks of your private data, paste them into the prompt as context, and ask the LLM to answer using only that context. It is not fine-tuning. Fine-tuning changes the weights of a model so it talks differently; RAG changes the prompt so the model has the right facts in front of it. RAG wins almost every time when your data changes frequently, when you need source citations, or when you cannot afford a training run.
The pipeline at 30,000 feet
- Ingest — load source docs (PDFs, HTML, Markdown, Notion exports) and clean the text.
- Chunk — split each doc into 300–500 token windows with 10–20% overlap.
- Embed — convert each chunk into a vector with an embedding model (text-embedding-3-small is the cheap default).
- Store — write vectors plus metadata to a vector store (pgvector, Qdrant, Pinecone).
- Retrieve — embed the user question, do a top-k nearest neighbour search, optionally re-rank.
- Generate — stuff retrieved chunks into a prompt template and call the LLM with citations.
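To make the six stages concrete before choosing real components, here is a toy, dependency-free sketch of how they connect. Everything in it is an assumption for illustration: the in-memory corpus, the word-window chunker, and the bag-of-words "embedding" stand in for real loaders, token-based splitting, and an embedding model, and the final LLM call is left out.

```python
import math
import re
from collections import Counter


def ingest() -> dict[str, str]:
    # Hypothetical in-memory corpus standing in for loaded, cleaned files.
    return {
        "refunds.md": "Refunds are issued within 14 days of purchase. "
        "Email support to request a refund.",
        "plans.md": "The Pro plan includes priority support and single sign-on.",
    }


def chunk(text: str, size: int = 50, overlap: int = 5) -> list[str]:
    """Split text into overlapping word windows (words stand in for tokens)."""
    words = text.split()
    step = size - overlap
    pieces = []
    for start in range(0, len(words), step):
        pieces.append(" ".join(words[start : start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return pieces


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real pipeline calls an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, store: list, k: int = 1) -> list[tuple[str, str]]:
    """Return the k (source, text) pairs most similar to the question."""
    q = embed(question)
    ranked = sorted(store, key=lambda row: cosine(q, row[2]), reverse=True)
    return [(source, text) for source, text, _ in ranked[:k]]


def build_prompt(question: str, hits: list[tuple[str, str]]) -> str:
    context = "\n\n".join(
        f"[{i}] ({source})\n{text}" for i, (source, text) in enumerate(hits, start=1)
    )
    return f"Answer using only the context below.\n\n{context}\n\nQuestion: {question}\nAnswer:"


# Ingest -> chunk -> embed -> store
store = [
    (source, piece, embed(piece))
    for source, text in ingest().items()
    for piece in chunk(text)
]

# Retrieve -> generate (the prompt would be sent to the LLM at this point)
hits = retrieve("How do refunds work?", store, k=1)
prompt = build_prompt("How do refunds work?", hits)
print(prompt)
```

Each function maps to one stage, so swapping the toy `embed` for a real model or the list-based `store` for a vector database changes one piece without touching the rest.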
Choosing your vector store in 2026
| Store | Best for | Starting price | Hybrid search |
|---|---|---|---|
| pgvector (Postgres) | Under 1M vectors, existing Postgres stack | Free (on Supabase free tier) | With pg_trgm, manual |
| Qdrant | Self-hosted, 1M–10M vectors, hybrid out of the box | Free self-host, $25/mo Cloud | Yes, native |
| Pinecone | Fully managed, very high QPS, billions of vectors | $50/mo Starter | Yes, native |
| Weaviate | Built-in modules, GraphQL API, hybrid + filters | Free self-host, $25/mo Sandbox | Yes, native |
| Chroma | Local dev, prototyping, demos | Free, open source | Limited |
My default in 2026 is pgvector on Supabase. One service, one bill, one query language, free up to a generous limit. The day you cross a million vectors or need sub-100ms p99 latency at high QPS, migrate to Qdrant if you self-host or Pinecone if you do not want to. Do not start with Pinecone for a prototype; you will pay for capacity you do not use.
Chunking: the part everyone gets wrong
Big chunks dilute meaning; the embedding becomes an average of three topics and matches nothing well. Tiny chunks lose context; the LLM gets fragments without the surrounding sentence. The right size depends on your docs, but a sensible starting point is 300–500 tokens per chunk with 50 tokens of overlap. Always split on semantic boundaries (paragraphs, headings) before falling back to fixed-size windows. The LangChain RecursiveCharacterTextSplitter does this well.
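```python
# Requires the langchain-text-splitters and tiktoken packages.
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Measure chunk_size in tokens, not characters: length_function=len would
# count characters and yield chunks of roughly 100 tokens instead of 400.
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=400,
    chunk_overlap=50,
    # Try headings first, then paragraphs, lines, sentences, and words.
    separators=["\n## ", "\n### ", "\n\n", "\n", ". ", " "],
)

with open("docs/handbook.md", encoding="utf-8") as f:
    text = f.read()

chunks = splitter.create_documents(
    texts=[text],
    metadatas=[{"source": "handbook.md"}],
)
print(f"Created {len(chunks)} chunks")
```
Embeddings: which model and why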
OpenAI’s text-embedding-3-small is the cheap default at $0.02 per million tokens — a 10,000-document corpus costs cents to embed. For higher accuracy use text-embedding-3-large ($0.13/M tokens). If you need a fully open-source path, BAAI/bge-large-en-v1.5 from Hugging Face is competitive and runs locally on a single GPU. Pick one and commit; switching models means re-embedding everything.
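A quick back-of-the-envelope check of the "costs cents" claim, as a sketch. The per-million prices come from the paragraph above; the average of 2,000 tokens per document is an assumption, so substitute a figure measured on your own corpus.

```python
def embedding_cost_usd(
    num_docs: int, avg_tokens_per_doc: int, price_per_million: float
) -> float:
    """One-off cost of embedding a corpus at a given per-million-token price."""
    total_tokens = num_docs * avg_tokens_per_doc
    return total_tokens / 1_000_000 * price_per_million


# Assumption: 10,000 documents averaging 2,000 tokens each (20M tokens total).
small = embedding_cost_usd(10_000, 2_000, 0.02)
large = embedding_cost_usd(10_000, 2_000, 0.13)
print(f"text-embedding-3-small: ${small:.2f}")
print(f"text-embedding-3-large: ${large:.2f}")
```

Under that assumption the small model comes to about $0.40 and the large model to about $2.60, so the one-off embedding cost rarely decides the choice; re-embedding after a model switch costs the same amount again.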
```python
from openai import OpenAI

client = OpenAI()


def embed_batch(texts: list[str]) -> list[list[float]]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
    )
    return [item.embedding for item in response.data]


# Embed all chunks in batches of 100 to stay within rate limits
batch_size = 100
all_vectors = []
for i in range(0, len(chunks), batch_size):
    batch = [c.page_content for c in chunks[i : i + batch_size]]
    all_vectors.extend(embed_batch(batch))
print(f"Embedded {len(all_vectors)} chunks at dim {len(all_vectors[0])}")
```
Storage and retrieval with pgvector
```python
import numpy as np
import psycopg
from pgvector.psycopg import register_vector

conn = psycopg.connect("postgresql://user:pass@localhost/ragdb")
# The extension must exist before register_vector can look up the vector type.
conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
register_vector(conn)

conn.execute("""
    CREATE TABLE IF NOT EXISTS docs (
        id BIGSERIAL PRIMARY KEY,
        source TEXT,
        content TEXT,
        embedding VECTOR(1536)  -- dimension of text-embedding-3-small
    )
""")
conn.execute("""
    CREATE INDEX IF NOT EXISTS docs_embedding_idx
    ON docs USING hnsw (embedding vector_cosine_ops)
""")

# Insert chunks; NumPy arrays are adapted to the vector type
for chunk, vec in zip(chunks, all_vectors):
    conn.execute(
        "INSERT INTO docs (source, content, embedding) VALUES (%s, %s, %s)",
        (chunk.metadata["source"], chunk.page_content, np.array(vec)),
    )
conn.commit()

# Retrieve top 5 for a question (<=> is cosine distance, smallest first)
question_vec = np.array(embed_batch(["What is our refund policy?"])[0])
rows = conn.execute(
    "SELECT source, content FROM docs ORDER BY embedding <=> %s LIMIT 5",
    (question_vec,),
).fetchall()
```
Building the grounded prompt
```typescript
type RetrievedChunk = { source: string; content: string };

export function buildPrompt(
  question: string,
  chunks: RetrievedChunk[],
): string {
  const context = chunks
    .map((c, i) => `[${i + 1}] (${c.source})\n${c.content}`)
    .join("\n\n");

  return `You answer questions using ONLY the context below.
If the answer is not in the context, say "I don’t have that information."
Always cite sources by their [number].
--- CONTEXT ---
${context}
--- END CONTEXT ---
Question: ${question}
Answer:`;
}
```
The story I tell every team before they start
On a Wednesday afternoon in October 2024, a twelve-person SaaS company in Cape Town asked me to build a customer-facing chatbot over their 800-page product documentation. The first version returned the right chunks 80% of the time, and the LLM was still wrong about half the time. Why? Chunks were 1,500 tokens each. The embedding for each chunk was the average meaning of three topics, and the LLM was being asked to find one specific fact in a paragraph that talked about pricing, integrations, and support tiers all at once. We dropped chunk size to 350 tokens with 50-token overlap and kept the same embedding model and the same LLM. Accuracy on their internal eval set jumped from 64% to 91% in one afternoon. The pipeline did not change. The chunking did. Always tune chunking first.
“RAG is not magic. If your source documents are messy, RAG returns the same mess back to you confidently and with citations.”
The no-code route: building RAG in n8n
If you do not want to write Python, n8n ships a Vector Store node, an Embeddings node, and an AI Agent node that wire together in twenty minutes. Use the Document Loader to ingest, the Recursive Text Splitter for chunking, OpenAI Embeddings, and Supabase Vector Store. The trade-off: you give up control over batch sizes, custom metadata, and re-ranking. For prototypes and internal tools, that is fine. For customer-facing production, write the code. See the related guide on running local LLMs in n8n if you need to keep your data on-premise.
Frequently asked questions
When should I use RAG versus fine-tuning?
RAG when your data changes often or you need source citations. Fine-tuning when you need the model to talk in a specific style or use a specific vocabulary. The two combine: fine-tune for tone, RAG for facts.
What chunk size should I start with for RAG?
300–500 tokens per chunk with 10–20% overlap is a sensible default for most document types. Tune from there by running an eval set; for highly structured docs (legal, medical) go smaller, for narrative content (blog posts) go larger.
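"Running an eval set" can be as simple as the sketch below: a list of questions, each paired with a phrase the correct chunk must contain, and a hit rate computed per chunking configuration. The names `hit_rate` and `fake_retrieve` are hypothetical, and the canned retriever stands in for your real retrieval call.

```python
from typing import Callable

EvalCase = tuple[str, str]  # (question, phrase the right chunk must contain)


def hit_rate(cases: list[EvalCase], retrieve: Callable[[str], list[str]]) -> float:
    """Fraction of questions whose retrieved chunks contain the expected phrase."""
    if not cases:
        return 0.0
    hits = sum(
        any(expected.lower() in chunk.lower() for chunk in retrieve(question))
        for question, expected in cases
    )
    return hits / len(cases)


def fake_retrieve(question: str) -> list[str]:
    # Stand-in for the real retriever, with canned results for two questions.
    canned = {
        "What is the refund window?": ["Refunds are issued within 14 days."],
        "Which plan has SSO?": ["Our office is in Cape Town."],
    }
    return canned.get(question, [])


cases = [
    ("What is the refund window?", "14 days"),
    ("Which plan has SSO?", "Pro plan"),
]
score = hit_rate(cases, fake_retrieve)
print(f"hit rate: {score:.0%}")
```

Re-index at each candidate chunk size, run the same cases, and keep the size with the highest hit rate; a few dozen hand-written cases are usually enough to see the trend.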
How much does it cost to run a RAG pipeline at small scale?
For under 10,000 documents and 1,000 queries a day, expect roughly $5–20 a month in OpenAI embedding + completion costs, plus a free pgvector instance on Supabase. The bill scales linearly with query volume; at about 30,000 queries a month, that works out to roughly $0.0002–0.0007 per answered question with GPT-4o-mini.
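A rough way to sanity-check the monthly figure for your own workload is sketched below. All inputs are assumptions: the token counts describe a prompt with five 400-token chunks plus instructions, and the per-million prices are GPT-4o-mini list prices as I recall them, so verify them against the current pricing page.

```python
def cost_per_query_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Completion cost of one answered question."""
    return (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    ) / 1_000_000


# Assumptions: 5 chunks x 400 tokens + ~200 tokens of instructions and question,
# a 300-token answer, and prices of $0.15 (input) / $0.60 (output) per million.
per_query = cost_per_query_usd(2_200, 300, 0.15, 0.60)
monthly = per_query * 1_000 * 30  # 1,000 queries a day for 30 days
print(f"per query: ${per_query:.5f}, per month: ${monthly:.2f}")
```

Because the cost is dominated by input tokens, retrieving fewer or smaller chunks is the most direct lever on the bill.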
Do I need a vector database, or can I use plain SQL?
You need vector similarity search, which Postgres handles via the pgvector extension. So "plain SQL + pgvector" counts as a vector database for our purposes. Pure Postgres without pgvector cannot do approximate nearest neighbour search efficiently at scale.
Why are my RAG answers still wrong even with good retrieval?
Three usual culprits: chunks are too big and dilute meaning; no re-ranker on top of vector search; or the prompt does not instruct the model to refuse when the context is insufficient. Fix in that order.
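The second culprit, a missing re-ranker, amounts to re-scoring the vector-search candidates with a stronger relevance signal and keeping only the best few. The sketch below shows that shape only: `overlap_score` is a toy word-overlap scorer used as a placeholder, where a production setup would plug in a cross-encoder model or a hosted re-ranking API.

```python
import re
from typing import Callable


def rerank(
    question: str,
    candidates: list[str],
    score: Callable[[str, str], float],
    top_n: int = 3,
) -> list[str]:
    """Re-order retrieved chunks by a relevance score and keep the best top_n."""
    ranked = sorted(candidates, key=lambda text: score(question, text), reverse=True)
    return ranked[:top_n]


def overlap_score(question: str, text: str) -> float:
    """Toy scorer: share of question words that also appear in the text."""
    q_words = set(re.findall(r"[a-z0-9]+", question.lower()))
    t_words = set(re.findall(r"[a-z0-9]+", text.lower()))
    return len(q_words & t_words) / len(q_words) if q_words else 0.0


# Pretend these came back from vector search, in similarity order.
candidates = [
    "Pricing starts at $10 per seat.",
    "Refund requests must be filed within 14 days.",
    "Refund policy: we refund unused seats on request.",
]
best = rerank("What is the refund policy?", candidates, overlap_score, top_n=2)
print(best)
```

A common pattern is to over-fetch from the vector store (say, the top 20) and pass only the re-ranked top 3 to 5 into the prompt.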
Can RAG work with local, self-hosted LLMs?
Yes. Swap the OpenAI client for an Ollama or vLLM endpoint and use a local embedding model like BAAI/bge-large-en-v1.5. Latency goes up, privacy goes up, cost drops to electricity. See the local LLM guide for the wiring.