Build a Self-Correcting RAG Pipeline with n8n and a Local LLM

Updated Jun 4, 2026

A self-correcting RAG pipeline retrieves chunks from your vector store, asks the LLM to answer, then asks a second LLM call to grade whether the answer is actually grounded in the retrieved chunks. If the grade is low, it re-retrieves with a rewritten query and tries again, up to a small cap. In n8n with a local LLM, the whole loop is six nodes.

Key takeaways

Cap the self-correction loop at 2 retries. Past that, latency doubles and accuracy gains vanish.
Use a DIFFERENT prompt for the critic than the generator — a critic that thinks like the generator agrees with it.
Run with a local LLM (Ollama with DeepSeek-R1 8B or Llama 3.1 8B) when your documents contain sensitive data, so that no API call leaves your network.
Re-rank retrieved chunks BEFORE the first answer with a cross-encoder — 70% of "self-correction" wins are actually retrieval wins.
Log every (question, retrieved chunks, answer, critique) tuple to Postgres. That data is your next prompt revision.

Quick definition

RAG, short for retrieval-augmented generation, is the pattern where you fetch relevant text from your own documents and stuff it into the LLM's prompt so the model can answer from your data instead of guessing. A vector store is a database that indexes text by meaning rather than keywords; Qdrant, Chroma, and Postgres with pgvector are the common picks.

The day my pipeline started lying confidently

In June 2024 I shipped a RAG bot for an internal docs corpus, around 4,200 markdown files. It worked beautifully in testing. Three weeks later a colleague named Priya pinged me a screenshot. The bot had told her our retention policy was 90 days. The actual policy, written in the very docs the bot was supposed to read, was 30 days. The retrieval had pulled the wrong section. The model had paraphrased it confidently. Nobody noticed for nineteen days. That is the moment I started building grading into every pipeline. The fix took an afternoon. The trust took three months.

The opinion I will defend

Plain RAG without a grader is worse than no RAG, because it launders bad retrieval into confident prose. The mechanism: the model has no signal that it should refuse. Add a grader and you give it that signal. The cost of being wrong about this is that you ship a system that hallucinates inside your own data, and your users learn to distrust it. I would rather have a bot that says "I do not know" 8% of the time than one that lies 8% of the time. Hold this loosely for low-stakes use cases like brainstorming, where a hallucination is just a bad suggestion.

Figure: n8n workflow canvas showing a retrieval node feeding a generator node that loops back through a grader node.

The six nodes

Webhook: receives the user question.
Embedding HTTP Request: calls your local embedding model (nomic-embed-text via Ollama works well) to turn the question into a vector.
Vector Store query: Qdrant or Postgres with pgvector. Return the top 5 chunks.
Generator LLM: local Llama 3.1 8B with a tight prompt: answer only from the chunks, cite the chunk index, and say "I do not know" if the chunks do not support an answer.
Grader LLM: a second call to the same model with a different prompt: given the question, the answer, and the chunks, return a JSON object with grounded (true or false) and a one-sentence reason.
IF node: if grounded is false and the attempt count is under 2, route to a third, short LLM call that rewrites the query, then loop back to the embedding step. Otherwise, return the answer.
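If it helps to see the control flow outside the canvas, here is a minimal Python sketch of the same loop. It is not what n8n runs internally; the function name self_correcting_answer, the Result type, and the four injected callables (retrieve, generate, grade, rewrite) are my own placeholders standing in for the embedding plus vector store query, the generator, the grader, and the query rewriter.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Result:
    answer: str
    grounded: bool
    attempts: int  # retries used; 0 means the first pass was accepted
    query: str     # the query that produced the final answer


def self_correcting_answer(
    question: str,
    retrieve: Callable[[str], List[str]],            # embedding + vector store query
    generate: Callable[[str, List[str]], str],       # generator LLM call
    grade: Callable[[str, str, List[str]], bool],    # grader LLM call -> grounded?
    rewrite: Callable[[str], str],                   # short query-rewrite LLM call
    max_retries: int = 2,
) -> Result:
    """Retrieve, answer, grade; on a failed grade, rewrite the query and retry."""
    query = question
    attempts = 0
    while True:
        chunks = retrieve(query)
        # The generator always sees the original question, only retrieval changes.
        answer = generate(question, chunks)
        grounded = grade(question, answer, chunks)
        if grounded or attempts >= max_retries:
            return Result(answer, grounded, attempts, query)
        query = rewrite(query)
        attempts += 1
```

With max_retries=2 the worst case is three generator calls and three grader calls per question. When the result comes back with grounded still false, the caller decides what to show the user; returning "I do not know" is the conservative choice.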

The grader prompt that actually works

Keep it short. Ask the model to compare claims in the answer against the retrieved chunks, not against its own knowledge. The exact phrasing I use: "Return JSON only. Schema: {grounded: boolean, reason: string}. grounded is true only if every factual claim in ANSWER appears in CHUNKS. Disregard your own knowledge." Long graders drift. Short graders judge.
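A rough sketch of how that prompt could be assembled and how the reply could be parsed, in Python. The instruction string is the one quoted above; everything else is my assumption: the helper names build_grader_prompt and parse_verdict, the QUESTION / ANSWER / CHUNKS layout, and the choice to treat any malformed reply as not grounded.

```python
import json
from typing import List, Tuple

GRADER_INSTRUCTIONS = (
    "Return JSON only. Schema: {grounded: boolean, reason: string}. "
    "grounded is true only if every factual claim in ANSWER appears in CHUNKS. "
    "Disregard your own knowledge."
)


def build_grader_prompt(question: str, answer: str, chunks: List[str]) -> str:
    """Lay out the grader input; the section labels are an assumed convention."""
    numbered = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks))
    return (
        f"{GRADER_INSTRUCTIONS}\n\n"
        f"QUESTION:\n{question}\n\n"
        f"ANSWER:\n{answer}\n\n"
        f"CHUNKS:\n{numbered}"
    )


def parse_verdict(raw: str) -> Tuple[bool, str]:
    """Parse the grader reply, failing closed: anything malformed is not grounded."""
    # Small local models often wrap JSON in code fences or extra text,
    # so take the span from the first "{" to the last "}".
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return False, "grader returned no JSON object"
    try:
        data = json.loads(raw[start : end + 1])
    except json.JSONDecodeError:
        return False, "grader returned invalid JSON"
    grounded = data.get("grounded")
    if not isinstance(grounded, bool):
        return False, "grounded was not a boolean"
    return grounded, str(data.get("reason", ""))
```

Failing closed costs an extra retry now and then, but it means a grader that rambles instead of returning JSON can never wave a bad answer through.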

Numbers from my own runs

On a 4,200-document internal corpus, plain RAG got 71% of answers correct against a hand-graded eval set of 200 questions. The same pipeline with a grader and one retry got 86%. With two retries, 89%. Past two retries the gains vanished and latency doubled. Caveat: this is one corpus, my eval set, my prompts. Treat the numbers as a best guess at the shape of the improvement, not a benchmark.

What it costs to run

On a single RTX 3060 with Ollama, average end-to-end latency landed at 2.1 seconds for the happy path and 4.4 seconds when the grader triggered a retry. Throughput topped out at around 20 concurrent users before the GPU saturated. If you need more, the same workflow scales horizontally: point n8n at a pool of Ollama instances behind a small round-robin load balancer.
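In practice the round-robin usually lives in a reverse proxy in front of the Ollama instances, but the selection logic itself is tiny. A minimal Python sketch, assuming two instances on Ollama's default port 11434; the hostnames ollama-a and ollama-b and the RoundRobinPool class are made up for illustration.

```python
import itertools
import threading
from typing import Iterable


class RoundRobinPool:
    """Hand out backend base URLs in strict rotation, safe across threads."""

    def __init__(self, base_urls: Iterable[str]) -> None:
        urls = list(base_urls)
        if not urls:
            raise ValueError("need at least one Ollama base URL")
        self._cycle = itertools.cycle(urls)
        self._lock = threading.Lock()

    def next_url(self) -> str:
        # The lock keeps two concurrent requests from advancing the iterator at once.
        with self._lock:
            return next(self._cycle)


# Hypothetical hostnames; 11434 is the port Ollama listens on by default.
pool = RoundRobinPool([
    "http://ollama-a:11434",
    "http://ollama-b:11434",
])
```

Each request calls pool.next_url() and sends its generator or grader call there. Plain rotation ignores how busy each GPU is, which is fine while the instances are identical; with mixed hardware, a least-connections policy in the proxy is likely the better fit.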

The mistake I keep seeing

People put the grader call on the same model with the same prompt and wonder why it agrees with itself. Use a different prompt. Better: use a slightly different model. I sometimes run the generator on Llama 3.1 8B and the grader on Qwen 2.5 7B. They disagree more, which is what you want. A grader that always agrees is decoration.

What the other guides get right and what they miss

The official n8n blog post at blog.n8n.io/rag-pipeline lays out the visual workflow, and the Inferensys guide on agentic RAG explains the verification-agent pattern well. Jawwad Ali's open-source self-healing-rag repo on GitHub ships an end-to-end implementation on n8n plus Neon Postgres plus OpenAI, with no Python runtime, which is the cleanest reference build I have seen. What none of them quantify is the latency hit of the grader retry, and almost none of them suggest using a different model for grading than for generating. Those two choices are what make the loop trustworthy without making it slow.

Frequently asked questions

Do I need a vector store or can I just keyword-search?

Hybrid is best. Pure vector search misses exact identifiers like order numbers and SKUs. Pure keyword search misses paraphrases. Most production RAG I see now runs BM25 and vector search in parallel and merges the results with a reranker. Qdrant added native hybrid search in its 1.10 release in mid-2024.
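To make the merge step concrete, here is a small Python sketch using reciprocal rank fusion (RRF). To be plain about the swap: RRF is a rank-merging formula, not a learned reranker, so treat it as a cheap stand-in for combining the two result lists before (or instead of) a cross-encoder. The function name, the document ids, and the constant k=60 are illustrative choices, not something taken from a specific library.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge ranked id lists: each list contributes 1 / (k + rank) per document."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for one query from the two retrievers.
bm25_hits = ["sku-123", "doc-a", "doc-b"]
vector_hits = ["doc-a", "doc-c", "sku-123"]
merged = reciprocal_rank_fusion([bm25_hits, vector_hits])
# merged == ["doc-a", "sku-123", "doc-c", "doc-b"]
```

Documents that both retrievers agree on float to the top, while the exact-match SKU that only ranks well in BM25 still survives into the merged list.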

What about LangGraph or LlamaIndex instead of n8n?

Use them if your team writes Python every day. n8n wins when non-engineers need to see and tweak the flow, and when the rest of your automations already live in n8n. The patterns are the same; the canvas is different.

How big should each chunk be?

Start at 500 tokens with a 50-token overlap and adjust by reading actual bad answers. For technical docs I usually end up around 350. For long-form prose, 700. There is no universal number.
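A minimal sliding-window chunker in Python shows what those two numbers mean. One loud assumption: it works on whatever token list you hand it, and splitting on whitespace is only a rough proxy for the tokenizer your embedding model actually uses. The function name chunk_tokens is mine.

```python
from typing import List


def chunk_tokens(tokens: List[str], size: int = 500, overlap: int = 50) -> List[List[str]]:
    """Split tokens into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be at least 0 and smaller than size")
    step = size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start : start + size])
        # Stop once a window reaches the end, so no tail chunk is pure overlap.
        if start + size >= len(tokens):
            break
    return chunks


# Whitespace split as a stand-in tokenizer (an approximation, see above).
words = "one two three four five six seven eight nine ten".split()
small_chunks = chunk_tokens(words, size=4, overlap=1)
```

Changing size to 350 or 700 is then a one-argument experiment, which makes it cheap to re-chunk and re-run the eval set after reading a batch of bad answers.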

Can the grader run on the OpenAI API while the generator stays local?

Yes, and it is a sensible split if you trust the cloud for the grading prompt but not for the data itself. The grader sees the retrieved chunks, though, so some of your document text still leaves your network. Think about what is in those chunks before you ship that hybrid setup.

Build the loop on a Saturday. Ship it Monday. Watch the wrong answer rate drop on Friday. That is the entire arc.