How to Build a Self-Correcting LLM Agent Workflow in n8n

Q: What is a self-correcting LLM agent?

An AI workflow that includes a validation step after each LLM call, and if validation fails, feeds the error back to the model with instructions to fix it. The loop typically caps at 2 retries before escalating to a human triage queue.

Q: How many retries should I allow?

Two, with a hard cap. The data is clear: ~95% of correctable failures self-correct on attempt 2; further attempts add cost and latency without meaningful gain. Past 2 retries, the right call is a human triage channel — not more model calls.

Q: Should I use a code validator or an LLM critic?

Both, in order: code first, LLM critic only if code passes. Code catches 90% of real failures (missing fields, type errors, sums that don’t balance) at zero cost. LLM critics are useful for semantic checks ("is this summary actually about the input?") where code can’t judge.

Q: How do I prevent the critic agreeing with the generator?

Use a different system prompt for the critic, and explicitly tell it "you are the critic, not the generator. Assume the output is wrong until proven otherwise." Models default to agreement if you let them. Forcing adversarial framing flips the failure mode.

Q: What does this cost to run?

At OpenAI 2024 GPT-4o-mini pricing, about $0.0005 per processed invoice/document including expected retries. A workflow handling 4,800/month costs about $2.40 in LLM calls plus your hosting (n8n on a $5 VPS). The historic alternative — human review at $20/hr — is ~340x more expensive.

Q: When should I NOT use a self-correcting loop?

When the task is fully deterministic (use code, not an LLM) or when failures are rare and low-cost (just retry the whole thing once on error). The loop earns its keep when failures are frequent (>5%) and each failure costs real time or money to fix manually.

6 min read · Updated Mar 30, 2026

A self-correcting LLM agent in n8n is a small loop: generate, validate, regenerate with the error injected, cap at 2 retries. That cap is the part the academic papers don’t emphasise enough. Two retries captures most of the lift; three doubles your cost and latency for ~1% extra success. This guide walks the exact n8n node graph, the validator code, the critic prompt, and the production guardrails that turn this from a research idea into a workflow you trust at 2 a.m.

Key takeaways

Cap retries at 2. Past that, latency doubles and the success rate barely moves.
Use a different prompt for the critic than for the generator. A critic that thinks like the generator agrees with it.
Validate with deterministic code (schema, type check, range check) before any LLM critic step. Most errors don’t need a model to catch.
Always log every attempt + every error to a Postgres table. That’s your training data for the next prompt revision.
If you hit retry-cap, fail loudly to a #triage Slack channel — don’t silently accept the last attempt.

The 5-node loop

The minimal self-correcting agent in n8n.
#	Node	Job
1	Trigger	Webhook / Schedule — carries the input + an attempt counter (init 0)
2	AI: Generator	LLM produces structured JSON output
3	Code: Validate	Schema check + business rules — returns `{ valid, errors }`
4	IF	valid → downstream destination; not valid + attempts < 2 → loop back; not valid + attempts >= 2 → triage
5	Set: Build retry prompt	Append previous output + error to original prompt, increment counter

The validator (deterministic, code-only)

Don’t use an LLM as your first validator. Validate with code. A schema check catches 90% of LLM mistakes (missing field, wrong type, value out of range) for $0 and 5ms. Reserve the LLM critic for the harder cases where code can’t tell whether the answer makes sense (e.g., "is this summary actually about the input document?"). Always run code-validate first; only run critic-validate if code-validate passes.

javascript

// Code node: deterministic validator for invoice extraction
const output = $input.first().json;

const errors = [];

// 1. Required fields
for (const f of ["vendor", "invoice_number", "total", "currency", "issue_date", "line_items"]) {
  if (output[f] === undefined || output[f] === null || output[f] === "") {
    errors.push(`missing required field: ${f}`);
  }
}

// 2. Types
if (typeof output.total !== "number") errors.push("total must be a number");
if (!Array.isArray(output.line_items)) errors.push("line_items must be an array");
if (output.line_items?.length === 0) errors.push("line_items cannot be empty");

// 3. Business rules
if (output.line_items?.length) {
  const sum = output.line_items.reduce((a, li) => a + (Number(li.amount) || 0), 0);
  if (Math.abs(sum - output.total) > 0.02) {
    errors.push(`line items sum (${sum.toFixed(2)}) does not match total (${output.total})`);
  }
}

if (!/^\d{4}-\d{2}-\d{2}$/.test(output.issue_date)) {
  errors.push("issue_date must be YYYY-MM-DD");
}

return [{ json: { valid: errors.length === 0, errors, output, attempts: $json.attempts ?? 0 } }];

The retry prompt: feed the model its own error

text

You produced the following output on a previous attempt. It failed validation.
Fix the SPECIFIC issues listed and return a corrected JSON object.
Do not explain. Return only JSON matching the original schema.

Previous output:
```
{{ JSON.stringify($json.previous_output, null, 2) }}
```

Validation errors:
{{ $json.errors.map(e => "- " + e).join("\n") }}

Original input:
"""
{{ $json.input }}
"""

Optional: the critic LLM for semantic checks

For tasks where code can’t judge quality (summarisation, drafting, classification reasoning) add an LLM critic after code-validate passes. Use a different system prompt than the generator: the critic’s only job is to find faults, not to agree. Models default to "yeah, looks good" if you let them. The critic returns { pass, severity, issues }; severity = critical retries the loop, severity = minor passes through with a flag.

text

You are a strict critic of AI-generated invoice extractions.
You are NOT the generator. Your only job is to find errors.
Assume the generator made a mistake until proven otherwise.

Return ONE JSON object:
{
  "pass": true | false,
  "severity": "critical" | "minor" | "none",
  "issues": ["specific issue 1", "specific issue 2"]
}

Rules:
- Cross-check line_items.amount sums against total.
- Check if vendor name appears verbatim in the input.
- Check if invoice_number appears verbatim in the input.
- If anything in the output is not findable in the input → critical.
- If line items are merged or split vs the input → critical.
- If a date is plausibly correct but format-wrong → minor.

Input:
"""
{{ $json.input }}
"""

Extraction to critique:
```
{{ JSON.stringify($json.output, null, 2) }}
```

The story that nailed down "two retries, no more"

September 2024, Friday morning. A 9-person fintech I was helping was running PDF invoice extraction — GPT-4o + a validator + an open-ended retry loop (no cap). 220 invoices/day. The team had set retries to "until valid" because someone read a paper recommending 5. Two things happened in production: average latency for a "hard" invoice ballooned to 47 seconds (sometimes 90+), and OpenAI bill for the month came in at $1,840 vs the projected $400. We dug into the logs. Of 4,800 monthly invoices, ~210 went past 2 retries. Of those 210, only 12 eventually succeeded (success rate of further retries: 5.7%). The other 198 burned an average of 4.3 retries each before either succeeding or failing — ~850 extra LLM calls for 12 successful corrections, ~$71 per saved invoice. We capped at 2 retries the same afternoon, sent everything else to a #triage Slack channel for human review. Following month: latency p95 dropped from 47s to 18s, OpenAI bill back to $420, human-triage queue averaged 14 invoices/week (15 minutes/wk for a person to skim). The papers say "keep retrying"; the operational reality is "every retry past 2 costs more than it earns".

Production guardrails

Cap retries at 2. Hard cap. Triage on retry-3.
Cap total latency at ~30 seconds. If the loop is still running after 30s, abort and triage. Some failures are infinite without a wall clock.
Log every attempt to Postgres. Columns: input_hash, attempt, output, validation_errors, success. Query weekly to find recurring failure patterns; those are your next prompt revisions.
Never silently accept the last attempt. If retries exhaust, the destination is a triage Slack channel, not the database.
Track success-by-attempt. If attempt-1 is <90% on a stable input distribution, the prompt needs work — don’t mask it with retries.

The opinion I will defend

Cost & latency math

Per request, on GPT-4o-mini at OpenAI 2024 pricing ($0.15/M in, $0.60/M out): generator ~1,500 in + 400 out = ~$0.00046. Validator (code) ~$0. Retry (if needed): ~$0.00046. Optional LLM critic: ~$0.00018. Expected cost per request with ~10% retry rate: ~$0.0005. Latency p50 ~3–4s, p95 with one retry ~9–12s, hard cap 30s with triage fallback. Compare with a human reviewing every extraction at $20/hr / 30s per invoice = $0.17/invoice. The agent is ~340x cheaper at the cost of one engineer-day to wire and prompt-tune.

“A self-correcting agent isn’t a model trick. It’s an engineering pattern: cheap deterministic validation, expensive semantic judgement only where needed, hard caps everywhere.”

Frequently asked questions

What is a self-correcting LLM agent?

An AI workflow that includes a validation step after each LLM call, and if validation fails, feeds the error back to the model with instructions to fix it. The loop typically caps at 2 retries before escalating to a human triage queue.

How many retries should I allow?

Two, with a hard cap. The data is clear: ~95% of correctable failures self-correct on attempt 2; further attempts add cost and latency without meaningful gain. Past 2 retries, the right call is a human triage channel — not more model calls.

Should I use a code validator or an LLM critic?

Both, in order: code first, LLM critic only if code passes. Code catches 90% of real failures (missing fields, type errors, sums that don’t balance) at zero cost. LLM critics are useful for semantic checks ("is this summary actually about the input?") where code can’t judge.

How do I prevent the critic agreeing with the generator?

Use a different system prompt for the critic, and explicitly tell it "you are the critic, not the generator. Assume the output is wrong until proven otherwise." Models default to agreement if you let them. Forcing adversarial framing flips the failure mode.

What does this cost to run?

At OpenAI 2024 GPT-4o-mini pricing, about $0.0005 per processed invoice/document including expected retries. A workflow handling 4,800/month costs about $2.40 in LLM calls plus your hosting (n8n on a $5 VPS). The historic alternative — human review at $20/hr — is ~340x more expensive.

When should I NOT use a self-correcting loop?

When the task is fully deterministic (use code, not an LLM) or when failures are rare and low-cost (just retry the whole thing once on error). The loop earns its keep when failures are frequent (>5%) and each failure costs real time or money to fix manually.