Connect a Local LLM to n8n: A Step-by-Step Tutorial

7 min read · Updated Mar 30, 2026

You connect a local LLM to n8n by running Ollama on the same machine (or same Docker network), pointing n8n’s HTTP Request or Ollama Chat Model node at http://localhost:11434 (or at http://ollama:11434 when n8n itself runs in a container), and asking the model for structured JSON. That covers 90% of what people actually want to do. The other 10% (keeping the JSON parseable on a 7B model, surviving an n8n Cloud setup, and not melting your GPU) is where this guide spends its time.

Key takeaways

Ollama exposes an OpenAI-compatible API at port 11434, which is the cleanest way to plug a local model into n8n.
For extraction and classification, an 8B model in 4-bit quant is usually enough and fits in about 8GB of VRAM.
Asking nicely for JSON does not work on small models; pin the schema in the system prompt and validate every output.
n8n Cloud cannot reach your laptop without a tunnel — self-host n8n on the same Docker network if you want a stable setup.
Local LLMs are ready for batch pipelines today. They are not ready to write your customer-facing email.

Why bother running an LLM locally for n8n?

There are two honest reasons. Cost and privacy. Everything else is a story you tell yourself.

On cost: OpenAI lists GPT-4o input at $2.50 per million tokens and output at $10 per million tokens on its 2024 pricing page. That sounds like nothing. Then you point an n8n loop at a folder of PDFs, leave it overnight, and the bill is $40 before you have had coffee. Llama 3 8B running on a GPU you already own does the same parsing for the cost of the electricity.
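The arithmetic behind that overnight bill is easy to check. The sketch below uses the two prices quoted above; the run count and the per-run token counts are hypothetical numbers chosen for illustration, not measurements.

```python
# Prices quoted in the paragraph above (USD per million tokens).
GPT4O_INPUT_PER_M = 2.50
GPT4O_OUTPUT_PER_M = 10.00


def batch_cost_usd(runs: int, input_tokens_per_run: int, output_tokens_per_run: int) -> float:
    """Estimated API spend for a batch, given average token counts per run."""
    input_cost = runs * input_tokens_per_run / 1_000_000 * GPT4O_INPUT_PER_M
    output_cost = runs * output_tokens_per_run / 1_000_000 * GPT4O_OUTPUT_PER_M
    return round(input_cost + output_cost, 2)


# Hypothetical overnight job: 2,000 loop iterations,
# about 6,000 tokens in and 500 tokens out per iteration.
print(batch_cost_usd(2_000, 6_000, 500))  # 40.0
```

Input tokens dominate in extraction work because every run resends the whole document, so long PDFs push the bill up faster than long answers do.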

On privacy: a lot of the documents people actually want to extract data from (invoices, contracts, support transcripts, internal docs) are exactly the documents legal would prefer you did not paste into a third-party API. With a local model, the data never leaves the box. That is the whole pitch.

What you need before you start

A machine with at least 8GB of free RAM (CPU-only) or 8GB of VRAM (for a 4-bit quantised 8B model). 16GB on either side is a much nicer life.
Ollama installed. It runs on macOS, Linux, and Windows and bundles its own model server.
n8n. Either self-hosted (Docker is fine) or n8n Cloud if you are willing to tunnel back to your local box. Self-hosted is simpler.
A small test payload — one invoice, one email, one form submission. Do not start with a thousand records. Start with one you can read end to end.
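The memory figures in the list above follow from simple arithmetic: parameter count times bits per weight. Here is a rough sketch of that estimate. The 2GB headroom for the KV cache and runtime overhead is an assumption for illustration, and real 4-bit quant formats often use slightly more than 4 bits per weight, so treat the result as a floor rather than a guarantee.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Memory needed for the weights alone, in decimal gigabytes."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


def fits(params_billions: float, bits_per_weight: int, memory_gb: float,
         headroom_gb: float = 2.0) -> bool:
    """True if the weights plus an assumed headroom fit in the given memory.

    headroom_gb is a guess covering KV cache and runtime overhead; it grows
    with context length, so raise it for long documents.
    """
    return weight_memory_gb(params_billions, bits_per_weight) + headroom_gb <= memory_gb


print(weight_memory_gb(8, 4))  # 4.0  -> an 8B model at 4-bit needs about 4GB for weights
print(fits(8, 4, 8))           # True -> fits in 8GB with room to spare
print(fits(70, 4, 16))         # False -> 35GB of weights will not fit in 16GB
```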

Step 1: get Ollama running with a model n8n can actually use

Install Ollama, pull a model that can follow instructions reliably, and confirm the OpenAI-compatible endpoint is up. llama3:8b is the default I reach for. It is small enough to run on a laptop GPU, instruction-tuned, and the JSON-mode output is good enough to build on. mistral:7b-instruct is the runner-up.

bash

# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

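# Start the server if it is not already running (the macOS app and the
# Linux install script's service start it for you). Run it in its own terminal.
ollama serve

# In a second terminal: pull an instruction-tuned model that handles JSON well.
# The pull talks to the running server, so it has to come after the step above.
ollama pull llama3:8b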

Install Ollama and pull a model that will not hallucinate JSON.

Sanity-check the OpenAI-compatible endpoint

Ollama exposes both its native /api/chat route and an OpenAI-compatible /v1/chat/completions route on port 11434. n8n speaks both. The OpenAI-compatible one is the path of least resistance because every example you find online maps onto it cleanly.

bash

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3:8b",
    "messages": [
      {"role": "user", "content": "Reply with the word PONG."}
    ]
  }'

If this returns a JSON body, your local LLM is reachable.
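The two routes return differently shaped JSON bodies, which matters once a downstream node has to find the reply text. The sketch below builds a request for each route and pulls the assistant text out of a decoded response. The helper names are made up for this example. The response shapes follow Ollama's documented non-streaming format for /api/chat and the OpenAI chat-completions format.

```python
BASE_URL = "http://localhost:11434"  # same address as the curl check above


def native_chat_request(model: str, messages: list) -> tuple:
    """URL and body for Ollama's native route.

    The native route streams by default, so "stream": False asks for
    a single JSON object instead of a sequence of chunks.
    """
    body = {"model": model, "messages": messages, "stream": False}
    return f"{BASE_URL}/api/chat", body


def openai_chat_request(model: str, messages: list) -> tuple:
    """URL and body for the OpenAI-compatible route."""
    body = {"model": model, "messages": messages}
    return f"{BASE_URL}/v1/chat/completions", body


def reply_text(route: str, response: dict) -> str:
    """Extract the assistant text from a decoded response body."""
    if route == "native":
        return response["message"]["content"]
    return response["choices"][0]["message"]["content"]


messages = [{"role": "user", "content": "Reply with the word PONG."}]
url, body = openai_chat_request("llama3:8b", messages)
print(url)  # http://localhost:11434/v1/chat/completions
```

In an n8n HTTP Request node the same paths apply: the text lives at message.content on the native route and at choices[0].message.content on the OpenAI-compatible one.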

Step 2: wire n8n to your local Ollama

There are two real setups. Self-hosted n8n in Docker on the same machine, or n8n Cloud tunnelling back to your local box. Self-hosted wins for almost everyone.

The Docker setup that just works

The n8n team publish a self-hosted-ai-starter-kit on GitHub that bundles n8n, Ollama, Qdrant, and Postgres into one Compose file. It is the cleanest reference build and the one I would start from. If you want a smaller version, the snippet below is the minimum that lets n8n talk to Ollama on the same Docker network.

yaml

services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama:/root/.ollama

  n8n:
    image: n8nio/n8n:latest
    ports:
      - "5678:5678"
    environment:
      - N8N_SECURE_COOKIE=false
      # n8n reaches Ollama by service name, not localhost. Enter the same URL as the Base URL in n8n's Ollama credential.
      - OLLAMA_BASE_URL=http://ollama:11434
    volumes:
      - n8n_data:/home/node/.n8n
    depends_on:
      - ollama

volumes:
  ollama:
  n8n_data:

Minimal docker-compose to run n8n alongside Ollama on the same network.

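If you insist on n8n Cloud

n8n Cloud cannot reach your laptop on its own, so you have to expose Ollama to the internet through a tunnel such as Cloudflare Tunnel, ngrok, or Tailscale Funnel, and put auth in front of it. Point the n8n node at the tunnel's public URL instead of localhost. It works, but it adds two moving parts, which is why most people end up self-hosting n8n on the same box as Ollama.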

Step 3: get clean JSON out of a 7B model

This is the part that quietly breaks everyone’s first build. A 7B or 8B model will happily wrap your JSON in friendly prose — "Sure, here is the data you asked for: { ... }" — and the next n8n node will explode trying to parse it. The fix is not asking nicely. The fix is pinning the schema in the system prompt, demanding JSON only, and validating the output before you trust it.

I had to learn this the hard way. One Friday afternoon in May 2024 I left a freshly built invoice-extraction workflow chewing through 47 PDFs. By 8 p.m. the n8n executions panel was a solid red bar. Forty of the runs had failed at the JSON parser node because the model had added one chatty sentence before the opening brace. The fix took six lines in the system prompt. The next batch of 200 ran clean.

json

{
  "model": "llama3:8b",
  "format": "json",
  "messages": [
    {
      "role": "system",
      "content": "You are an extraction service. Reply with a JSON object only. Do not add prose, code fences, or commentary. Schema: { \"invoice_number\": string, \"total_amount\": number, \"currency\": string, \"due_date\": string (ISO 8601), \"line_items\": [{ \"description\": string, \"quantity\": number, \"unit_price\": number }] }. If a field is missing, return null. Never invent values."
    },
    {
      "role": "user",
      "content": "<<< paste invoice text here >>>"
    }
  ]
}

A system prompt shape that gets parseable JSON out of an 8B model.

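Three things matter in that request. The "format": "json" flag, which belongs to Ollama's native /api/chat route and forces the response into JSON mode at the runtime level (on the OpenAI-compatible route the equivalent is "response_format": {"type": "json_object"}). The explicit schema in the system message, including types. And the "never invent values" line, which is the closest thing a small model has to a hallucination brake. If you send this body to /api/chat from an HTTP Request node, also set "stream": false so the reply arrives as one JSON object rather than a stream of chunks.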
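The validation step this section calls for can be sketched as follows. It is plain Python for readability. In n8n the same checks would live in a Code node and would need adapting to that node's input and output conventions. The field names mirror the schema in the system prompt above; the function names and error strings are invented for this example, and the date check assumes due_date is a date-only string such as 2024-06-30.

```python
import json
from datetime import date
from typing import Optional

# Field name -> accepted Python types, mirroring the schema in the system prompt.
TOP_LEVEL = {
    "invoice_number": (str,),
    "total_amount": (int, float),
    "currency": (str,),
    "due_date": (str,),
    "line_items": (list,),
}
LINE_ITEM = {"description": (str,), "quantity": (int, float), "unit_price": (int, float)}


def _check(obj: dict, spec: dict, prefix: str) -> list:
    """Return one error string per missing or wrongly typed field."""
    errors = []
    for key, types in spec.items():
        if key not in obj:
            errors.append(f"{prefix}{key}: missing")
        elif obj[key] is None:
            continue  # the prompt allows null for values absent from the document
        elif isinstance(obj[key], bool) or not isinstance(obj[key], types):
            # bool is a subclass of int in Python, so reject it explicitly
            errors.append(f"{prefix}{key}: wrong type")
    return errors


def validate_invoice(raw: str) -> tuple:
    """Parse the model's reply and check it against the schema.

    Returns (record, errors). record is None whenever errors is non-empty.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return None, ["top level is not an object"]

    errors = _check(data, TOP_LEVEL, "")
    items = data.get("line_items")
    if isinstance(items, list):
        for i, item in enumerate(items):
            if isinstance(item, dict):
                errors += _check(item, LINE_ITEM, f"line_items[{i}].")
            else:
                errors.append(f"line_items[{i}]: not an object")
    if isinstance(data.get("due_date"), str):
        try:
            date.fromisoformat(data["due_date"])
        except ValueError:
            errors.append("due_date: not an ISO 8601 date")

    record: Optional[dict] = data if not errors else None
    return record, errors


record, errors = validate_invoice('Sure, here is the data you asked for: {"invoice_number": "INV-1"}')
print(record, errors[0][:14])  # None not valid JSON
```

A check like this catches shape problems: chatty prose around the object, missing keys, strings where numbers belong. It cannot tell whether a well-formed value was invented, which is why the null rule and the "never invent values" line still matter.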

Local LLM vs cloud API: when each one wins

How I actually choose between a local model and a cloud API for n8n workflows.
Use case	Local LLM (llama3:8b)	Cloud API (GPT-4o)
Invoice and form extraction	Strong fit. Cheap, private, fast.	Overkill at $2.50/M input.
Support-ticket classification	Strong fit once you tune the prompt.	Faster to ship, more expensive at scale.
Customer-facing email copy	Avoid. Output reads stiff.	Wins on prose quality.
Multi-step reasoning over long docs	Limited by context window and quality.	Wins. Long context is a real edge.
Cost at 10k runs/month	Electricity and a one-off GPU.	Roughly $30 to $200 depending on tokens.

The mistakes that will burn you

Pointing n8n at localhost:11434 from inside its Docker container. The container’s localhost is itself. Use the service name (ollama) or host.docker.internal.
Loading a 70B model on 16GB of VRAM. It will fall back to CPU, take 90 seconds per run, and you will blame the prompt. Match the model to the hardware.
Skipping the JSON validator node. The model will eventually send back malformed output. Catch it, log it, and route the failure to a "needs review" branch instead of letting it kill the workflow.
No max-iterations cap on agent loops. An LLM node inside a Loop Over Items can rerun thousands of times when something upstream changes shape. Set a limit. Always.
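The last two mistakes share one fix: put a hard ceiling on model calls and send failures to a review pile instead of crashing or retrying forever. Below is a minimal sketch of that control flow in plain Python. The function and parameter names are hypothetical; call_model and is_valid stand in for whatever your workflow uses to query Ollama and validate the reply.

```python
from typing import Any, Callable, Iterable


def run_capped(
    items: Iterable[Any],
    call_model: Callable[[Any], str],
    is_valid: Callable[[str], bool],
    max_calls: int,
    max_attempts_per_item: int = 2,
) -> tuple:
    """Process items with a hard ceiling on total model calls.

    Returns (accepted, needs_review, calls_made). Items that never produce
    a valid reply, or that are reached after the ceiling is hit, land in
    needs_review instead of raising.
    """
    accepted, needs_review = [], []
    calls = 0
    for item in items:
        ok = False
        for _ in range(max_attempts_per_item):
            if calls >= max_calls:
                break  # global cap reached: stop calling the model entirely
            calls += 1
            reply = call_model(item)
            if is_valid(reply):
                accepted.append((item, reply))
                ok = True
                break
        if not ok:
            needs_review.append(item)
    return accepted, needs_review, calls


# Stand-in model for demonstration: canned replies per item, no network calls.
canned = {"a": ['{"ok": 1}'], "b": ["Sure, here you go", '{"ok": 2}'], "c": ["chatty", "chatty"]}
accepted, review, calls = run_capped(
    ["a", "b", "c"],
    call_model=lambda item: canned[item].pop(0),
    is_valid=lambda text: text.startswith("{"),
    max_calls=10,
)
print([item for item, _ in accepted], review, calls)  # ['a', 'b'] ['c'] 5
```

The cap is deliberately global rather than per item: if something upstream changes shape and every reply starts failing, the run stops spending GPU time after max_calls and leaves the remainder for a human to look at.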

“The first time a local-LLM workflow runs clean for 24 hours straight, you stop trusting cloud APIs for that job. Then the boring parts of your stack start looking suspicious too.”

Frequently asked questions

Can n8n use a local LLM without Docker?

Yes. Install Ollama directly on the same machine as n8n, then point the n8n node at http://localhost:11434. Docker only matters when n8n itself runs in a container, because then "localhost" means the container.

What is the cheapest hardware that runs llama3:8b well?

Any GPU with 8GB or more of VRAM handles llama3:8b in 4-bit quant comfortably. A used RTX 3060 (12GB) is the price-to-performance sweet spot at the time of writing. CPU-only works for low volume but feels slow above one request every few seconds.

Can I use n8n Cloud with Ollama on my own machine?

Only if you expose Ollama to the internet through a tunnel like Cloudflare Tunnel, ngrok, or Tailscale Funnel, and put auth in front of it. Most people who try this end up self-hosting n8n on the same box as Ollama because it is two fewer moving parts.

How do I keep the LLM from making up data in extraction tasks?

Pin the schema in the system prompt, use Ollama’s "format": "json" mode, add a "never invent values" instruction, and validate every response with a code node before the rest of the workflow trusts it. Route invalid responses to a review queue, not the bin.

Is a local LLM good enough to replace OpenAI in production?

For extraction, classification, summarisation, and rewrites, yes, today. For long-context reasoning, agent loops, or customer-facing prose, a frontier cloud model still wins. The honest answer is that most production workflows are a mix, and that is fine.

Will llama3:8b fit on a Mac?

Yes, on Apple Silicon with 16GB of unified memory it runs comfortably. M-series GPUs are well supported by Ollama out of the box. Intel Macs without a discrete GPU can run it but the inference speed makes it painful for anything beyond demos.