Connect Local LLMs to Automation Workflows
7 min read · Updated Jun 4, 2026
You connect a local LLM to an automation workflow by running Ollama or LM Studio on a machine you own, exposing its OpenAI-compatible HTTP endpoint, and pointing your n8n, Make, or Zapier HTTP step at that URL. Nothing leaves your network. You pay zero per-token fees. The whole setup takes about thirty minutes the first time and five minutes every time after.
Key takeaways
- A 7B or 8B local model handles 80% of automation tasks (classification, extraction, simple rewriting) at one ten-thousandth of the cloud-API cost.
- Run Ollama on a $400 mini-PC or any GPU with 8GB+ VRAM — it exposes an OpenAI-compatible /v1/chat/completions endpoint so existing client code works unchanged.
- Point n8n / Make / Zapier HTTP nodes at http://your-host:11434/v1/chat/completions — no SDK swap needed.
- Always keep the local LLM behind a private network or VPN; Ollama ships with no auth by default.
- Reserve frontier cloud models (GPT-4o, Claude 3.5 Sonnet) for genuinely hard reasoning chains. Routing the easy 80% locally is where the margin lives.
Why I stopped sending automation traffic to OpenAI
In March 2024 I ran a small support pipeline through GPT-4o-mini. Around 12,000 tickets a month, mostly classification and a one-line reply suggestion. The bill came in at $86 for the month. Manageable. Then a client asked me to add summarisation of attached PDFs. Same volume, longer prompts. The next bill was $612. I sat there reading the invoice with a coffee going cold and realised I had built a business model that gets worse the more it works. That week I moved the classification step to a local Llama 3 8B running on an old desktop with a 3060. Bill the next month for that step: $0. Latency actually dropped because there was no network hop.
What a local LLM actually is
A local LLM is a language model whose weights sit on disk on a machine you control, and whose inference runs on that machine's CPU or GPU. There is no API key going to a vendor. The most useful open-weight models as of mid-2025 are Meta's Llama 3.1 (8B and 70B), Mistral's Mistral 7B and Mixtral 8x7B, Microsoft's Phi-3 family, Google's Gemma 2, and DeepSeek's V2 and R1 distilled variants. All of them are free for commercial use under their respective licences. Check the licence before shipping. Llama and Gemma have small carve-outs that almost never matter, but read them once.
The unpopular opinion
Most teams reach for OpenAI or Anthropic for tasks a 7B model would handle on a $400 mini PC. Classification, extraction, simple rewriting, tagging, routing decisions, short summaries — none of these need a frontier model. If your prompt is under 2,000 tokens and your output is under 500, a local 7B or 8B model will give you answers that are good enough at roughly one ten-thousandth of the marginal cost. The mechanism is simple: cloud APIs charge for every token whether the task is hard or trivial; your electricity bill does not care. The cost of being wrong about this is your margin. I have watched two small SaaS products go from 60% gross margin to 30% inside a quarter because nobody tracked the API spend until it mattered. Hold this opinion loosely on tasks that genuinely need reasoning chains over long contexts. There a frontier model still earns its keep.
The three tools that actually work
- Ollama — the default. One install, then ollama pull llama3.1 and you have an OpenAI-compatible server on localhost:11434. Per Ollama's docs as of October 2025, it ships with built-in support for over 100 model variants. Works on macOS, Linux, and Windows.
- LM Studio — a desktop GUI for people who do not live in a terminal. Same OpenAI-compatible endpoint, plus a chat window for sanity-checking prompts before you wire them into a workflow.
- LocalAI — a Docker-first server that speaks the OpenAI API and can also serve Whisper for audio and Stable Diffusion for images. Worth it if you want one endpoint for several modalities.
The minimal n8n wiring
In n8n you have two paths. The first is the native Ollama Chat Model sub-node, which n8n added in its 1.19 release in February 2024. You drop it into an AI Agent or Basic LLM Chain node and point the Base URL at http://localhost:11434. The second is the generic HTTP Request node pointing at http://localhost:11434/v1/chat/completions with the standard OpenAI JSON body. I use the HTTP Request version more often because it gives me explicit control over timeouts and retries, which matters once you put this in front of real traffic.
A request body that works the first time
POST to /v1/chat/completions with a JSON body containing model (for example llama3.1:8b), messages (the standard OpenAI array of role and content objects), temperature (0.2 for classification, 0.7 for drafting), and max_tokens (cap it; runaway generations are the most common silent failure I see). Set the HTTP node timeout to 120 seconds on first run, then tighten it once you know your actual p95.
Hardware you actually need
- For a 7B or 8B model in 4-bit quantisation: 8 GB of system RAM is the floor, 16 GB is comfortable. A consumer GPU with 8 GB of VRAM, such as an RTX 3060, gets you roughly 40 to 60 tokens per second. CPU-only works but expect 5 to 15 tokens per second on a modern laptop.
- For a 70B model in 4-bit: 48 GB of unified memory on Apple Silicon, or a 24 GB GPU like a 3090 or 4090, plus patience. Throughput will be 5 to 20 tokens per second depending on context length.
- For high-throughput production: a single A10 or L4 in a cloud VM runs an 8B model at hundreds of tokens per second and costs roughly $300 a month at sustained use. Often still cheaper than equivalent API spend past a few million tokens a day.
The thing that will bite you
Ollama by default binds to 127.0.0.1, which means your n8n container running in Docker on the same machine cannot reach it. Set OLLAMA_HOST=0.0.0.0:11434 in the Ollama environment, restart it, and reference http://host.docker.internal:11434 from inside the n8n container on macOS and Windows, or the host's bridge IP on Linux. I have spent more hours than I want to admit on this one cable.
How this stacks up against what the top guides say
The official n8n blog guide at blog.n8n.io/local-llm walks you through a Docker Compose stack that bundles n8n with Ollama and Qdrant, which is the cleanest starting point if you want everything in one container set. Apu Chakraborty's dev.to walk-through ("The Ultimate Guide to Running n8n with Ollama LLM Locally Using Docker") and Kardelen Cihangir's PDF-extraction tutorial from May 2025 both reach the same conclusion the official docs do: the friction is networking, not the model. Where my piece pushes further is the cost crossover math and the OLLAMA_KEEP_ALIVE tax. Those two will decide whether the project survives its first month in production.
Frequently asked questions
Is a local LLM actually cheaper than the OpenAI API?
It depends on volume. For the small-batch hobby pipeline running a few hundred calls a day, the OpenAI API at GPT-4o-mini rates (per OpenAI's 2024 pricing page, $0.15 per million input tokens) is cheaper than buying any hardware. The crossover for an 8B-class task tends to land somewhere between one and five million tokens a day depending on your electricity and whether you already own the machine.
Will a local 8B model embarrass me compared to GPT-4o?
On classification, extraction, routing, and short summarisation: no, not at temperature 0.2 with a tight prompt. On open-ended reasoning over long contexts, multi-step planning, or anything code-heavy past a few hundred lines: yes, often. Pick the task, not the model.
Can I run this on a Mac mini?
Yes. The M2 and M4 Mac minis with 16 GB or more of unified memory run 8B models at usable speeds through Ollama. I have one as a permanent inference box behind a Cloudflare Tunnel for a couple of small client workflows.
How do I keep the model up when n8n calls it?
Set OLLAMA_KEEP_ALIVE to a long value such as 24h. By default Ollama unloads the model after five minutes of inactivity, and the first request after that pays a multi-second cold-start tax that will make your workflow look broken.
What about MCP and agentic workflows on local models?
MCP (Model Context Protocol, Anthropic's open spec from late 2024 for connecting LLMs to tools) works fine against local models that support tool calling, which Llama 3.1 and Qwen 2.5 do. Agentic loops — where the model decides which tool to call next — are more sensitive to model quality, so test the loop on your hardest real case before assuming the 8B handles it.
Set this up once on a weekend, point one workflow at it, leave it alone for a month, and look at the API bill that did not arrive. That moment is the whole pitch.