Private AI Automation Workflows: A Self-Hosted Guide
9 min read · Updated Mar 30, 2026
Private AI automation means every prompt, every token, every embedding stays inside infrastructure you control. No round-trip to OpenAI. No data-processing addendum to negotiate. No "we may use your data to improve our models" footnote to argue about with legal. In 2026 this is finally a practical option for normal teams — not just defence contractors — because consumer-grade GPUs run useful 70B models, Ollama makes the API match OpenAI’s shape, and n8n self-hosted ties it together with no licence drama. This article walks the architecture, the hardware reality, the cost math, and the compliance overlay that turns a self-hosted setup into a defensible one.
Key takeaways
- A laptop with 8 GB VRAM runs Llama 3 8B (4-bit) usefully; 24 GB VRAM unlocks Llama 3 70B 4-bit quality.
- Ollama listens on
localhost:11434with an OpenAI-compatible chat endpoint — most n8n nodes drop in unchanged. - Self-hosted breakeven vs OpenAI: around 30M output tokens/month at OpenAI 2024 prices — below that, cloud is cheaper; above, on-prem wins.
- "The model is local" is not the same as "compliant." Audit logging, access control, encrypted volumes, deletion schedules — those are the compliance bar.
- Quality gap to GPT-4o is real but narrowing. Plan for hybrid: local for sensitive data, cloud for the hardest reasoning steps.
When private AI is actually worth it
Private AI is the right answer when your data is subject to a regulated regime (HIPAA, GDPR Article 6/9, SOC 2 confidentiality criteria, ISO 27001), when your customer contracts forbid third-party AI processing, or when your data is your moat (proprietary research, M&A documents, source code with IP encumbrances). It is the wrong answer when you have no regulatory pressure, your volumes are small, and your engineering team is already stretched — in that case OpenAI’s zero-data-retention API and a strong DPA is the rational choice. Decide based on policy and volume, not vibes.
The reference architecture
The minimal stack: one Linux host with a consumer or workstation GPU, Docker, three containers — Ollama for inference, n8n for orchestration, and Postgres for both n8n’s state and an audit table. Optional fourth container: ChromaDB (or Qdrant) for RAG. Everything talks over a private Docker network. The host’s only inbound port is SSH and the n8n UI (behind your VPN or reverse proxy + SSO). Nothing reaches out to a third-party AI provider.
# docker-compose.yml — minimal private AI stack
services:
ollama:
image: ollama/ollama:latest
volumes:
- ./ollama:/root/.ollama
ports:
- "127.0.0.1:11434:11434" # bind to localhost only
deploy:
resources:
reservations:
devices:
- capabilities: [gpu]
n8n:
image: n8nio/n8n:latest
environment:
- N8N_HOST=n8n.internal.example
- DB_TYPE=postgresdb
- DB_POSTGRESDB_HOST=postgres
- N8N_ENCRYPTION_KEY=${N8N_ENCRYPTION_KEY}
volumes:
- ./n8n:/home/node/.n8n
depends_on: [postgres, ollama]
postgres:
image: postgres:16
environment:
- POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
volumes:
- ./postgres:/var/lib/postgresql/data # put on encrypted volumeHardware sizing without the marketing
| VRAM | Best local model | Throughput (tok/s, single user) | Use case |
|---|---|---|---|
| 8 GB | Llama 3 8B / Mistral 7B | ~40–60 | Classification, summarisation, extraction |
| 16 GB | Llama 3 8B FP16 / Phi-3 14B | ~30–50 | All of the above + drafting |
| 24 GB (RTX 3090/4090) | Llama 3 70B 4-bit | ~10–16 | Reasoning, RAG, instruction-following near GPT-4 |
| 48 GB+ (A6000 / 2x 3090) | Llama 3 70B 8-bit / Mixtral 8x22B | ~6–12 | Highest local quality, multi-tenant |
A used RTX 3090 with 24 GB VRAM is the sweet spot for a single-team setup in 2026 — around $700–$900 second-hand. It runs Llama 3 70B 4-bit at usable speed and matches GPT-3.5 quality for most automation tasks. CPUs matter much less than VRAM; you can host the whole stack on a refurbished workstation. Model weights are downloaded once via ollama pull llama3:70b and cached on disk forever.
Cost math: when does on-prem beat the cloud?
Take OpenAI 2024 pricing for GPT-4o-mini at $0.60/M output tokens (OpenAI pricing page). A used RTX 3090 + workstation amortised over 3 years is roughly $35/month including electricity. At ~10 output tokens/sec sustained, that machine can produce ~25M output tokens/month if hammered 24/7. So the rough breakeven where on-prem becomes pure cost win: ~30M tokens/month sustained. Below that volume, OpenAI is cheaper after you account for ops time. Above that — or if compliance is the driver regardless of cost — self-hosted wins. Most regulated teams are buying the architecture, not the savings, and that’s fine.
The compliance overlay (the part most guides skip)
"The model is local" is necessary but not sufficient. To satisfy a real auditor, you need four things: (1) an audit table that logs every prompt with timestamp, user, model, hashed input, hashed output — sample schema below; (2) authentication on the Ollama and n8n endpoints (reverse proxy + SSO, never raw 0.0.0.0); (3) encrypted disk volumes for Postgres + Ollama caches (LUKS on Linux, FileVault on macOS hosts); (4) a documented retention policy with a real deletion job that actually runs. If you cannot answer "who saw this prompt and when?" in one SQL query, you are not yet compliant.
-- minimal audit table (Postgres)
CREATE TABLE ai_audit (
id bigserial PRIMARY KEY,
ts timestamptz NOT NULL DEFAULT now(),
workflow_id text NOT NULL,
user_email text NOT NULL,
model text NOT NULL,
prompt_sha256 text NOT NULL,
response_sha256 text NOT NULL,
input_tokens int NOT NULL,
output_tokens int NOT NULL,
data_class text NOT NULL -- e.g. PHI, PII, INTERNAL, PUBLIC
);
CREATE INDEX ai_audit_user_ts ON ai_audit (user_email, ts DESC);
CREATE INDEX ai_audit_class_ts ON ai_audit (data_class, ts DESC);Local vs cloud: the honest comparison
| Dimension | Cloud (OpenAI/Anthropic) | Self-hosted (Ollama + Llama 3 70B) |
|---|---|---|
| Quality on hard reasoning | GPT-4o / Claude 3.5 Sonnet (best) | Llama 3 70B (good, sometimes great) |
| Quality on classification/extract | Excellent | Excellent (parity) |
| Setup time | 20 minutes | 1–2 days |
| Ongoing ops | Zero | Real — driver updates, model upgrades |
| Per-token cost at scale | $0.15–$15/M | Fixed hardware cost, ~$0 marginal |
| Data leaves your network | Yes (mitigated by DPA + ZDR) | No |
| Audit story | Vendor logs + your DPA | Your audit table, your control |
A hybrid pattern most regulated teams actually ship
Pure-local is satisfying but quality-limited. Pure-cloud is fast but compliance-loaded. The pragmatic shape: classify the data first with a small local model, and route by data class. Public/marketing content can go to GPT-4o for top quality. Internal data goes through Llama 3 70B locally. PHI/PII never leaves the network. The classifier itself is local and cheap. This pattern keeps the per-call quality high for non-sensitive tasks while keeping a hard boundary for the rest. The audit table’s data_class column above is what makes this defensible.
The legal-firm story that made me believe
July 2024, Tuesday morning. An 11-person commercial-litigation firm I was advising wanted to triage inbound NDAs and supplier contracts — first-pass redline against their playbook, flag unusual indemnity caps, surface anything missing. They’d done a Claude API pilot that worked beautifully. Then InfoSec read the data-processing addendum and killed it before week three. The constraint was non-negotiable: client data could not transit any third-party AI provider, full stop. We built private. Refurbished workstation, used RTX 3090 (24 GB VRAM, $740), Ubuntu, Docker, Ollama running Llama 3 70B 4-bit, n8n self-hosted, watched folder on the firm’s SMB share. Throughput was the trade-off: ~4 NDAs/hour versus 60+ on Claude. Quality was the second trade-off: Llama missed roughly 1-in-8 of the weird indemnity edge cases Claude caught. The partner added a 5-minute final-pass step. Net win: 30 minutes per NDA versus the previous fully manual baseline, on roughly 6 NDAs/day. October that year they did a SOC 2 readiness audit. The auditor asked "where do the prompts go?" and the principal pulled up the audit table — hashed prompt, model, user, timestamp — and showed the retention job in cron. Zero follow-up questions on that control. That conversation is when private AI stopped being a hobby project for the team and started being part of the firm’s defensibility story. The 30 minutes/NDA was real. The audit conversation was the actual win.
The opinion I will defend
What to do next
If you’re early: install Ollama on the strongest laptop in the office, run ollama pull llama3, point an n8n workflow at it, do one real classification task end-to-end this week. See connecting local LLMs to n8n for the exact node setup. If you’re scaling: move to a dedicated host with a 24 GB GPU, add the audit table above, put authentication in front of every endpoint, and write the retention job. If you’re already there: layer on the data-class routing pattern so cloud models can carry the hardest non-sensitive workloads while local handles everything regulated.
“Private AI isn’t about avoiding the cloud. It’s about being able to answer one question — "who saw this prompt and when?" — without phoning a vendor.”
Frequently asked questions
Frequently asked questions
What is a private AI automation workflow?
A workflow where the language model, orchestrator, and any vector store all run on infrastructure you control — typically Ollama + n8n + Postgres + (optional) ChromaDB on a Linux host with a GPU. No prompt ever leaves your network.
What hardware do I need for private AI?
For experimentation, any laptop with 8 GB VRAM runs Llama 3 8B 4-bit. For a small team, a refurbished workstation with a used RTX 3090 (24 GB VRAM, around $700–$900) runs Llama 3 70B 4-bit at near-GPT-3.5 quality. CPUs and disk matter much less than VRAM.
Is local AI as good as ChatGPT or Claude?
For classification, extraction, summarisation, and structured drafting: roughly at parity in 2026. For the hardest reasoning and longest-context tasks: GPT-4o and Claude 3.5 Sonnet still lead. Plan for hybrid — keep local for sensitive data, route public/non-sensitive work to a cloud model when quality matters most.
Does private AI satisfy HIPAA or GDPR by itself?
No — not on its own. Local inference removes the third-party processor problem, but HIPAA and GDPR still require access control, audit logging, encryption at rest, and documented retention/deletion. Build the audit table, enforce SSO at the proxy, encrypt the volumes, run a deletion cron. Then you have a defensible posture.
How much does a private AI setup cost?
Roughly $800–$1,500 one-time for a refurb workstation + used GPU + small SSD, plus electricity (about $15–$25/month on a 24x7 RTX 3090 idle/light load). Software is open-source: Ollama, n8n (BSL with fair-use), Postgres, ChromaDB are free for internal use. The real ongoing cost is engineering attention.
Can I run a private AI workflow without writing code?
Mostly yes. n8n’s Ollama Chat Model node is point-and-click. The audit-log step is a single Postgres node. The compliance overlay (SSO, encrypted volumes, retention job) does need an engineer or sysadmin once, then it runs itself.