Deploy Your Micro-SaaS AI API: From Localhost to Production

5 min read · Updated Jun 4, 2026

To deploy a small AI SaaS API to production you need four things: a container, a host that runs the container behind HTTPS, a queue or background worker for any LLM call that takes longer than two seconds, and a way to watch what breaks at 2am. Fly.io, Railway, Render, or a $12 VPS with Caddy will all do the job. Pick the one you will actually log in to.

Key takeaways

Avoid serverless free tiers for AI APIs. Cold starts of 10–20s tank user experience; an always-on $6–12/mo VM is the cheapest fix.
Move any LLM call >2s to a background worker + job-status endpoint. Synchronous HTTP requests die at the platform timeout (10s on Vercel, 30s on most edge runtimes).
Ship error tracking (Sentry) and request logging (Axiom or Better Stack) on day one. Without observability, debugging at 2am is impossible.
Always rate-limit /api/generate at the edge (Upstash Ratelimit) before the request hits your LLM provider. This stops bot abuse from running up your bill.
Use Stripe Customer Portal so users self-manage subscriptions — friction here generates support tickets, not retention.
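To make the rate-limiting takeaway concrete, here is a minimal in-memory sketch of a fixed-window limiter. It is not the Upstash Ratelimit API; the class name, the per-key counting, and the injected clock are all assumptions for illustration, and a real deployment would keep the counters in a shared store at the edge.

```python
import time
from typing import Callable


class FixedWindowLimiter:
    """Allow at most `limit` requests per `window` seconds for each key."""

    def __init__(self, limit: int, window: float,
                 now: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.window = window
        self.now = now  # injectable clock, handy for tests
        self.counts: dict[tuple[str, int], int] = {}

    def allow(self, key: str) -> bool:
        bucket = int(self.now() // self.window)
        # Drop counters from earlier windows so the dict does not grow forever.
        self.counts = {k: v for k, v in self.counts.items() if k[1] == bucket}
        used = self.counts.get((key, bucket), 0)
        if used >= self.limit:
            return False  # reject before the request reaches the LLM provider
        self.counts[(key, bucket)] = used + 1
        return True
```

In a handler you would call something like limiter.allow(api_key) first and return HTTP 429 when it is False.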

The night I learned about cold starts

In December 2023 I had a tiny AI summarisation API serving about 40 paying users. I deployed it to a serverless platform because the free tier was free. A user named Marcus emailed me on a Tuesday night: "every request takes 14 seconds, is the model down?" The model was fine. The platform was spinning up a fresh container for every request because nobody had hit the endpoint in 15 minutes. The fix was moving to a tiny always-on VM, which cost me $6 a month instead of $0. Marcus stayed. I learned that free is sometimes the most expensive option you can pick.

The opinion most indie hackers will not like

For a sub-$10k MRR AI SaaS, serverless is the wrong default. Cold starts ruin perceived quality, per-invocation pricing makes streaming responses awkward, and the savings disappear the moment your traffic looks anything like real usage. A $12 always-on VPS with a single Caddy reverse proxy, a Docker container, and a Postgres-backed queue will out-serve most Lambda setups for the first two years. The mechanism: LLM responses are heavy, often streamed, and benefit from a warm process. The cost of being wrong about this is that your users pay the cold-start tax and you lose them quietly. Hold this opinion loosely once your traffic turns spiky and bursty; at that point serverless earns its keep.

[Image: server status dashboard showing API request latency and a queue depth graph for an AI inference endpoint]

The minimum production stack

API layer — FastAPI or Hono. Both have native streaming and proper async.
Container — Docker with a slim base image. Pin your Python or Node version.
Host — Fly.io for global, Railway for simplest deploys, Hetzner for cheapest. Per Hetzner's pricing page as of late 2025, a CX22 with 2 vCPU and 4 GB RAM is around €4.51 a month.
Queue — Redis with BullMQ if you live in Node, or RQ if Python. Anything LLM-bound goes through the queue.
Database — Postgres. Managed if you value sleep, self-hosted if you value the $20 a month.
TLS and reverse proxy — Caddy. It does automatic HTTPS in three lines of config.

Why a queue is non-negotiable

An LLM call that takes nine seconds is already brushing against the timeouts many platforms and HTTP clients enforce, and a slower call will simply fail. The pattern: the API accepts the request, drops a job into the queue, returns a job id, and the client either polls or subscribes to a Server-Sent Events stream for the result. This also means your API process stays responsive when one user uploads a 200-page PDF for summarisation.
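The accept, enqueue, return-an-id, poll flow can be sketched with nothing but the standard library. This is a minimal illustration, not a production broker: the in-memory dict and queue stand in for Redis or Postgres, and slow_llm_call, submit_job, and get_job are hypothetical names.

```python
import queue
import threading
import uuid

# In-memory stand-ins for a real broker and result store (assumptions for
# illustration; production would use a persistent broker instead).
jobs: dict[str, dict] = {}
job_queue: "queue.Queue[str | None]" = queue.Queue()


def slow_llm_call(prompt: str) -> str:
    """Hypothetical placeholder for the real model call."""
    return f"summary of: {prompt}"


def submit_job(prompt: str) -> str:
    """What the POST handler does: record the job, enqueue it, return an id."""
    job_id = uuid.uuid4().hex
    jobs[job_id] = {"status": "queued", "prompt": prompt, "result": None}
    job_queue.put(job_id)
    return job_id


def get_job(job_id: str) -> dict:
    """What the job-status endpoint returns to a polling client."""
    job = jobs.get(job_id)
    if job is None:
        return {"status": "not_found"}
    return {"status": job["status"], "result": job["result"]}


def worker() -> None:
    """Background worker: pull job ids off the queue and run the slow call."""
    while True:
        job_id = job_queue.get()
        if job_id is None:  # sentinel value: shut the worker down
            job_queue.task_done()
            break
        job = jobs[job_id]
        job["status"] = "running"
        try:
            job["result"] = slow_llm_call(job["prompt"])
            job["status"] = "done"
        except Exception as exc:
            job["result"] = str(exc)
            job["status"] = "failed"
        finally:
            job_queue.task_done()


if __name__ == "__main__":
    threading.Thread(target=worker, daemon=True).start()
    job_id = submit_job("a very long document")
    job_queue.join()  # a real client would poll get_job instead of blocking
    print(get_job(job_id))
```

The HTTP handler only ever touches submit_job and get_job, so it returns in milliseconds no matter how long the model takes.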

Concrete numbers from a small SaaS

A real example from a side project I run: 312 paying users at $19/month, and an average of 1,400 LLM calls a day routed to OpenAI's GPT-4o-mini and Claude Haiku. Monthly infra bill for October 2025: $11 Hetzner VPS, $9 managed Postgres on Neon, $0 Cloudflare DNS and edge, $4 backups to a Backblaze B2 bucket. Total: $24. AI API spend the same month: $189. The lesson is that infrastructure is rarely the line item that hurts.
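For anyone who wants to plug in their own numbers, the arithmetic is short. The sketch below uses only the figures quoted above and assumes, for simplicity, that every one of the 312 users paid the full $19 that month.

```python
users = 312
price_per_month = 19
infra = {
    "Hetzner VPS": 11,
    "Neon Postgres": 9,
    "Cloudflare DNS and edge": 0,
    "Backblaze B2 backups": 4,
}
ai_spend = 189

revenue = users * price_per_month  # 5928, under the full-price assumption
infra_total = sum(infra.values())  # 24
total_cost = infra_total + ai_spend  # 213

ai_share_pct = round(ai_spend / total_cost * 100, 1)
cost_to_revenue_pct = round(total_cost / revenue * 100, 1)

print(f"infra ${infra_total}, AI ${ai_spend}, total ${total_cost}")
print(f"AI share of costs: {ai_share_pct}%")
print(f"costs as share of revenue: {cost_to_revenue_pct}%")
```

On these figures the model bill is close to nine tenths of total running cost, which is why optimising the VPS price is rarely where the money is.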

Observability you will actually look at

Three things, no more on day one. Structured JSON logs going to a single place — Axiom, Better Stack, or a self-hosted Loki. An uptime check on the health endpoint every 60 seconds. A Slack or email alert when the queue depth crosses some number you guessed once and will tune later. Skip the fancy dashboards until you have a real incident; they only look useful in screenshots.
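A rough sketch of the first and third items, using only Python's standard logging module. The field names, the "fields" convention for extra data, and the threshold of 50 are assumptions; shipping the lines to Axiom, Better Stack, or Loki is left to whatever collector you run.

```python
import json
import logging
import time


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        # Extra fields passed as logger.info(..., extra={"fields": {...}})
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def should_alert(queue_depth: int, threshold: int = 50) -> bool:
    """True when the queue is deeper than the guessed, tunable threshold."""
    return queue_depth > threshold


if __name__ == "__main__":
    logger = logging.getLogger("api")
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.info("job finished", extra={"fields": {"job_id": "abc", "latency_ms": 840}})
```

One JSON object per line is the format most log services ingest without extra parsing rules.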

The three things that will go wrong first

An LLM vendor rate-limits you mid-request. Add exponential backoff and a clear user-facing error.
A user uploads something enormous. Cap input size at the API layer, not after embedding it.
Your container restarts and the queue forgets in-flight jobs. Use a persistent broker (Redis with AOF, or Postgres-based) and idempotent job handlers.
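The first two failures have small, mechanical fixes. Below is a minimal sketch of exponential backoff with jitter plus an input-size check. RateLimited is a hypothetical exception standing in for whatever your provider client raises on HTTP 429, and the 1 MB cap and delay values are placeholder assumptions.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

MAX_INPUT_BYTES = 1_000_000  # assumed cap; pick a number that fits your model


class RateLimited(Exception):
    """Hypothetical error a provider client raises when it is rate-limited."""


class InputTooLarge(Exception):
    """Raised at the API layer, before any expensive work happens."""


def check_input_size(body: bytes, limit: int = MAX_INPUT_BYTES) -> None:
    if len(body) > limit:
        raise InputTooLarge(f"input is {len(body)} bytes, limit is {limit}")


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry fn on RateLimited; the delay ceiling doubles on each attempt."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of retries: let the handler show a clear error
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))  # jitter spreads out retries
    raise AssertionError("unreachable when max_attempts >= 1")
```

The sleep function is a parameter so the retry logic can be tested without waiting.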

Idempotency in one sentence

If a client retries the same request, your handler must produce the same result and not double-charge. Practically: hash the request body together with the customer id (so two users never share a cached response), store the hash with the response, and return the cached response on a repeat hash within some window. This single pattern has saved me more support tickets than any other line of code I have written since 2015.

The LLM gateway pattern, and what Railway's guide gets right

An LLM gateway is a small service that sits between your API and the model providers, centralising API key rotation, request caching, rate limiting per customer, and provider failover. The Railway team's official deploy-an-AI-powered-SaaS guide (docs.railway.com/guides/deploy-ai-saas) and the n1n.ai write-up on production LLM gateways both make the case that this layer pays for itself the first time a provider has an outage or doubles a price. Machine Learning Mastery's architecture roadmap covers the same ground from the Kubernetes angle. If you take one thing from those guides plus this one: do not call the model provider directly from your handler, even on day one. Wrap it. Future you will not have to refactor under fire.
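Wrapping the provider does not have to mean a separate service on day one. The sketch below shows the smallest version I can think of: one class that caches by prompt and fails over between providers in priority order. The providers are plain callables standing in for real SDK calls, and LLMGateway and AllProvidersFailed are hypothetical names; key rotation and per-customer rate limits are left out.

```python
from typing import Callable, Sequence

# A provider is a name plus a function from prompt to completion.
# These are stand-ins for illustration, not real SDK calls.
Provider = tuple[str, Callable[[str], str]]


class AllProvidersFailed(Exception):
    """Raised when every configured provider has errored."""


class LLMGateway:
    def __init__(self, providers: Sequence[Provider]) -> None:
        self.providers = list(providers)
        self.cache: dict[str, str] = {}

    def complete(self, prompt: str) -> str:
        if prompt in self.cache:  # request caching
            return self.cache[prompt]
        errors: list[str] = []
        for name, call in self.providers:  # failover in priority order
            try:
                result = call(prompt)
            except Exception as exc:
                errors.append(f"{name}: {exc}")
                continue
            self.cache[prompt] = result
            return result
        raise AllProvidersFailed("; ".join(errors))
```

Handlers call gateway.complete(prompt) and never import a provider SDK directly, so swapping or reordering providers is a one-line change.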

Frequently asked questions

Do I need Kubernetes?

No. If you are asking, you do not. A single VPS with Docker Compose handles tens of thousands of users for most AI SaaS shapes. Kubernetes starts to repay its operational cost only once you have a multi-team engineering org.

Cloudflare Workers AI or self-host?

Workers AI is genuinely good for short, cacheable, low-context calls at the edge. For anything streaming a long response or running a multi-step agent loop, self-hosted on a regular VM is simpler and more debuggable.

How do I handle secrets?

Environment variables loaded from a sealed file (SOPS or age) or from your platform's built-in secret manager. Never commit a .env file. Rotate the OpenAI key whenever someone leaves the team, every time.
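On the application side, one habit that pairs well with this is failing at startup when a secret is missing, rather than on the first request. A minimal sketch follows; require_env and MissingSecret are hypothetical names, and OPENAI_API_KEY is only an example variable name.

```python
import os


class MissingSecret(RuntimeError):
    """Raised at startup when a required secret is absent or empty."""


def require_env(name: str) -> str:
    """Return the variable's value, or fail loudly if it is unset or blank."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise MissingSecret(f"environment variable {name} is not set")
    return value


if __name__ == "__main__":
    try:
        api_key = require_env("OPENAI_API_KEY")  # example name, read once at boot
        print("secret loaded")
    except MissingSecret as exc:
        print(f"refusing to start: {exc}")
```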

What is the cheapest way to start?

Hetzner CX22 plus Cloudflare proxy plus a free Neon Postgres branch plus Caddy for HTTPS. Total monthly: under $6. You can be live by lunchtime.

Ship the boring version on Friday. Add the second instance the first time you wake up to an alert. Most products never need the third.