Add Authentication and Rate Limiting to Your AI Micro-SaaS API
Add auth to your AI API by issuing scoped API keys, storing only their hashes, and checking the hash on every request. Add rate limiting by counting requests per key in Redis with a sliding window, and rejecting with HTTP 429 plus a Retry-After header when the count exceeds the limit. Both pieces fit in under 150 lines of code in any modern stack.
Three terms in one line each
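- API key: a long random string the client sends in the Authorization header.
- Hash: a one-way function. Use a slow, salted hash such as bcrypt or argon2id rather than a bare fast hash like SHA-256, so that a leaked database does not hand an attacker something cheap to brute-force.
- Sliding window rate limit: counts requests over a rolling time window instead of resetting the counter at the top of each minute.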
The bill I do not want to repeat
August 2024. I had a small open beta for an AI text-cleaning API. I had not put a rate limit on the free tier because, surely, nobody would abuse a beta with 12 users. A scraper found the endpoint on a Wednesday at 03:14 and ran it 47,200 times before my AWS billing alert fired at 09:00. The OpenAI bill for those six hours: $312. I emailed support, they were kind, the charge stuck. The fix was a Redis-backed sliding window limiter and an API key requirement. Twenty minutes of code. I now write the limiter before I write the endpoint.
The opinion
Every public AI endpoint should require an API key from day one, even in beta, even free. The mechanism: open endpoints get scraped within hours, and because every abusive call to an LLM endpoint is billed upstream, the cost of an abused LLM endpoint can dwarf the cost of an ordinary abused endpoint by something like two orders of magnitude. If you are wrong, a single afternoon of abuse can cost you a month of burn. Relax this rule only for endpoints that do not touch an LLM at all.
Auth, the minimum that actually holds
- Generate keys with 32 bytes of cryptographic randomness, prefix them with something readable like sk_live_ so they are findable in logs, and show them to the user exactly once at creation.
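- Store only the bcrypt or argon2id hash in your database, never the plaintext. On each request, verify the incoming key against the stored hash with the library's verify function. These hashes are salted, so you cannot re-hash the incoming key and search for a matching row; the verify call re-derives the hash with the stored salt and compares in constant time.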
- Attach a key_id (a short, non-sensitive identifier embedded in the prefix or returned separately) so you can look up the right hash without scanning the whole table on every request. This single design choice is the difference between a lookup that takes microseconds and one that does not scale.
- Add scopes: read, write, admin. Default to the narrowest. Most API keys leak through screenshots and chat logs; scoped keys make leaks survivable.
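The four bullets above can be sketched in a few lines. This is a minimal illustration, not a drop-in implementation: it uses only the Python standard library, so scrypt stands in for bcrypt or argon2id (swap in a maintained library for either in real code), and KEY_TABLE, create_key, and verify_key are hypothetical names with a dict standing in for the database.

```python
import hashlib
import hmac
import secrets

PREFIX = "sk_live_"
KEY_TABLE = {}  # key_id -> row; in-memory stand-in for a database table (assumption)


def slow_hash(secret: str, salt: bytes) -> bytes:
    # scrypt is a stdlib stand-in for bcrypt/argon2id in this sketch.
    return hashlib.scrypt(secret.encode(), salt=salt, n=2**14, r=8, p=1, dklen=32)


def create_key(scopes=("read",)) -> str:
    """Create a key, store only its salted hash, and return the plaintext once."""
    key_id = secrets.token_hex(4)    # short, non-sensitive lookup identifier
    secret = secrets.token_hex(32)   # 32 bytes of cryptographic randomness
    salt = secrets.token_bytes(16)
    KEY_TABLE[key_id] = {
        "salt": salt,
        "hash": slow_hash(secret, salt),
        "scopes": set(scopes),       # default is the narrowest scope
        "revoked": False,
    }
    return f"{PREFIX}{key_id}_{secret}"


def verify_key(presented: str, required_scope: str) -> bool:
    """Look up the row by key_id, then verify the secret and the scope."""
    if not presented.startswith(PREFIX):
        return False
    key_id, sep, secret = presented[len(PREFIX):].partition("_")
    row = KEY_TABLE.get(key_id)      # one indexed lookup, no table scan
    if not sep or row is None or row["revoked"]:
        return False
    candidate = slow_hash(secret, row["salt"])
    if not hmac.compare_digest(candidate, row["hash"]):  # constant-time compare
        return False
    return required_scope in row["scopes"]
```

Embedding the key_id in the key itself is one way to get the cheap lookup; returning it separately works as well. The plaintext exists only in the return value of create_key, which matches the "show it exactly once" rule.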
Rate limiting, the sliding window
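Use Redis with a single ZSET (sorted set) per API key. On every request, add a member with the current Unix millisecond timestamp as the score, remove members older than the window, then ZCARD to get the count. Give each member a unique value, for example the timestamp plus a random suffix, so two requests in the same millisecond do not collapse into one entry, and run the commands in a MULTI/EXEC transaction or a Lua script so concurrent requests see a consistent count. If the count exceeds the limit, return 429 with Retry-After set to the time until the oldest in-window request ages out. The whole thing is five Redis commands and runs in under a millisecond on a local Redis.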
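To make the window logic concrete without needing a running Redis, here is a minimal in-memory sketch of the same idea. It is an assumption-laden stand-in, not the Redis version: a deque per key plays the role of the ZSET, the class and method names are hypothetical, and the caller passes the clock in so the behavior is easy to test. One deliberate difference: this sketch checks the count before recording the request, so rejected attempts do not extend the block.

```python
import math
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """In-memory stand-in for one Redis ZSET of timestamps per API key."""

    def __init__(self, limit: int, window_ms: int):
        self.limit = limit
        self.window_ms = window_ms
        self._hits = defaultdict(deque)  # key_id -> request timestamps, oldest first

    def check(self, key_id: str, now_ms: int):
        """Return (allowed, retry_after_seconds) for a request at now_ms."""
        hits = self._hits[key_id]
        cutoff = now_ms - self.window_ms
        # Equivalent of ZREMRANGEBYSCORE: drop entries that left the window.
        while hits and hits[0] <= cutoff:
            hits.popleft()
        # Equivalent of ZCARD: how many requests are still in the window?
        if len(hits) >= self.limit:
            # Time until the oldest in-window request ages out, rounded up.
            retry_after_ms = hits[0] + self.window_ms - now_ms
            return False, math.ceil(retry_after_ms / 1000)
        # Equivalent of ZADD: record this request.
        hits.append(now_ms)
        return True, 0
```

With limit=3 and a 60-second window, a fourth request three seconds after the first is rejected with a retry-after of 57 seconds, which is exactly when the first request leaves the window.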
Tiered limits that match real usage
- Free tier: 60 requests per minute, 1,000 per day. Enough to evaluate, not enough to scrape.
- Paid tier: 600 requests per minute, no daily cap. Most paying users never approach this.
- Burst budget: allow 20% over the per-minute limit for a 10-second window. Smooths out legitimate spikes without giving up the ceiling.
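The tiers above translate naturally into a small config table. The sketch below is one possible shape, with hypothetical names (Tier, TIERS, burst_ceiling), and it reads the burst budget as "the per-minute limit plus 20%, applied over a 10-second window", which is an interpretation rather than the only one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tier:
    per_minute: int
    per_day: Optional[int]    # None means no daily cap
    burst_percent: int = 20   # allowed overshoot of the per-minute limit
    burst_window_s: int = 10  # how long the overshoot may last

    @property
    def burst_ceiling(self) -> int:
        # Per-minute limit plus the burst allowance, in whole requests.
        return self.per_minute + self.per_minute * self.burst_percent // 100


TIERS = {
    "free": Tier(per_minute=60, per_day=1000),
    "paid": Tier(per_minute=600, per_day=None),
}
```

Under this reading the free tier tops out at 72 requests per minute during a burst and the paid tier at 720.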
What 429 should look like
Return JSON with code, message, and a retry_after field in seconds. Set the Retry-After HTTP header to the same value. RFC 6585 defines the 429 status code and allows a Retry-After header on the response, and many HTTP client libraries honor that header in their automatic retry logic. Without it, even well-behaved clients have to guess when to retry, and many will hammer you anyway.
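A framework-neutral sketch of that response, assuming a hypothetical helper named too_many_requests that returns a (status, headers, body) tuple your web framework can turn into a real response. The "rate_limited" code string and the message wording are placeholders, not a standard.

```python
import json


def too_many_requests(retry_after_s: int):
    """Build the status, headers, and JSON body for a rate-limited request."""
    body = {
        "code": "rate_limited",  # placeholder error code (assumption)
        "message": f"Rate limit exceeded. Retry in {retry_after_s} seconds.",
        "retry_after": retry_after_s,
    }
    headers = {
        "Content-Type": "application/json",
        "Retry-After": str(retry_after_s),  # same value as the body field
    }
    return 429, headers, json.dumps(body)
```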
The audit log you will be glad you wrote
Every key creation, every key rotation, every key revocation, every 429 event. Write them to a structured log with the key_id, the actor, the IP, and the timestamp. The first time you have a customer claim their key was misused, you will read this log for thirty seconds and have an answer.
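One way to keep those entries consistent is a single function that emits one JSON line per event. The sketch below is illustrative only: the event names, the field names beyond key_id, actor, and IP, and the audit_line helper are all assumptions.

```python
import json
from datetime import datetime, timezone

# Hypothetical event names covering the four cases listed above.
ALLOWED_EVENTS = {"key.created", "key.rotated", "key.revoked", "rate_limit.exceeded"}


def audit_line(event: str, key_id: str, actor: str, ip: str, now=None) -> str:
    """Return one structured log line; never include the key's secret."""
    if event not in ALLOWED_EVENTS:
        raise ValueError(f"unknown audit event: {event}")
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    record = {"ts": timestamp, "event": event, "key_id": key_id, "actor": actor, "ip": ip}
    return json.dumps(record, sort_keys=True)
```

Because each line is self-describing JSON, answering "who did what with this key" becomes a filter on key_id rather than a grep through free text.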
Sliding window vs token bucket, and the packages worth using
The four algorithms you will see in production are fixed window, sliding window, token bucket, and leaky bucket. The c-sharpcorner deep-dive on Node.js rate limiting and the HttpStatus.com strategy guide both arrive at the same recommendation: sliding window for fairness across burst traffic, token bucket when you explicitly want to allow short bursts above the average rate. In Express the practical stack is the express-rate-limit package with a Redis-backed store such as rate-limit-redis (Upstash is one hosted Redis option), so the counter survives across processes, plus helmet for security headers. Gitanjali's api-rate-limiter repo on GitHub shows the full Express + MongoDB + JWT + Upstash combination at ten requests per minute, which is a sensible reference build. Whichever algorithm you pick, return 429 with a Retry-After header. Client SDKs that retry automatically can then back off correctly, and your incident channel will be quieter for it.
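For contrast with the sliding window, here is a minimal token bucket sketch. It is a generic textbook version, not taken from any of the packages named above; the class name and parameters are hypothetical, and the clock is passed in by the caller.

```python
class TokenBucket:
    """Allows bursts up to `capacity`, then a steady `refill_per_s` rate."""

    def __init__(self, capacity: int, refill_per_s: float, now_s: float = 0.0):
        self.capacity = capacity
        self.refill_per_s = refill_per_s
        self.tokens = float(capacity)  # start full, so an initial burst is allowed
        self.last_s = now_s

    def allow(self, now_s: float) -> bool:
        # Refill in proportion to elapsed time, capped at the bucket size.
        elapsed = max(0.0, now_s - self.last_s)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_s)
        self.last_s = now_s
        if self.tokens >= 1.0:
            self.tokens -= 1.0  # spend one token for this request
            return True
        return False
```

The capacity is the burst you are willing to absorb and the refill rate is the long-run average, which is why this algorithm suits "short bursts above the average rate" while the sliding window enforces a hard count per window.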
Frequently asked questions
JWT or API keys?
API keys for machine-to-machine. JWTs are useful for user sessions in a frontend; they are overkill and harder to revoke for server-to-server. Most AI APIs are server-to-server. Use the simpler tool.
Can I use a SaaS for this?
Yes. Kong, Tyk, Cloudflare API Shield, and Unkey all offer key management and rate limiting as a service. Unkey in particular is built for this exact use case and the free tier covers most small SaaS shapes. If your throughput is high or you want one less vendor, the DIY version in Redis is straightforward.
What about per-IP limits?
Add them as a second layer in front of the per-key limit. Per-IP catches the case where someone uses many free keys from a botnet; per-key catches the case where one honest key is being misused. Both are cheap and they compose.
How do I rotate a leaked key?
Mark it revoked in the database, return 401 on the next request, and email the owner with a one-click link to create a replacement. Build the rotation flow before you need it. The day a key leaks publicly is not the day to design the recovery UX.
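The core of that flow is small. The sketch below is a hypothetical outline with an in-memory dict in place of the database; the function names are made up for illustration, and the email with the one-click link is left out because it depends on your mail provider.

```python
import secrets

KEYS = {}  # key_id -> {"owner": str, "revoked": bool}; stand-in for a database table


def issue_key(owner: str) -> str:
    """Register a new key for an owner and return its key_id."""
    key_id = secrets.token_hex(4)
    KEYS[key_id] = {"owner": owner, "revoked": False}
    return key_id


def status_for(key_id: str) -> int:
    """HTTP status the auth layer would return for a request with this key."""
    row = KEYS.get(key_id)
    if row is None or row["revoked"]:
        return 401
    return 200


def rotate(key_id: str) -> str:
    """Revoke the leaked key and issue a replacement for the same owner."""
    row = KEYS[key_id]
    row["revoked"] = True  # the very next request with the old key gets 401
    return issue_key(row["owner"])
```

Revoking first and issuing second means there is never a moment when the leaked key and its replacement are both valid.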
Twenty minutes of auth and limiting before launch saves the four-figure invoice later. Write the limiter first. Write the endpoint second.