AI Tool Pipelines — Automate Your Workflows

Handling Rate Limits and Retries in Complex API Pipelines

6 min read · Updated Jun 4, 2026

Dashboard showing API rate limit metrics and retry queue

Rate-limit handling is what separates an API integration that works in dev from one that survives Black Friday. The right pattern is small: exponential backoff with jitter, idempotency keys on writes, a hard retry cap, and a dead-letter queue for everything past the cap. This guide gives you the working code, the production guardrails, and the one mistake (no idempotency key) that turns a "we recovered gracefully" into "we double-charged 2,400 customers."

Key takeaways

  • Always read Retry-After and the X-RateLimit-* headers before falling back to your own backoff math.
  • Use exponential backoff WITH jitter (random 0–500ms added) — without jitter, all your clients retry in lockstep and re-stampede the API.
  • Cap retries at 5 and total wait at ~60s. Past that, the work belongs in a DLQ + human review, not more retries.
  • On write operations (POST/PUT/DELETE), generate an idempotency key per logical operation and send it with every retry — prevents duplicate charges / sends / writes.
  • Throttle proactively using token-bucket. Reactive retry-on-429 is the fallback, not the strategy.

The 4-layer model

Build these in order. Layer 1 alone catches 90% of failures; you add the others as scale demands.
| Layer | Pattern | Catches |
|---|---|---|
| 1 | Exponential backoff + jitter + Retry-After | Transient 429/5xx, brief network blips |
| 2 | Idempotency keys on writes | Duplicate charges/sends/writes during retries |
| 3 | Proactive throttling (token bucket) | Prevents hitting the limit in the first place |
| 4 | Dead-letter queue + alerting | Permanent failures, surfaces them to humans for a fix |

Layer 1: exponential backoff with jitter

typescript
// fetchWithBackoff.ts — ~40 lines, no deps
type Options = RequestInit & { maxRetries?: number; baseDelayMs?: number };

export async function fetchWithBackoff(url: string, opts: Options = {}) {
  const { maxRetries = 5, baseDelayMs = 500, ...init } = opts;
  let attempt = 0;

  while (true) {
    const res = await fetch(url, init);

    if (res.ok) return res;

    // 408, 429 and 5xx are treated as transient; other 4xx responses fail fast
    const isRetryable = res.status === 408 || res.status === 429 || res.status >= 500;
    if (!isRetryable || attempt >= maxRetries) {
      throw new Error(`${res.status} ${res.statusText} on ${url}`);
    }

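    // 1. Prefer the server’s hint (Retry-After is either seconds or an HTTP date)
    const retryAfter = res.headers.get("retry-after");
    let waitMs: number;
    if (retryAfter) {
      const asNumber = Number(retryAfter);
      const hintedMs = Number.isFinite(asNumber)
        ? asNumber * 1000
        : new Date(retryAfter).getTime() - Date.now();
      // An unparseable header yields NaN: fall back to baseDelayMs.
      // Clamp between 0 and 60s so one response cannot stall the caller.
      waitMs = Number.isNaN(hintedMs)
        ? baseDelayMs
        : Math.min(60_000, Math.max(0, hintedMs));
    } else {
      // 2. Otherwise exponential backoff with jitter, capped at 60s
      const exp = Math.min(60_000, baseDelayMs * 2 ** attempt);
      const jitter = Math.floor(Math.random() * 500);
      waitMs = exp + jitter;
    }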

    await new Promise((r) => setTimeout(r, waitMs));
    attempt++;
  }
}
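
To see what the defaults add up to, here is a small sketch that computes the wait schedule from the same formula as the backoff branch above, with jitter left out. `backoffSchedule` is a hypothetical helper for illustration, not part of fetchWithBackoff.ts.

```typescript
// Hypothetical helper: reproduces the exponential branch of fetchWithBackoff
// (without jitter) so the wait schedule can be inspected.
function backoffSchedule(maxRetries = 5, baseDelayMs = 500, capMs = 60_000): number[] {
  const waits: number[] = [];
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    waits.push(Math.min(capMs, baseDelayMs * 2 ** attempt));
  }
  return waits;
}

const waits = backoffSchedule();
const total = waits.reduce((sum, ms) => sum + ms, 0);
console.log(waits); // [500, 1000, 2000, 4000, 8000]
console.log(total); // 15500
```

With the defaults, five retries wait 15.5 seconds in total before jitter, and jitter adds at most about 2.5 seconds more, so the whole sequence stays well inside a 60-second budget. The 60-second per-wait cap only starts to matter at attempt index 7, where 500 × 2^7 = 64,000 ms.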

Layer 2: idempotency keys (the one that prevents disasters)

When a write request retries, the API has no way to tell "is this a brand-new operation, or attempt 2 of the previous one?" Without an idempotency key, the server processes both. If the first request succeeded but your client timed out before reading the response, your retry charges the customer twice. Stripe’s docs are the canonical reference — always pass Idempotency-Key: <stable UUID> on POSTs, generated once at the start of the logical operation and reused across every retry.

typescript
import { randomUUID } from "node:crypto";
import { fetchWithBackoff } from "./fetchWithBackoff";

async function createCharge(amount: number, customerId: string) {
  // Generate ONCE per logical operation, reused across all retries
  const idempotencyKey = randomUUID();

  return fetchWithBackoff("https://api.stripe.com/v1/charges", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.STRIPE_KEY}`,
      "Idempotency-Key": idempotencyKey,
      "Content-Type": "application/x-www-form-urlencoded",
    },
    body: new URLSearchParams({ amount: String(amount), customer: customerId, currency: "usd" }),
  });
}
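
To make the mechanism concrete, here is a minimal in-memory sketch of what a server can do with the key. It is an illustration under assumptions, not Stripe’s implementation: `IdempotentProcessor` and the handler are hypothetical, and a real service would persist keys in a database with an expiry rather than in a Map.

```typescript
// Hypothetical sketch of server-side deduplication by idempotency key.
class IdempotentProcessor<T> {
  private results = new Map<string, T>();
  private handler: (payload: string) => T;

  constructor(handler: (payload: string) => T) {
    this.handler = handler;
  }

  process(idempotencyKey: string, payload: string): T {
    const cached = this.results.get(idempotencyKey);
    if (cached !== undefined) return cached; // retry: replay the stored result
    const result = this.handler(payload); // first attempt: do the work once
    this.results.set(idempotencyKey, result);
    return result;
  }
}

let chargesMade = 0;
const processor = new IdempotentProcessor<string>((payload) => {
  chargesMade++;
  return `charge_${chargesMade} for ${payload}`;
});

const first = processor.process("key-123", "amount=4200");
const retry = processor.process("key-123", "amount=4200"); // same key, same operation
const other = processor.process("key-456", "amount=4200"); // new key, new operation

console.log(first === retry); // true
console.log(chargesMade); // 2
```

The retry with the same key gets the stored result back and the handler runs only once for that operation. A different key is treated as a new operation, which is why the key has to be generated once per logical operation and not once per HTTP attempt.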

Layer 3: proactive throttling with a token bucket

Retry-on-429 is reactive: you’ve already exceeded the limit and the API is angry. Token-bucket throttling is proactive: you self-pace below the limit. Use a Redis-backed implementation so multiple workers share the budget: @upstash/ratelimit in TypeScript, or redis-cell if you prefer a Redis module. The cost (one Redis round-trip per call, typically around 1ms when Redis sits close to your workers) is negligible compared to the 429s and retries you avoid.
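
The algorithm itself is small. Below is a single-process sketch of a token bucket, assuming a hypothetical `TokenBucket` class with an injectable clock so it can be exercised without real waiting. It only illustrates the pacing logic; with multiple workers the bucket state would need to live in Redis, as described above.

```typescript
// Hypothetical single-process token bucket. Tokens refill continuously at
// `refillPerSec` up to `capacity`; each call spends one token or is told
// how long to wait.
class TokenBucket {
  private tokens: number;
  private lastRefillMs: number;
  private capacity: number;
  private refillPerSec: number;
  private now: () => number;

  constructor(capacity: number, refillPerSec: number, now: () => number = Date.now) {
    this.capacity = capacity;
    this.refillPerSec = refillPerSec;
    this.now = now;
    this.tokens = capacity;
    this.lastRefillMs = now();
  }

  // Returns 0 if a token was taken, otherwise the ms to wait before trying again.
  tryTake(): number {
    const nowMs = this.now();
    const elapsedSec = (nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSec);
    this.lastRefillMs = nowMs;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return 0;
    }
    return Math.ceil(((1 - this.tokens) / this.refillPerSec) * 1000);
  }
}

// Usage with a fake clock: burst of 2, refilling 1 token per second.
let clockMs = 0;
const bucket = new TokenBucket(2, 1, () => clockMs);
console.log(bucket.tryTake()); // 0 (token available)
console.log(bucket.tryTake()); // 0 (token available)
console.log(bucket.tryTake()); // 1000 (bucket empty, wait 1s)
clockMs = 1000;
console.log(bucket.tryTake()); // 0 (one token has refilled)
```

Setting `capacity` and `refillPerSec` slightly below the provider’s published limit leaves headroom for clock drift and for other callers sharing the same API key.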

Layer 4: dead-letter queue

After max retries are exhausted, the work must go somewhere visible — not silently logged and forgotten. Push the failed request (URL, body, headers, error, attempt count, timestamp) onto a DLQ (Postgres table is fine for low volume, SQS / Kafka for high). Have a daily Slack alert with the DLQ count. The DLQ is what turns "we lost 1,200 orders silently" into "we got a Slack ping at 9am, replayed 1,200 orders by lunch."
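
A minimal sketch of that hand-off, under assumptions: `DeadLetter`, `deadLetters` and `sendOrDeadLetter` are hypothetical names, and the in-memory array stands in for the Postgres table or SQS queue mentioned above.

```typescript
// Hypothetical DLQ record; fields follow the list in the paragraph above.
type DeadLetter = {
  url: string;
  body: string | null;
  headers: Record<string, string>; // strip credentials before storing
  error: string;
  attempts: number;
  failedAt: string; // ISO timestamp
};

// In-memory stand-in for a Postgres table or SQS queue.
const deadLetters: DeadLetter[] = [];

// Runs `send` (which is expected to do its own retries). If it still throws,
// the request is recorded on the DLQ instead of being lost, and null is returned.
async function sendOrDeadLetter<T>(
  request: { url: string; body: string | null; headers: Record<string, string> },
  attempts: number,
  send: () => Promise<T>,
): Promise<T | null> {
  try {
    return await send();
  } catch (err) {
    deadLetters.push({
      ...request,
      error: err instanceof Error ? err.message : String(err),
      attempts,
      failedAt: new Date().toISOString(),
    });
    return null;
  }
}
```

In practice `send` would wrap a call such as `fetchWithBackoff(request.url, ...)`, and a scheduled job would count the DLQ rows and post that number to Slack. Storing the idempotency key with the record lets a later replay reuse it safely.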

The story that taught me to ship idempotency keys before launch

April 2024, Tuesday morning. A 12-person Shopify app team had just launched a "post-purchase upsell" flow that charged customers a second time via Stripe based on quiz answers. They’d shipped on Friday. Over the weekend, 2,418 customers got charged the upsell. On Monday morning Stripe’s fraud team flagged the account: 38 customers had been charged twice.

We pulled the logs. The pattern was clear: the upsell endpoint ran on Vercel and occasionally hit Vercel’s 10-second function timeout. The retry logic kicked in (3 retries, no jitter, no idempotency key). Stripe processed the first charge fine, the client timed out reading the response, the client retried, and Stripe processed the retry as a fresh charge. 38 cases where the timing landed exactly wrong. Refund cost: $4,200. Stripe support hours: 11. Customer trust loss: incalculable.

The fix took 4 lines: one randomUUID() at the start of the upsell function, passed as Idempotency-Key on every Stripe POST including retries. Deployed Tuesday afternoon. Zero duplicate charges since (across 47,000 upsells). I now write the idempotency key BEFORE the retry logic on every write integration, full stop. The retry logic is the patient. The idempotency key is the surgeon scrubbing in. You don’t do one without the other.

Why jitter actually matters
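
When many clients fail at the same moment and use the same backoff formula, they all compute the same delay and come back at the same instant, so the API sees the original spike again. Jitter spreads those retries across a window. The sketch below makes that visible under simple assumptions: 100 clients all fail at t = 0, each schedules one retry, and `peakRetryLoad` (a hypothetical helper, not part of the code above) reports the largest number of retries that land in the same millisecond.

```typescript
// Hypothetical simulation: `clients` callers all fail at t = 0 and schedule
// one retry each. Returns the largest number of retries that land in the
// same millisecond, i.e. the size of the worst stampede.
function peakRetryLoad(
  clients: number,
  baseDelayMs: number,
  jitterMs: number,
  rand: () => number = Math.random,
): number {
  const perMs = new Map<number, number>();
  let peak = 0;
  for (let i = 0; i < clients; i++) {
    const jitter = jitterMs > 0 ? Math.floor(rand() * jitterMs) : 0;
    const retryAt = baseDelayMs + jitter;
    const count = (perMs.get(retryAt) ?? 0) + 1;
    perMs.set(retryAt, count);
    peak = Math.max(peak, count);
  }
  return peak;
}

console.log(peakRetryLoad(100, 500, 0)); // 100: every client retries at t = 500ms
console.log(peakRetryLoad(100, 500, 500)); // far smaller; varies per run
```

Without jitter the whole fleet arrives in one millisecond. With 0 to 500ms of jitter the same 100 retries are spread between 500ms and 999ms, so the API sees a trickle instead of a second spike.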

When to RETRY vs when to FAIL FAST

Not every error is worth retrying.
| Status | Retry? | Why |
|---|---|---|
| 429 | YES (with backoff) | Transient rate limit |
| 500, 502, 503, 504 | YES (with backoff) | Transient server-side issue |
| 408 (request timeout) | YES (with backoff + same idempotency key) | Transient |
| 400 (bad request) | NO | Your payload is wrong, will fail again identically |
| 401 (unauthorized) | NO | Credentials wrong, fix and re-deploy |
| 403 (forbidden) | NO | Permission issue, retry won’t change anything |
| 404 (not found) | NO | Resource doesn’t exist |
| 422 (unprocessable) | NO | Validation failed, retry will fail identically |
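
The table translates directly into a small predicate. This is a sketch with a hypothetical name, `shouldRetry`; it is deliberately strict and fails fast on any status the table does not list.

```typescript
// Hypothetical helper mirroring the table above.
function shouldRetry(status: number): boolean {
  switch (status) {
    case 408: // request timeout
    case 429: // rate limited
    case 500:
    case 502:
    case 503:
    case 504:
      return true;
    default:
      // 400, 401, 403, 404, 422 and anything unlisted: fail fast
      return false;
  }
}

console.log(shouldRetry(429)); // true
console.log(shouldRetry(422)); // false
```

Failing fast on unlisted codes is one possible design choice. It is stricter than a blanket `status >= 500` check, which would also retry codes such as 501 (not implemented) that will not fix themselves.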

The opinion I will defend

“Backoff with jitter is the body. Idempotency keys are the spine. Skip the spine and your retries stand for about a week.”

Frequently asked questions

How many retries should I configure?

Five maximum, with a 60-second total wait cap. Past that, the failure is almost never transient and you’re burning latency for nothing. Hand the work to a DLQ + human review instead of a sixth retry.

What’s the difference between Retry-After and X-RateLimit-Reset?

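Retry-After is a standard HTTP header that the API sets when it returns 429 (or 503): an explicit "wait this long before trying again," given either as a number of seconds or as an HTTP date. X-RateLimit-Reset is informational and not standardized. APIs that send it usually include it on every response to tell you when your current rate window resets, but the format (Unix timestamp or seconds remaining) varies by provider, so check the docs. Use Retry-After when present (it’s authoritative for the failed request); use X-RateLimit-Reset proactively to pace future requests.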
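
Proactive pacing from these headers can be sketched as follows. `paceDelayMs` is a hypothetical helper, and it assumes X-RateLimit-Reset is a Unix timestamp in seconds (the format GitHub uses); for an API that sends seconds-until-reset the arithmetic would need adjusting.

```typescript
// Hypothetical helper: how long to pause before the next request so the
// remaining budget is spread evenly over the rest of the window.
// ASSUMPTION: X-RateLimit-Reset is a Unix timestamp in seconds.
function paceDelayMs(
  headers: { get(name: string): string | null },
  nowMs: number = Date.now(),
): number {
  const remainingRaw = headers.get("x-ratelimit-remaining");
  const resetRaw = headers.get("x-ratelimit-reset");
  if (remainingRaw === null || resetRaw === null) return 0; // no information: do not pace

  const remaining = Number(remainingRaw);
  const msUntilReset = Number(resetRaw) * 1000 - nowMs;
  if (!Number.isFinite(remaining) || !Number.isFinite(msUntilReset)) return 0;
  if (remaining < 0 || msUntilReset <= 0) return 0; // window already reset

  // remaining = 0 waits for the full reset; otherwise split the window evenly.
  return Math.ceil(msUntilReset / (remaining + 1));
}

// Example: 9 requests left, window resets 10 seconds from "now".
const exampleHeaders = new Headers({
  "x-ratelimit-remaining": "9",
  "x-ratelimit-reset": "1010",
});
console.log(paceDelayMs(exampleHeaders, 1_000_000)); // 1000
```

This only paces future requests. When a response actually fails with 429, Retry-After still takes priority.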

Should I retry GET requests differently from POST?

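GETs are naturally idempotent, so retry freely with backoff. POST is not: it needs an idempotency key on every retry to prevent duplicate state changes. PUT and DELETE are idempotent by HTTP definition, but real handlers often trigger side effects (emails, webhooks, audit rows), so send a key on those too whenever the API supports one. PATCH depends on whether the patch is itself idempotent (replacing fields = yes; appending to a list = no).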

When should I use a queue instead of inline retries?

Inline retries are fine for synchronous user-facing requests (the user is waiting). For background work, especially webhook fan-out or bulk operations, push to a queue (BullMQ, SQS, Inngest). Queues give you persistence (no work lost on crash), centralised rate-limiting, and natural DLQ behaviour. Threshold: anything that doesn’t need to complete within 5 seconds of the trigger.

Are there library options that already do all this?

Yes, and you should use them when possible: p-retry or axios-retry for Node/TypeScript, and tenacity for Python. The official Stripe and OpenAI SDKs also have built-in retry. The handwritten code above is for teaching and customising; in production you’ll usually configure the SDK and move on.

How do I know my retry logic actually works?

Test it in staging by deliberately injecting 429s and 503s with a tool like Toxiproxy or a simple sidecar proxy. Better yet, add a chaos hook that randomly returns 429 on ~1% of calls in staging. If your retries handle it gracefully there, you’ll be fine in prod — if they don’t, better to find out before customers do.
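
A chaos hook of that kind can be a thin wrapper around fetch. The sketch below uses hypothetical names (`FetchLike`, `withChaos`) and takes the random source as a parameter so the behaviour can be forced in tests. To use it with fetchWithBackoff, that function would need to accept the fetch implementation as an argument; that wiring is not shown here.

```typescript
// Hypothetical chaos wrapper for staging: returns a synthetic 429 on a
// fraction of calls and otherwise delegates to the real fetch.
type FetchLike = (url: string, init?: RequestInit) => Promise<Response>;

function withChaos(
  realFetch: FetchLike,
  failureRate = 0.01,
  rand: () => number = Math.random,
): FetchLike {
  return async (url, init) => {
    if (rand() < failureRate) {
      // Injected failure: looks like a real rate-limit response to the caller
      return new Response("injected rate limit", {
        status: 429,
        headers: { "retry-after": "1" },
      });
    }
    return realFetch(url, init);
  };
}
```

Keep the wrapper behind an environment check so it can never be enabled in production, and log each injected failure so a retry seen in staging can be traced back to the hook.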