API Rate Limited in Production at 3 AM: Building Retry Logic That Actually Works

By Elena Rodriguez · April 27, 2026 · 17 min read

Quick Answer

Naive retry-on-429 is the bug. The pattern that works in production: read `Retry-After` and `X-RateLimit-Reset` headers when present, fall back to exponential backoff with full jitter, cap retries at 3-5, wrap calls in a circuit breaker that fails fast when an upstream is down, route bursty work through a queue (BullMQ, SQS) with concurrency limits, and alert on retry rate, not error count.

The 3 AM Page

It is 3:14 AM. Your phone buzzes. The page summary reads: "Order processing pipeline backed up — 8,400 jobs queued, error rate 47%."

You stumble to your laptop. The Sentry dashboard is full of 429 Too Many Requests from the Stripe webhook handler. You look closer — the actual webhook traffic is normal, but your retry logic, written 18 months ago by someone who has since left, is hammering Stripe with retries every 100ms whenever it sees a 429. Stripe is now rate-limiting your IP entirely.

You manually push a fix. By 4:30 AM the queue drains. By 9 AM your manager wants to know why a third-party rate limit took down your service.

This guide is the answer. It covers the actual mechanics of rate limits, how to read 429 responses correctly, the retry math that survives production load, and the patterns — circuit breakers, queues, idempotency — that make a system resilient instead of fragile.

For broader context on shipping reliable systems, our guide to TypeScript best practices for 2026 covers type-safe error handling patterns, and our API testing tools roundup for 2026 lists the tools to validate retry behavior in CI.

How Rate Limits Actually Work

Three algorithms power the vast majority of production rate limiters. Knowing which one your provider uses determines how you should handle its 429s.

Token Bucket

A bucket holds N tokens. Each request consumes one. The bucket refills at a fixed rate (e.g., 100 tokens per second). When empty, requests fail with 429.

Used by: AWS APIs, OpenAI, most cloud APIs. Allows bursts up to bucket size, smooths to refill rate over time.
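
To make the mechanics concrete, here is a minimal in-memory token bucket sketch; the class and numbers are illustrative, not any provider's actual implementation:

typescript
// Minimal in-memory token bucket; illustrative only, not any provider's code.
class TokenBucket {
  private tokens: number;
  private lastRefill = Date.now();

  constructor(
    private readonly capacity: number,        // burst size
    private readonly refillPerSecond: number  // steady-state rate
  ) {
    this.tokens = capacity;
  }

  tryTake(): boolean {
    const elapsedSec = (Date.now() - this.lastRefill) / 1000;
    this.lastRefill = Date.now();
    // Refill continuously, never above capacity
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSecond);
    if (this.tokens < 1) return false; // the caller responds 429
    this.tokens -= 1;
    return true;
  }
}

const bucket = new TokenBucket(100, 10); // burst of 100, refills at 10/sec
if (!bucket.tryTake()) {
  // respond 429, ideally with a Retry-After hint
}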

Sliding Window

Tracks request count in a moving time window (e.g., last 60 seconds). When count exceeds limit, requests fail.

Used by: GitHub API (5000 requests per hour, sliding), Twitter/X v2 API. More accurate than fixed window but more expensive to compute server-side.

Fixed Window

Counts requests in discrete time windows (e.g., 100 per minute, resetting at the top of each minute). When count exceeds limit, requests fail until the window resets.

Used by: many internal services and budget APIs. Has a "boundary effect": a client can send up to double the limit in a short burst straddling the reset, by exhausting one window just before it ends and the next just after it begins.

Concurrency Limits (Less Common)

Limits the number of in-flight requests rather than rate. Twilio uses this for some endpoints, as do some database APIs.
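
On the client side, a concurrency limit is essentially a semaphore around in-flight requests. A minimal sketch (the helper name is illustrative; libraries like p-limit do the same job):

typescript
// Client-side concurrency cap: at most `limit` calls in flight at once.
function createConcurrencyLimiter(limit: number) {
  let active = 0;
  const waiting: Array<() => void> = [];

  return async function run<T>(task: () => Promise<T>): Promise<T> {
    while (active >= limit) {
      await new Promise<void>((resolve) => waiting.push(resolve));
    }
    active++;
    try {
      return await task();
    } finally {
      active--;
      waiting.shift()?.(); // wake the next queued caller, if any
    }
  };
}

const limited = createConcurrencyLimiter(10);
const res = await limited(() => fetch("https://api.example.com/messages", { method: "POST" }));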

The implementation matters because it changes how you should retry. With a token bucket, waiting briefly often refills enough to succeed. With sliding window, you may need to wait for the entire window to roll forward.

Reading 429 Responses Correctly

The HTTP standard for 429 is RFC 6585, and the Retry-After header is defined in RFC 7231 (now folded into RFC 9110). In practice, providers add their own headers.

Retry-After

The standard. Two valid forms:

  • Seconds: Retry-After: 30
  • HTTP date: Retry-After: Wed, 21 Oct 2026 07:28:00 GMT

Code that handles both:

typescript
// Returns the wait in milliseconds, or null if the header is absent or malformed.
function parseRetryAfter(header: string | null): number | null {
  if (!header) return null;
  // Numeric form: seconds to wait
  const seconds = Number(header);
  if (!Number.isNaN(seconds)) return seconds * 1000;
  // HTTP-date form: absolute time to retry at
  const date = Date.parse(header);
  if (!Number.isNaN(date)) return Math.max(0, date - Date.now());
  return null;
}

Provider-Specific Headers

Beyond Retry-After, providers expose remaining-quota and reset-time headers:

  • OpenAI: x-ratelimit-remaining-requests, x-ratelimit-remaining-tokens, x-ratelimit-reset-requests, x-ratelimit-reset-tokens (the reset values as plain duration strings like 60s or 6m0s).
  • GitHub: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset (Unix timestamp).
  • Stripe: Retry-After only on rate-limit responses.
  • Twilio: Retry-After on overload; concurrent request limits surfaced as 429 with explanatory body.
  • SendGrid: X-RateLimit-Remaining, X-RateLimit-Reset.

Always check the provider's docs and use what they give you. Falling back to your own backoff math is the safety net, not the primary plan.
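
To feed those headers into retry code, a small helper can read whichever hint the provider sent. This sketch assumes the thrown error carries a fetch Response at err.response (adapt the property access to your HTTP client); the retry helper later in this guide calls it as extractRetryAfter:

typescript
// Reads the provider's wait hint off a failed response, in milliseconds.
// Assumes the thrown error carries a fetch Response at err.response;
// adapt the property access to your HTTP client.
function extractRetryAfter(err: unknown): number | null {
  const response: Response | undefined = (err as any)?.response;
  if (!response) return null;

  // Standard header: seconds or HTTP date (parseRetryAfter shown earlier)
  const retryAfter = parseRetryAfter(response.headers.get("retry-after"));
  if (retryAfter !== null) return retryAfter;

  // GitHub-style reset header: Unix timestamp in seconds
  const reset = response.headers.get("x-ratelimit-reset");
  if (reset !== null && !Number.isNaN(Number(reset))) {
    return Math.max(0, Number(reset) * 1000 - Date.now());
  }

  return null;
}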

Exponential Backoff With Full Jitter

When you don't have a Retry-After to honor, the backoff math you want is exponential backoff with full jitter. The canonical reference is the AWS Architecture Blog post Exponential Backoff and Jitter by Marc Brooker.

The naive version (don't use):

typescript
// BAD — no jitter, synchronized retries
const delay = Math.pow(2, attempt) * baseMs;

The full-jitter version (use this):

typescript
// GOOD — full jitter
const delay = Math.random() * Math.min(maxMs, Math.pow(2, attempt) * baseMs);

With baseMs = 100 and maxMs = 30000, and attempt counting from 0, the delays fall in [0,100], [0,200], [0,400], [0,800], [0,1600], and so on, capped at [0,30000]. Distributing across the entire window (rather than [X/2, X]) decorrelates retries across all clients and is the AWS-blessed default.

A complete TypeScript retry helper

typescript
type RetryOptions = {
  maxAttempts?: number;
  baseMs?: number;
  maxMs?: number;
  isRetryable?: (err: unknown) => boolean;
};

async function withRetry<T>(
  fn: () => Promise<T>,
  opts: RetryOptions = {}
): Promise<T> {
  const {
    maxAttempts = 4,
    baseMs = 200,
    maxMs = 30_000,
    isRetryable = defaultIsRetryable,
  } = opts;

  let lastErr: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      if (!isRetryable(err) || attempt === maxAttempts - 1) throw err;

      // extractRetryAfter (sketched in the headers section above) returns the
      // provider's wait hint in milliseconds, or null if none was sent
      const retryAfter = extractRetryAfter(err);
      const backoff = Math.random() * Math.min(maxMs, baseMs * 2 ** attempt);
      const delay = retryAfter ?? backoff;

      await new Promise((r) => setTimeout(r, delay));
    }
  }
  throw lastErr;
}

function defaultIsRetryable(err: unknown): boolean {
  if (!(err instanceof Error)) return false;
  const status = (err as any).status ?? (err as any).response?.status;
  if (status === 429) return true;
  if (status >= 500 && status < 600) return true;
  if ((err as any).code === "ECONNRESET") return true;
  if ((err as any).code === "ETIMEDOUT") return true;
  return false;
}
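
Usage looks like this. The URL is a placeholder, and the thrown error carries the status and Response so defaultIsRetryable and extractRetryAfter can inspect them:

typescript
// Illustrative usage: turn non-2xx fetch responses into retryable errors.
async function fetchJson(url: string): Promise<unknown> {
  return withRetry(async () => {
    const response = await fetch(url);
    if (!response.ok) {
      const err = new Error(`HTTP ${response.status} from ${url}`);
      (err as any).status = response.status; // read by defaultIsRetryable
      (err as any).response = response;      // read by extractRetryAfter
      throw err;
    }
    return response.json();
  }, { maxAttempts: 4, baseMs: 200 });
}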

Python equivalent

python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")

async def with_retry(
    fn: Callable[[], Awaitable[T]],
    max_attempts: int = 4,
    base_ms: int = 200,
    max_ms: int = 30_000,
) -> T:
    last_err: Exception | None = None
    for attempt in range(max_attempts):
        try:
            return await fn()
        except Exception as err:
            last_err = err
            if not is_retryable(err) or attempt == max_attempts - 1:
                raise
            # is_retryable / extract_retry_after mirror the TypeScript helpers above
            retry_after = extract_retry_after(err)
            backoff = random.random() * min(max_ms, base_ms * (2 ** attempt))
            await asyncio.sleep((retry_after or backoff) / 1000)
    assert last_err is not None
    raise last_err

Circuit Breakers: Failing Fast When the Upstream Is Down

Retry logic helps when failures are transient. When an upstream is fully down, retries make things worse. Each retry occupies a worker for 30 seconds (the timeout). Your worker pool fills with stuck retries. Your service can't even respond to unrelated user requests.

A circuit breaker fixes this. It monitors error rate over a window and "opens" — failing all requests immediately — when the rate exceeds a threshold. It periodically tries again ("half-open"), and reverts to "closed" when the upstream recovers.

Node — using opossum

typescript
import CircuitBreaker from "opossum";

const breaker = new CircuitBreaker(callStripe, {
  timeout: 5000,
  errorThresholdPercentage: 50,
  resetTimeout: 30000,
  rollingCountTimeout: 10000,
  rollingCountBuckets: 10,
});

breaker.on("open", () => log.warn("Stripe circuit OPEN"));
breaker.on("halfOpen", () => log.info("Stripe circuit HALF-OPEN"));
breaker.on("close", () => log.info("Stripe circuit CLOSED"));

// Use it
const result = await breaker.fire(payload);

Python — using pybreaker

python
import pybreaker

stripe_breaker = pybreaker.CircuitBreaker(
    fail_max=5,
    reset_timeout=30,
    exclude=[ValidationError],  # caller-side errors should not trip the breaker
)

@stripe_breaker
def call_stripe(payload):
    return stripe.PaymentIntent.create(**payload)

The Hystrix model — popularized by Netflix and described in Microsoft's Cloud Design Patterns — is the conceptual reference if you want to dig deeper.

Queue-Based Architectures: Concurrency Control at the Right Layer

For any non-trivial volume of API calls, the right architecture is a queue with rate-limited workers. Your application enqueues jobs; workers consume them at a controlled rate.

BullMQ for Node

typescript
import { Queue, Worker } from "bullmq";

const queue = new Queue("openai-jobs", { connection: redis });

const worker = new Worker(
  "openai-jobs",
  async (job) => {
    return await withRetry(() => callOpenAI(job.data), {
      isRetryable: (e) => isOpenAIRetryable(e),
    });
  },
  {
    connection: redis,
    concurrency: 4,                    // 4 in-flight requests
    limiter: { max: 50, duration: 60_000 }, // global 50 req/min cap
  }
);

The concurrency and limiter settings give you per-worker concurrency and global rate. You can run multiple worker processes; the limiter is enforced across all of them via Redis.
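
On the producer side, the enqueue call is where the queue-level retry policy lives. These are standard BullMQ job options; the job name, payload, and numbers are illustrative:

typescript
// Producer side: enqueue with queue-level retry policy (values are illustrative).
await queue.add(
  "summarize-document",
  { documentId: "doc_123" },
  {
    attempts: 5,                                    // job-level retries, on top of withRetry's quick ones
    backoff: { type: "exponential", delay: 5_000 }, // minutes-scale backoff between job attempts
    removeOnComplete: 1000,                         // keep only recent completed jobs in Redis
  }
);

The two retry layers have different jobs: withRetry absorbs sub-second blips inside a single job attempt, while attempts and backoff handle outages measured in minutes.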

Python — Celery / Dramatiq / arq

For Python, Celery's rate_limit per task and Dramatiq's middleware-based throttle work similarly. arq has max_jobs per worker. The pattern is the same: control the rate where workers consume, not where producers enqueue.

SQS + Lambda

For serverless, AWS SQS plus Lambda's ReservedConcurrentExecutions is the canonical setup. Set the reserved concurrency to whatever level of parallelism the upstream tolerates. SQS's visibility timeout serves as the implicit retry: if the worker fails to delete the message within the window, SQS redelivers.
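
A sketch of that wiring in AWS CDK, assuming aws-cdk-lib; the handler path, timeouts, and concurrency number are placeholders to adjust for your upstream:

typescript
import { Duration, Stack } from "aws-cdk-lib";
import { Construct } from "constructs";
import * as lambda from "aws-cdk-lib/aws-lambda";
import * as sqs from "aws-cdk-lib/aws-sqs";
import { SqsEventSource } from "aws-cdk-lib/aws-lambda-event-sources";

export class ApiWorkerStack extends Stack {
  constructor(scope: Construct, id: string) {
    super(scope, id);

    const jobs = new sqs.Queue(this, "ApiJobs", {
      visibilityTimeout: Duration.seconds(120), // longer than the worker timeout, so failures redeliver
    });

    const worker = new lambda.Function(this, "ApiWorker", {
      runtime: lambda.Runtime.NODEJS_20_X,
      handler: "index.handler",
      code: lambda.Code.fromAsset("dist/worker"), // placeholder path
      timeout: Duration.seconds(60),
      reservedConcurrentExecutions: 8, // caps in-flight calls to the upstream
    });

    worker.addEventSource(new SqsEventSource(jobs, { batchSize: 1 }));
  }
}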

For the AI-specific case (OpenAI, Anthropic), the recommendations in the OpenAI cookbook on rate limit handling are the gold standard.

Idempotency: Mandatory for Retried POST

Retrying a GET is safe. Retrying a POST is dangerous unless the operation is idempotent.

Generate an idempotency key per logical operation:

typescript
import { randomUUID } from "crypto";

const idempotencyKey = randomUUID();

await withRetry(() =>
  stripe.paymentIntents.create(payload, { idempotencyKey })
);

Stripe stores the idempotency key for 24 hours. The first request executes; subsequent requests with the same key return the original response. This is the difference between a clean retry and a duplicate charge.

OpenAI exposes an Idempotency-Key header. Twilio exposes Idempotency-Token on most write endpoints. SendGrid uses message IDs. Always use them on POSTs that retry.
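
One detail matters: the example above generates the key once per logical operation, but a process restart or a queue redelivery would mint a new UUID. Deriving the key from a stable identifier of your own keeps it consistent across redeliveries. A sketch with an illustrative key scheme:

typescript
// Derive the key from the logical operation (here, an order), not from the
// HTTP attempt, so crash recoveries and queue redeliveries reuse the same key.
function idempotencyKeyFor(orderId: string): string {
  return `charge-order-${orderId}`; // illustrative scheme; any stable, unique string works
}

await withRetry(() =>
  stripe.paymentIntents.create(payload, {
    idempotencyKey: idempotencyKeyFor(order.id), // `order` is whatever your domain object is
  })
);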

For more on the agent-specific case where AI workflows generate the API calls, see why your AI agent loses context and how to fix it — context loss across retries is a particularly nasty interaction.

Provider-Specific Quirks

OpenAI

  • Two limit dimensions: RPM (requests per minute) and TPM (tokens per minute).
  • Headers: x-ratelimit-remaining-requests, x-ratelimit-remaining-tokens, x-ratelimit-reset-requests, x-ratelimit-reset-tokens.
  • Reset format: plain duration strings (60s, 1m, 6h), not ISO 8601.
  • Tier-based limits. Higher usage history unlocks higher limits.
  • Streaming responses count differently. Tokens are billed as generated, not as estimated.

For batch workloads, OpenAI's Batch API gives 50% discounts at the cost of up to 24h latency — the right tool when latency isn't critical.

Stripe

  • 100 read + 100 write requests per second per account.
  • Burst tolerance is high — Stripe absorbs short spikes.
  • Idempotency keys mandatory for safe retry of payment creation.
  • Test mode and live mode have separate limits.

Twilio

  • Concurrency-based. Total in-flight requests per account.
  • Per-phone-number throughput. Long codes ~1 SMS/sec; toll-free and short codes higher.
  • A2P 10DLC has its own throughput rules in the US.

SendGrid

  • IP warmup-based. New IPs throttle to a few hundred sends/day; ramps up over weeks.
  • Account category limits. Marketing vs transactional separate.
  • Burst rejections return 429 with X-RateLimit-Reset.

Always read each provider's API documentation. Generic retry libraries fail when they assume one model.

Monitoring: What to Alert On

Bad alerting causes more pages than bad code. The metrics that matter for retry behavior:

  1. Retry rate per upstream — sudden increases indicate upstream degradation. Alert when > 5x baseline for 5 minutes.
  2. Success-after-retry rate — high (>90%) means retries are working. Low (<50%) means retries aren't helping and you have a deeper issue.
  3. Permanent failures — requests that exhausted all retries. These are user-visible. Alert on absolute rate.
  4. Circuit breaker state — every state change should be logged. Page on "open" lasting > 5 minutes.
  5. Queue depth and age — for queue-based architectures, queue depth growing without recovery is the canonical "we're behind" signal.

In Sentry, attach retry context as tags:

typescript
Sentry.setTag("retry.attempt", String(attempt));
Sentry.setTag("retry.upstream", "openai");
Sentry.setTag("retry.reason", String(status));

In Datadog, emit custom metrics with the same dimensions and build dashboards for the five metrics above.
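
If a metrics client is not already wired up, even a minimal in-process counter keyed by upstream and reason gives you the retry-rate signal; the sketch below is illustrative, and you flush it with whatever Datadog, StatsD, or Prometheus client you already use:

typescript
// Minimal in-process counters for the metrics above, keyed by upstream and reason.
const retryCounts = new Map<string, number>();

export function recordRetry(upstream: string, reason: string): void {
  const key = `${upstream}.${reason}`; // e.g. "openai.429"
  retryCounts.set(key, (retryCounts.get(key) ?? 0) + 1);
}

export function snapshotAndReset(): Record<string, number> {
  const snapshot = Object.fromEntries(retryCounts);
  retryCounts.clear();
  return snapshot; // emit these as counters on your flush interval
}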

What Good Looks Like

A production-grade external API call in 2026 has this shape:

  1. Idempotency key generated once per logical operation.
  2. Wrapped in a circuit breaker to fail fast when upstream is down.
  3. Retries handled by an inner helper with full-jitter exponential backoff and proper Retry-After parsing.
  4. Submitted via a rate-limited queue with concurrency aligned to the upstream's tolerance.
  5. Observability tags for retry attempt, reason, and upstream.
  6. Alerts on retry rate, permanent failures, and circuit state changes — not raw error counts.

That sounds like a lot. In practice, it is 200 lines of code that you write once, abstract behind a client wrapper, and reuse for every external dependency. The night you don't get paged at 3 AM is the night you realize it was worth it.
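
As a sketch of what that wrapper can look like, composing the pieces from this guide (the factory name, generics, and thresholds are illustrative, not a published library):

typescript
// Illustrative composition of the pieces above: breaker on the outside,
// jittered retries inside, idempotency key threaded through.
import CircuitBreaker from "opossum";

function createResilientClient<TReq, TRes>(
  upstream: string,
  call: (req: TReq, idempotencyKey: string) => Promise<TRes>
) {
  const breaker = new CircuitBreaker(
    (req: TReq, key: string) => withRetry(() => call(req, key), { maxAttempts: 3 }),
    { timeout: 10_000, errorThresholdPercentage: 50, resetTimeout: 30_000 }
  );

  breaker.on("open", () => log.warn(`${upstream} circuit OPEN`));

  return (req: TReq, idempotencyKey: string) => breaker.fire(req, idempotencyKey);
}

// Usage: workers pulled from a rate-limited queue call this client.
const createPayment = createResilientClient("stripe", (payload, key) =>
  stripe.paymentIntents.create(payload, { idempotencyKey: key })
);

Every new upstream then gets the breaker, retries, and idempotency handling by construction rather than by code review.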

For more on developer tooling that makes building this kind of resilience faster, see our best AI tools for developers in 2026 — modern AI coding assistants reduce the friction of writing and reviewing the kind of error-handling code that, historically, no one wanted to write.

A Final Sanity Check

Before you ship retry logic, ask:

  • Does it parse Retry-After correctly for both seconds and date forms?
  • Does it use full jitter, not equal jitter?
  • Does it have a maximum attempt count?
  • Does it have a maximum backoff?
  • Does it differentiate retryable from non-retryable errors?
  • Does it carry an idempotency key on POSTs?
  • Is it wrapped in a circuit breaker?
  • Are retries metered by tag in your observability?

If any of those answers is "no," you have homework. If all of them are "yes," you have something that survives 3 AM.


For the wider context on shipping production-grade backend systems, see our pillar guide: [Best AI Tools for Developers in 2026](/blog/best-ai-tools-for-developers-2026).

Key Takeaways

  • Naive "retry on any error" is worse than no retry — it amplifies upstream outages into thundering herds
  • Always parse `Retry-After` (seconds or HTTP date) and provider-specific headers like `X-RateLimit-Reset` before computing your own backoff
  • Exponential backoff with FULL jitter — not "equal jitter" or "decorrelated jitter" — is the AWS-blessed default for retry timing
  • A circuit breaker prevents your service from drowning in 30-second timeouts when an upstream is fully down
  • Queue-based architectures (BullMQ, SQS, Cloud Tasks) let you control concurrency at the worker level rather than per-request
  • Idempotency keys are mandatory for any retried POST — Stripe, Twilio, and OpenAI all support them
  • Alert on the rate of retries and the proportion of failed-after-retry requests, not raw error counts

Frequently Asked Questions

Why is naive retry logic dangerous?

When an upstream service is overloaded, every client retrying immediately on failure adds to the load. This "thundering herd" turns a brief degradation into a sustained outage. Without jitter, all clients retry at the same offsets and create synchronized waves. Without a backoff cap, retries keep escalating long after the upstream needs help. Without a circuit breaker, every request waits a full timeout before failing. Naive retries don't just fail — they extend and deepen failures.

What is exponential backoff with full jitter?

Exponential backoff doubles the wait time between retries (1s, 2s, 4s, 8s). Full jitter, recommended in [AWS's seminal blog post](https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/), randomizes within the entire backoff window: `sleep = random(0, base * 2^attempt)`. This decorrelates retries across clients and avoids the synchronized waves that "equal jitter" still produces. It is the simplest, most reliable choice for general use.

How do I parse the Retry-After header correctly?

Retry-After can be either a number of seconds (`Retry-After: 30`) or an HTTP date (`Retry-After: Wed, 21 Oct 2026 07:28:00 GMT`). Both forms are valid per RFC 7231. Code should handle both: parse the value, try Number() first, fall back to Date.parse(). When the value is malformed or missing, fall back to your own exponential backoff. Never block on a Retry-After value over 10-15 minutes — fail the job and surface the error.

When should I use a circuit breaker?

Use one for any external dependency that can become fully unavailable. Without a breaker, your worker pool fills with requests waiting on 30-second timeouts, your latency spikes, and your service degrades for ALL users — even ones whose requests don't touch the broken upstream. A circuit breaker fails fast after a configurable error threshold (e.g., 50% errors in a 1-minute window) and rechecks periodically with half-open probes. `opossum` for Node and `pybreaker` for Python are the standards.

Should I retry inside the API client or at the queue worker?

Both, with different responsibilities. Inside the client: handle transient network errors, single-shot 429s, and 5xx spikes with 2-3 quick retries. At the queue level: handle persistent failures, back off over minutes/hours, alert on stuck jobs, and provide visibility into stuck work. This two-layer approach prevents queue retries from masking simple network blips while still catching genuine outages.

What is an idempotency key and why does retry require one?

An idempotency key is a unique, client-generated identifier that the server uses to deduplicate requests. If your "create payment" request fails after the server processed it but before it sent the response, retrying without an idempotency key creates a duplicate charge. With one, the server returns the original response and skips re-execution. Stripe, Twilio, OpenAI, and most modern APIs support this. Use a UUID per logical operation, not per HTTP request.

How do OpenAI, Stripe, and Twilio rate limits differ?

OpenAI uses a token-bucket-per-organization model with separate limits for requests-per-minute and tokens-per-minute, surfaced via `x-ratelimit-remaining-*` and `x-ratelimit-reset-*` headers. Stripe enforces 100 read and 100 write requests per second per account, returning `Retry-After` on bursts. Twilio uses concurrency limits per account and per phone number, plus a global queue depth limit. SendGrid has IP-warmup-based limits for new accounts. Always read each provider's specific docs.

How do I monitor retry behavior in production?

Track four metrics: (1) retry rate per upstream — sudden increases indicate degradation; (2) success-after-retry percentage — high values mean retries are doing their job; (3) requests permanently failed after retries — these are user-visible failures; (4) circuit breaker state changes. In Sentry/Datadog, attach `retry.attempt` and `retry.reason` tags. Alert on rate of permanent failures, not raw error counts.

About the Author

Elena Rodriguez

Full-Stack Developer & Web3 Architect

BS Software Engineering, Stanford | Former Lead Engineer at Coinbase

Elena Rodriguez is a full-stack developer and Web3 architect with seven years of experience building decentralized applications. She holds a BS in Software Engineering from Stanford University and has worked at companies ranging from early-stage startups to major tech firms including Coinbase, where she led the frontend engineering team for their NFT marketplace. Elena is a core contributor to several open-source Web3 libraries and has built dApps that collectively serve over 500,000 monthly active users. She specializes in React, Next.js, Solidity, and Rust, and is particularly passionate about creating intuitive user experiences that make Web3 technology accessible to mainstream audiences. Elena also mentors aspiring developers through Women Who Code and teaches a popular Web3 development bootcamp.