Architecture · 12 min read

Rate Limiting Your SaaS API: Patterns That Don't Break at Scale

Most SaaS rate limiting fails in one of two ways: too lax (one customer takes you down) or too aggressive (legitimate users get 429s and churn). Here are the patterns that actually hold up at scale, with implementation specifics.

Krishan K Agarwal

Senior System Architect & Fractional CTO

Updated May 2026 · Published Apr 2026

There are two ways rate limiting fails in production. The first is the 4am incident where one enterprise customer's runaway script issues 50,000 requests/minute and your database falls over because you had no per-tenant cap. The second is the slow-burn churn problem: legitimate power users keep hitting 429s on a tier they pay $500/month for, and they switch to a competitor without telling you why.

Both failures share a root cause: rate limiting was added late, as a single global cap, by a developer who picked the first algorithm they Googled. This guide is the framework I use after rebuilding rate limiting for half a dozen growth-stage SaaS — what algorithm to pick, where to enforce it, and how to layer limits so legitimate traffic gets through and abuse does not.

The four algorithms, ranked by when to actually use them

Pick the algorithm based on what you are protecting against. There is no universal best — each has a clean fit for one or two real-world scenarios.

Algorithm	Best for	Bursts allowed?	Implementation cost
Fixed window	Internal tools, simple per-IP caps	Yes (boundary spikes)	Low (Redis INCR + EXPIRE)
Sliding window log	Hard SLA enforcement, billing-tied limits	No	High (per-request log entries)
Sliding window counter	Most SaaS use cases — good precision, low cost	Smooth	Medium (two windows + math)
Token bucket	Fair-use limits where short bursts are fine	Yes (configurable)	Medium (Lua script in Redis)
Leaky bucket	Outbound traffic shaping to third-party APIs	No (queued)	Medium

Rate limiting algorithm comparison: when each pattern is the right call

Fixed window: simplest, has a known flaw

Count requests in a fixed time bucket (e.g. 'minute starting at 12:00:00'). When the count exceeds the cap, reject. The flaw: a user can fire 100 requests at 12:00:59 and another 100 at 12:01:00 and bypass a 100/minute cap. For internal dashboards or rough abuse prevention, fine. For anything user-visible, move up.
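
To make the boundary flaw concrete, here is a minimal in-memory sketch. The class name FixedWindowLimiter and the key "user:42" are illustrative assumptions, the clock is passed in so the behavior is easy to follow, and a real deployment would keep the counter in shared storage rather than process memory.

```python
from collections import defaultdict


class FixedWindowLimiter:
    """In-memory fixed-window counter (illustrative; not shared across processes)."""

    def __init__(self, limit: int, window_seconds: int) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        # (key, bucket index) -> request count. Old buckets are never evicted
        # in this sketch; Redis would expire them for you.
        self.counts = defaultdict(int)

    def allow(self, key: str, now: float) -> bool:
        # Every timestamp inside the same window maps to the same bucket index.
        bucket = int(now // self.window_seconds)
        if self.counts[(key, bucket)] >= self.limit:
            return False
        self.counts[(key, bucket)] += 1
        return True


limiter = FixedWindowLimiter(limit=100, window_seconds=60)

# 100 requests in the last second of one window, 100 in the first second of the next.
late = sum(limiter.allow("user:42", now=59.0) for _ in range(100))
early = sum(limiter.allow("user:42", now=60.0) for _ in range(100))
accepted_in_two_seconds = late + early  # twice the nominal 100/minute cap
```

Both bursts are accepted because each lands in a different bucket, even though they are one second apart.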

Sliding window counter: the right default

Track two adjacent fixed windows and weight the older one by how far you are into the current one. Smooths out boundary spikes, costs about 2x the storage of fixed window, and gives you near-perfect precision in practice. This is the algorithm I default to for almost every per-user rate limit on a SaaS API.
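
The weighting is easier to see in code. This is a sketch under assumptions: the class name and the injected clock are mine, state lives in a plain dict, and the estimate treats the previous window's requests as evenly spread, which is why the result is an approximation rather than an exact rolling count.

```python
class SlidingWindowCounter:
    """In-memory sliding-window counter (illustrative sketch)."""

    def __init__(self, limit: int, window_seconds: int) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self.counts = {}  # (key, bucket index) -> request count

    def allow(self, key: str, now: float) -> bool:
        bucket = int(now // self.window_seconds)
        # Fraction of the current window that has already elapsed (0.0 to 1.0).
        elapsed = (now % self.window_seconds) / self.window_seconds
        previous = self.counts.get((key, bucket - 1), 0)
        current = self.counts.get((key, bucket), 0)
        # The previous window counts less the further we are into the current one.
        estimated = previous * (1.0 - elapsed) + current
        if estimated + 1 > self.limit:
            return False
        self.counts[(key, bucket)] = current + 1
        return True


limiter = SlidingWindowCounter(limit=100, window_seconds=60)

late = sum(limiter.allow("user:42", now=59.0) for _ in range(100))
# At the window boundary the previous window still carries full weight.
early = sum(limiter.allow("user:42", now=60.0) for _ in range(100))
# Halfway through the next window, half of the previous count has aged out.
midway = sum(limiter.allow("user:42", now=90.0) for _ in range(100))
```

With the same burst that slipped past the fixed window, the second batch at the boundary is rejected, and capacity returns gradually as the old window ages out.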

Token bucket: fair use with bursts

The user has a bucket of N tokens. Each request consumes one. Tokens refill at R per second up to N. Allows bursts up to bucket size, then settles into the refill rate. Excellent for 'we let you spike to 100 requests but sustain 10/sec' semantics. Implementation: a Lua script in Redis that does the bucket math atomically.
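
Here is the bucket math on its own, as a sketch. The dataclass and field names are assumptions for illustration, and the state is held in memory; in production the same arithmetic would run inside the atomic Redis script mentioned above.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    capacity: float           # N: maximum burst size
    refill_per_second: float  # R: sustained rate
    tokens: float             # tokens currently available
    updated_at: float         # timestamp of the last refill

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill lazily based on elapsed time, never exceeding capacity.
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated_at = max(self.updated_at, now)
        if self.tokens < cost:
            return False
        self.tokens -= cost
        return True


# "Spike to 100 requests, sustain 10/sec."
bucket = TokenBucket(capacity=100, refill_per_second=10, tokens=100, updated_at=0.0)

burst = sum(bucket.allow(now=0.0) for _ in range(150))        # bucket drains after 100
next_second = sum(bucket.allow(now=1.0) for _ in range(20))   # one second refills 10 tokens
```

Refilling lazily on each request avoids a background timer: only the token count and the last-update timestamp need to be stored per key.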

Leaky bucket: outbound shaping

Requests enter a queue and drain at a fixed rate. If the queue fills, new requests are dropped. Best for shaping your own outbound calls to a third-party API with strict rate limits (Stripe, Slack, Twilio). Less useful for inbound traffic, where you usually want to reject immediately rather than queue.
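
A minimal sketch of the queue-and-drain behavior for outbound calls. LeakyBucketQueue, submit, and next_job are hypothetical names; the worker that polls next_job and actually calls the third-party API is left out, and timestamps are passed in explicitly.

```python
from collections import deque
from typing import Deque, Optional


class LeakyBucketQueue:
    """Bounded queue that releases at most one job per drain interval."""

    def __init__(self, capacity: int, drain_per_second: float) -> None:
        self.capacity = capacity
        self.interval = 1.0 / drain_per_second
        self.queue: Deque[str] = deque()
        self.next_send_at = 0.0

    def submit(self, job: str) -> bool:
        # A full queue drops the new job instead of growing without bound.
        if len(self.queue) >= self.capacity:
            return False
        self.queue.append(job)
        return True

    def next_job(self, now: float) -> Optional[str]:
        # Release a job only when the next send slot has arrived.
        if not self.queue or now < self.next_send_at:
            return None
        self.next_send_at = now + self.interval
        return self.queue.popleft()


shaper = LeakyBucketQueue(capacity=3, drain_per_second=2)
accepted = [shaper.submit(f"job-{i}") for i in range(5)]
sent = [shaper.next_job(now=t) for t in (0.0, 0.1, 0.5, 0.9, 1.0)]
```

The last two submissions are dropped because the queue is full, and the accepted jobs leave at a steady two per second no matter how quickly they arrived.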

Where to enforce: edge, gateway, or application?

The right answer is 'all three, with different limits.' Each layer protects against a different failure mode and costs a different amount of latency.

Edge (Cloudflare, Fastly, AWS WAF). Protects against bots, scrapers, credential stuffing. Cloudflare includes rate limiting rules in its plans, with the number of rules depending on the tier, so check the current allowance for yours. Adds zero latency on cache hits, drops malicious traffic before it reaches your origin.
API gateway (Kong, AWS API Gateway, Nginx). Enforces per-API-key or per-tenant limits at the entry point. AWS API Gateway has throttling built in (account-level and per-usage-plan limits) at no extra charge beyond normal request pricing.
Application (Redis-based). Enforces per-user fairness, per-feature limits, business-rule caps tied to subscription tier. The most flexible layer because you have full request context.

A typical layered setup: Cloudflare WAF blocks anything obviously malicious (10K req/min from one IP). API gateway enforces per-API-key limits per plan tier. Application-level Redis enforces per-user-per-endpoint limits and tenant fairness. Three layers, three different jobs.

Multi-tier rate limits in practice

Real SaaS APIs need at least three limit dimensions, often five. A single global cap is not rate limiting — it is abdication. Here are the dimensions I configure in nearly every production system.

Per-IP, applied at the edge. Stops obvious abuse and DDoS attempts before they cost you compute. Typical cap: 1,000-5,000 requests/minute per IP for a B2B SaaS.
Per-API-key (or per-user). Enforces fairness between paying customers. Tied to subscription tier. Typical cap: 60-600 requests/minute depending on plan.
Per-tenant (organization). Prevents one large team from starving smaller ones on shared infrastructure. Typical cap: 1,000-10,000 requests/minute per organization.
Per-endpoint. Hot endpoints (search, exports) get tighter caps than cheap ones (status checks). Search at 10/min, status at 600/min, for example.
Per-account-action. For sensitive actions like password reset or invite send, count separately and limit aggressively (5 per hour per user).
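
One way to wire several dimensions together is to derive a list of (key, limit) checks per request and allow the request only if every check passes. The sketch below makes several assumptions: the plan names, the key format, and the Request fields are hypothetical, the caps are picked from the ranges above, and the counts dict stands in for the counters of the current one-minute window (window rotation and the hourly per-account-action limit are omitted).

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    ip: str
    api_key: str
    tenant: str
    endpoint: str


PLAN_LIMITS = {"starter": 60, "pro": 600}          # per-API-key caps (hypothetical plans)
ENDPOINT_LIMITS = {"/search": 10, "/status": 600}  # per-endpoint caps
DEFAULT_ENDPOINT_LIMIT = 600


def limit_checks(req: Request, plan: str) -> list[tuple[str, int]]:
    """Return every (counter key, per-minute cap) that applies to this request."""
    return [
        (f"rl:ip:{req.ip}", 5000),
        (f"rl:key:{req.api_key}", PLAN_LIMITS[plan]),
        (f"rl:tenant:{req.tenant}", 10000),
        (f"rl:endpoint:{req.api_key}:{req.endpoint}",
         ENDPOINT_LIMITS.get(req.endpoint, DEFAULT_ENDPOINT_LIMIT)),
    ]


def allow(req: Request, plan: str, counts: dict[str, int]) -> tuple[bool, str | None]:
    """Allow only if every dimension is under its cap; report the one that tripped."""
    checks = limit_checks(req, plan)
    for key, limit in checks:
        if counts.get(key, 0) >= limit:
            return False, key
    # Count the request against every dimension only once it is accepted.
    for key, _ in checks:
        counts[key] = counts.get(key, 0) + 1
    return True, None


counts: dict[str, int] = {}
search = Request(ip="203.0.113.7", api_key="key_abc", tenant="org_1", endpoint="/search")
status = Request(ip="203.0.113.7", api_key="key_abc", tenant="org_1", endpoint="/status")

search_results = [allow(search, "starter", counts) for _ in range(11)]
status_result = allow(status, "starter", counts)
```

Returning the key that tripped makes it straightforward to log which dimension rejected the request and to tell the client which limit it hit.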

Implementing rate limits in Redis

Most production rate limiters I encounter are Redis-backed. The simple fixed-window pattern is two Redis commands (INCR and EXPIRE). The sliding-window-counter pattern is about 15 lines of Lua executed atomically. Token bucket is a Lua script with three Redis ops.

The trick is to keep all the math inside a single Redis call (Lua script or pipelined transaction) so you do not have race conditions where two requests read the same count, both think they are under the limit, and both proceed. Redis Cloud and Upstash both support EVAL with Lua at no extra cost. Expect a P99 of 1-3 ms in-region for a sliding-window-counter operation.
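
As the smallest illustration of the single-call idea, here is a fixed-window check (not the sliding-window counter) where INCR and EXPIRE run inside one Lua script. This is a sketch under assumptions: allow_request is a hypothetical helper, and client is assumed to be a redis-py style client whose eval method takes the script, the number of keys, and then the keys and arguments.

```python
# Lua runs atomically inside Redis, so no other command can interleave
# between the INCR and the EXPIRE.
FIXED_WINDOW_LUA = """
local count = redis.call('INCR', KEYS[1])
if count == 1 then
  redis.call('EXPIRE', KEYS[1], ARGV[1])
end
return count
"""


def allow_request(client, key: str, limit: int, window_seconds: int) -> bool:
    """Increment the window counter atomically and compare it with the cap.

    `client` is assumed to expose eval(script, numkeys, *keys_and_args),
    as redis-py clients do.
    """
    count = client.eval(FIXED_WINDOW_LUA, 1, key, window_seconds)
    return int(count) <= limit
```

A call might look like allow_request(client, "rl:user:42:60s", limit=60, window_seconds=60). Because the script returns the post-increment count, two concurrent requests can never both observe the same value.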

If you do not want to write the Lua, use a battle-tested library. For Node, ratelimit-redis or @upstash/ratelimit. For Python, limits or python-redis-rate-limiters. For Go, ulule/limiter. These wrap the algorithms and the atomic semantics correctly.

What a good 429 response looks like

A 429 is a contract between your API and a client. Send the right headers and well-behaved clients back off correctly. Skip them and clients hammer your service.

Status 429 Too Many Requests. Not 503, not 403, not 200 with an error body. Status codes are how clients reason about retry behavior.
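Retry-After header. Either a number of seconds or an HTTP date. Retry libraries can honor it (urllib3's Retry, which sits under Python's requests, respects it by default once retries are configured), but not every client does, so test the ones your customers use.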
X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset. These let well-behaved clients pace themselves before hitting the wall.
JSON body with a structured error code. Something like { "error": "rate_limit_exceeded", "retry_after": 30, "limit": 100, "window": "60s" }. Lets client code branch on the error field, not error message text.
Document the per-endpoint limits in your API docs. Stripe, GitHub, and Linear all do this; clients write code expecting it.
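
Putting the checklist together, a framework-agnostic sketch of the response might look like this. The function name and its parameters are assumptions, and so is the choice to send X-RateLimit-Reset as Unix epoch seconds; APIs differ on that, so document whichever convention you pick.

```python
import json


def rate_limited_response(limit: int, window_seconds: int,
                          retry_after: int, reset_epoch: int):
    """Build (status, headers, body) for a rejected request."""
    headers = {
        "Content-Type": "application/json",
        "Retry-After": str(retry_after),        # seconds until a retry is worthwhile
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": "0",
        "X-RateLimit-Reset": str(reset_epoch),  # assumed: Unix epoch seconds
    }
    body = json.dumps({
        "error": "rate_limit_exceeded",
        "retry_after": retry_after,
        "limit": limit,
        "window": f"{window_seconds}s",
    })
    return 429, headers, body


status, headers, body = rate_limited_response(
    limit=100, window_seconds=60, retry_after=30, reset_epoch=1_780_000_030
)
```

Returning a plain tuple keeps the sketch independent of any web framework; adapt it to whatever response object yours expects.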

On the client side, recommend exponential backoff with jitter for retries. A simple formula: sleep(base * 2^attempt + random(0, base)) capped at a maximum. Without jitter, a thousand clients all retry at the same instant and you have rebuilt the original problem.
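
The formula above translates almost directly into code. The function name, the default base of 0.5 seconds, and the 30-second cap are illustrative assumptions; the rng parameter exists only so the jitter source can be swapped out.

```python
import random


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0, rng=random) -> float:
    """Seconds to sleep before retry number `attempt` (0-based).

    Exponential growth plus up to `base` seconds of random jitter,
    capped at `cap` so late retries do not wait unboundedly long.
    """
    return min(cap, base * 2 ** attempt + rng.uniform(0, base))


# Delays for the first five retries; the jitter makes each run differ slightly.
delays = [backoff_delay(attempt) for attempt in range(5)]
```

A client would call time.sleep(backoff_delay(attempt)) between attempts, and should prefer the server's Retry-After value when one is present.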

Real attack scenarios this prevents

Layered rate limiting is not abstract — it stops specific attacks I have seen in production over the last few years.

Credential stuffing. An attacker tries 50K leaked email/password combos against your login endpoint. Cloudflare WAF caps per-IP login attempts at 30/minute and the attack stalls before reaching auth code.
Runaway customer scripts. An enterprise tenant deploys a poorly-written script that calls your search endpoint 10K times/minute. Per-tenant limit caps them at 1K/minute; smaller customers see no impact.
Scraper bots. Headless browsers crawl your public catalog 24/7. Edge rate limits per ASN and per IP make scraping economically unattractive without breaking real users.
Password reset abuse. An attacker triggers password resets for thousands of accounts. A per-account limit of 5 per hour, paired with a per-IP cap on the reset endpoint, keeps the volume low enough that your mailbox provider does not flag you as a spammer.
Webhook storms from a third party. A partner's webhook system goes haywire and floods your endpoint. A per-API-key limit on inbound webhooks throttles them and your queue stays healthy.

Anti-patterns I see in audits

Predictable mistakes that show up across most rate-limiting implementations I review.

Single global rate limit. 1,000 req/sec on the entire API. Useless against the actual abuse patterns and easy to hit on a normal traffic spike.
Returning 200 with an error body instead of 429. Breaks every retry library on earth. Always use the status code.
In-memory rate limiter on a multi-instance deployment. Each instance has its own counter; effective limit is N times what you intended.
No per-endpoint differentiation. Your /search endpoint costs 100x more than /healthcheck — give them different limits.
Counting against the user when the user is not authenticated. Now everyone shares one bucket and a single bot drains it. Pre-auth limits should be per-IP or per-fingerprint.
No allowlist for internal callers. Your monitoring system hits /healthcheck 600 times/minute and starts getting 429s during incidents.
Limits that only protect compute, not your downstream dependencies. Your API survives the burst but your Stripe webhook handler hits the Stripe API ceiling and customers cannot pay.

What to ship this sprint

If you have no rate limiting today, here is the order of operations. Each step ships in a day or less for a typical SaaS.

Turn on Cloudflare WAF rate limiting. Cap per-IP at 5K/minute on /api routes. Included in the Pro plan, deploys in minutes.
Add Redis-based per-user rate limiting on your authenticated endpoints. Use a library; do not roll your own Lua on day one. Default 60 req/minute.
Add per-tenant fair-share limits. Default to 10x your per-user limit. Catches the runaway-script case before it hurts other tenants.
Add tighter limits on three sensitive endpoints: login, password reset, signup. 5-30 attempts per hour per IP and per email.
Document the limits in your API docs. Add Retry-After to 429 responses. Test that retry libraries on the client side respect them.

Where this connects

Rate limiting is one of the three architectural decisions that compound the most as you scale, alongside caching strategy and multi-tenant data isolation. If you have a Redis instance for caching, you already have the infrastructure for rate limiting — and the multi-tenant SaaS architecture guide goes into how per-tenant fairness works at the data layer too.

If your API has been on the wrong end of an abuse incident or you are designing public endpoints from scratch, an architecture audit will find the gaps before someone else does. Three to five days, fixed price, prioritized findings.

Frequently asked questions

What is the simplest rate limiter that works in production?

Redis with INCR + EXPIRE is the classic 20-line implementation that powers a surprising amount of production traffic. INCR a key like rl:user:42:60s, set EXPIRE on first hit, reject when the count exceeds your threshold. P99 of 1-3 ms on managed Redis. Good enough for most B2B SaaS up to high six figures of MRR before you start needing sliding-window precision.

Should I use Cloudflare WAF rate limiting or do it in my application?

Both, and they protect against different attacks. Cloudflare WAF rate limits stop bots and credential stuffing before they cost you compute (rate limiting rules are included in Cloudflare's plans, with the rule count depending on the tier). Application-level rate limits enforce per-user fairness and prevent a single tenant from monopolizing your database. The two layers complement each other; do not pick one.

Token bucket vs sliding window — which should I default to?

Default to token bucket for fair-use limits where short bursts are fine. Default to sliding window where 'no more than 100 calls in any rolling 60 seconds' actually matters: the counter variant is a close approximation at low cost, and the log variant is exact when the limit is tied to an SLA or billing. Fixed window is the simplest to implement but suffers from boundary spikes. Leaky bucket is best for shaping outbound traffic to a third-party API with strict limits.

What should a 429 response look like?

Status 429 Too Many Requests, with three headers minimum: Retry-After (seconds until the client can retry), X-RateLimit-Limit (the cap), and X-RateLimit-Remaining (how many calls are left in the current window). The body should be a small JSON object with a code, message, and the retry timestamp. Clients written by competent engineers will respect Retry-After; clients written by interns will not, which is why you also need server-side enforcement.

How do I rate limit when my SaaS is multi-region?

Two patterns. Per-region limits using region-local Redis are the simplest — a user gets effectively N times the limit if they hit N regions, which is usually fine. Globally consistent limits require a single central Redis or a system like DynamoDB with strong consistency, which adds 30-80 ms of latency. Most SaaS take the per-region approach until a real abuse vector forces them to upgrade.
