Skip to main content
aibizhub

Pillar Guide · 12 min · 6 citations

Prompt Caching ROI: When It Pays Off, When It Does

Anthropic prompt caching mechanics, hit-rate sensitivity, and the break-even derivation. Worked example: 61-84% input cost reduction on a docs assistant.

By Orbyd Editorial · Published May 8, 2026

Education · General business information, not legal, tax, or financial advice. Editorial standards Sponsor disclosure Corrections

TL;DR

Prompt caching is an Anthropic and OpenAI API feature that lets you reuse a prefix of input tokens at a discounted rate, with a separate (higher) cost on the first write. Caching pays off when the same prefix is reused often enough that the cumulative read savings exceed the one-time write premium plus the eviction risk. The break-even hit count is approximately 2-3 reads at Anthropic's 90% read discount and 25% write premium.

Caching always pays off for stable system prompts above 1,024 tokens (Claude minimum) reused across requests within five minutes. Caching never pays off for one-shot prompts or for prefixes that change per-request. The middle case (prefixes reused 2-5 times per TTL window) is where the math matters and where most production teams either over-cache or under-cache.

Anthropic introduced prompt caching in August 2024 and OpenAI matched in October 2024. The mechanics are simple, the pricing is more complex, and the break-even math is rarely written down. This article covers the math, the hit-rate sensitivity, and the workload shapes where caching wins or loses.

1. What prompt caching actually is

Prompt caching is a server-side cache of input-token activations keyed by a prefix hash. When you send a request with a marked cacheable prefix, the provider checks whether that exact prefix has been seen recently. If yes (a hit), the cached activations are reused and you are charged a discounted rate for those tokens. If no (a miss), the activations are computed and cached, and you are charged a premium rate for those tokens to cover the write.

The key parameters at Anthropic[1]:

  • Minimum prefix size: 1,024 tokens for Sonnet and Opus, 2,048 tokens for Haiku. Below this, caching is not eligible.
  • TTL: 5 minutes default, extendable to 1 hour on Anthropic with a higher write multiplier. Cache entries are evicted on TTL or under memory pressure.
  • Granularity: Caching is prefix-only. Anything before the cache breakpoint is cached; anything after is not. Order matters: stable content first, variable content last.
  • Pricing: Cache writes cost 1.25x base input rate. Cache reads cost 0.10x base input rate (a 90% discount).

OpenAI's caching is similar but simpler: cached input tokens are billed at 50% of the base rate, with no separate write premium and no explicit TTL exposed to the developer[3]. Google Gemini supports explicit context caching with a per-token storage charge billed by the hour[6]. The break-even math below uses Anthropic's pricing because the trade-off is more visible there; the same logic applies at OpenAI with smaller magnitudes.

2. Pricing mechanics: write, read, miss

Anthropic's published rates as of access date[2]:

Model              Base input   Cache write   Cache read   Output
Claude Opus 4      $15.00/M     $18.75/M      $1.50/M      $75.00/M
Claude Sonnet 4    $3.00/M      $3.75/M       $0.30/M      $15.00/M
Claude Haiku 3.5   $0.80/M      $1.00/M       $0.08/M      $4.00/M

Three operations occur on a request with caching enabled:

  • Cache miss with write: The marked prefix is not in cache. You pay the cache-write rate (1.25x base) for those tokens, and the prefix is stored.
  • Cache hit: The marked prefix is in cache. You pay the cache-read rate (0.10x base) for those tokens.
  • Cache miss without write: Same as a regular API call with no caching. You pay the base input rate.

The 1.25x write premium and 0.10x read discount are the two numbers that drive the break-even calculation. The 0.10x read rate is what makes caching attractive; the 1.25x write rate is what creates the break-even threshold.

3. Break-even derivation

Let P be the cost of a regular (uncached) input. Let W be the write cost (1.25P). Let R be the read cost (0.10P). For N total requests with cache enabled, where the first request is always a write and the remaining N-1 may be hits or misses, hit-rate H gives expected cost:

Cost(cached) = W + (N-1) * (H * R + (1-H) * W)

The uncached baseline is simply Cost(uncached) = N * P. Caching pays off when Cost(cached) < Cost(uncached), which simplifies to:

1.25 + (N-1) * (0.10H + 1.25(1-H)) < N

Solving for the break-even hit count at H = 1.0 (perfect hits, no eviction):

1.25 + (N-1) * 0.10 < N → N > 1.25 / 0.90 ≈ 1.39

At H = 1.0 (no eviction during the TTL window), caching wins after the second request. The first request pays a 25% premium; the second request and beyond pay a 90% discount. Two reads against one write covers the premium with margin to spare.

At H = 0.5 (half of subsequent requests fall outside the TTL and require a re-write), break-even moves to roughly N = 5. Below 50% hit rate, caching becomes a tax rather than a discount because the write premium is paid repeatedly without enough reads to amortize it.

4. Hit-rate sensitivity table

The break-even hit count for different hit rates and prefix sizes, assuming Claude Sonnet pricing and a stable workload pattern:

Hit rate H    Break-even N    Net savings at N=10    Net savings at N=100
1.0 (ideal)   2 requests      88% reduction          90% reduction
0.9           2 requests      80% reduction          81% reduction
0.7           3 requests      62% reduction          63% reduction
0.5           5 requests      45% reduction          45% reduction
0.3           18+ requests    27% reduction          27% reduction
0.1           never           +5% surcharge          +5% surcharge

The takeaway. Above 50% hit rate, caching always wins for any workload of 5+ requests. Below 30% hit rate, caching is a net loss because the write premium is paid more often than the read discount earns it back. Most production workloads with a stable system prompt sit between 70-95% hit rate; most workloads with per-user state below the cache breakpoint sit below 30%.

Hit rate is reported in the Anthropic API response under usage.cache_read_input_tokens and usage.cache_creation_input_tokens[5]. Aggregate these per route weekly to know your actual hit rate; do not rely on intuition about cache effectiveness.

5. Which workload shapes win

Caching wins decisively for these workload shapes:

  • RAG with stable corpus. A retrieval-augmented assistant where 80% of the prompt is a fixed system prompt + retrieved documents that change slowly. Cache the system prompt + the top-K retrieved chunks per session; expect 70-90% hit rate within a session.
  • Few-shot examples. A classification or extraction prompt with 5-20 stable few-shot examples followed by the variable input. Cache the few-shot block; expect 95%+ hit rate as long as the examples do not change.
  • Long context coding. A coding assistant where the user's repo (or a relevant subset) is in context. Cache the repo content; expect 80%+ hit rate within a session, dropping to 0% when the user starts a new session after 5 minutes.
  • Document Q&A with a single source. Cache the document; ask multiple questions against it within the TTL window. Hit rate approaches 100% for the question-asking phase.

Caching loses or breaks even for these shapes:

  • One-shot generation. A user submits a single prompt and never returns within the TTL. The write premium is paid; no read discount is earned. Net loss of 25%.
  • Highly personalized prefixes. The system prompt includes per-user state above the cache breakpoint. Cache writes happen per user, hit rate is 0% across users, only sequential requests by the same user hit. Marginal value at best.
  • Prefixes shorter than the minimum. Below 1,024 tokens (Sonnet) or 2,048 tokens (Haiku), caching is not eligible. The system silently treats the request as uncached.
  • Workloads with TTL-aligned bursts. A nightly batch job where a 50,000-token system prompt is reused across 1,000 requests in a 30-minute window. Hit rate is high but the workload pattern fits the 1-hour TTL extension better; the 5-minute default forces re-writes.

6. Caching anti-patterns

The four most common caching mistakes in production:

  • Caching the wrong layer. Marking the entire prompt as cacheable, including the user input. Anthropic only caches the prefix up to the breakpoint; everything before the breakpoint must be byte-identical to hit. If the user input is in the prefix, hit rate drops to 0%.
  • Caching too small a prefix. Marking a 500-token prefix on Sonnet (which requires 1,024 tokens minimum). The marker is silently ignored, no caching occurs, and the developer thinks caching is enabled.
  • Caching variable content. Putting today's date, a session ID, or a request ID near the start of the prompt. Even one variable byte invalidates the prefix hash. Move all variable content after the cache breakpoint.
  • Caching across model versions. Cache entries are keyed per model. Switching from claude-3-5-sonnet-20240620 to claude-sonnet-4-20250514 mid-rollout invalidates all cached entries and requires re-writes.

Anthropic's engineering team published real-world hit-rate measurements in the prompt caching cookbook[4]. Production teams that monitor cache hit rate and tune the breakpoint placement see hit rates in the 80-95% range; teams that enable caching once and never measure see hit rates between 30-60% with corresponding lost savings.

7. Worked example: a docs assistant

A solo founder builds a documentation assistant. The product loads a 25,000-token system prompt (instructions + 10 few-shot examples + retrieved doc snippets) and answers user questions in 200-token outputs. The product handles 50,000 user queries per month, with average session length of 4 questions and average inter-question gap of 90 seconds.

Without caching on Claude Sonnet 4:

Cost = 50,000 queries * 25,000 input tokens * $3/M = $3,750/month input

With caching enabled on the system prompt (the 25,000-token prefix is identical across queries within a session, but the next session in 5+ minutes is a fresh write):

  • Each session of 4 queries has 1 write + 3 hits.
  • 50,000 queries / 4 queries-per-session = 12,500 sessions/month.
  • Writes: 12,500 * 25,000 tokens * $3.75/M = $1,172/month
  • Reads: 37,500 * 25,000 tokens * $0.30/M = $281/month
  • Total input cost with caching: $1,453/month
  • Savings vs uncached baseline: 61% ($2,297/month)

The hit rate in this example is 75% (3 hits out of 4 requests per session). Sessions that bunch closer together (all 4 queries inside the 5-minute TTL) achieve 100% hit rate. Sessions that span TTL boundaries pay extra writes and drop hit rate accordingly.

The same workload on a one-hour cache TTL would push hit rate to ~100% (sessions extend up to an hour), reducing input cost to ~$610/month, an 84% reduction. The one-hour TTL costs 2x the write rate (so $7.50/M), making the break-even with the longer window slightly later but with much higher steady-state savings.

8. The implementation checklist

Six steps to implement caching profitably:

  1. Measure your current input-token cost per route. If a route consumes more than $20/month and has a stable prefix above the minimum size, caching is worth implementing.
  2. Identify the largest stable prefix in each route. System prompt + few-shot examples + retrieved documents are the typical candidates.
  3. Place the cache breakpoint after the stable prefix, before any variable content (user input, today's date, session ID).
  4. Deploy and measure hit rate for one week using cache_read_input_tokens and cache_creation_input_tokens from the API response.
  5. If hit rate is below 50%, debug. Common causes: variable content above the breakpoint, prefix too small, model version mismatch, sessions too short for the TTL.
  6. If your sessions span more than 5 minutes, evaluate the 1-hour TTL extension. The extra write premium is worth it only if your steady-state hit rate inside the window is above 70%.

Prompt caching is one of the few API features where the ROI math is unambiguous and the implementation cost is hours, not days. The 90% read discount is large enough that any production workload with a stable prefix above 1,024 tokens reused 2+ times per TTL window saves money. The risk is implementing caching badly: marking a 500-token prefix that does not qualify, putting variable content above the breakpoint, or failing to measure the hit rate after deployment. Measure before, measure after, and tune the breakpoint until hit rate is above 70%.

References

Sources

Primary sources only. No vendor-marketing blogs or aggregated secondary claims.

  1. 1 Anthropic — Prompt caching documentation (mechanics, pricing, TTL, eligibility) — accessed 2026-05-08
  2. 2 Anthropic — API pricing page (per-million-token rates including cache write/read) — accessed 2026-05-08
  3. 3 OpenAI — Prompt caching announcement and pricing (50% discount on cached input tokens) — accessed 2026-05-08
  4. 4 Anthropic — Engineering blog: Prompt caching cookbook with real-world hit-rate measurements — accessed 2026-05-08
  5. 5 Anthropic Console — Workbench cache hit/miss telemetry documentation — accessed 2026-05-08
  6. 6 Google AI — Gemini context caching pricing and TTL behavior — accessed 2026-05-08

Tools referenced in this article

Related articles

Business planning estimates — not legal, tax, or accounting advice.