What Is Prompt Caching and When Does It Actually Save Money?
Key Insight
Prompt caching tells an LLM provider to remember the unchanging prefix of your prompt (system instructions, RAG documents, examples) so subsequent calls reuse the cached attention state. You pay roughly 10% of the regular input price on cache hits. It saves money when the same prefix is reused at least 2-3 times within the cache TTL — typical for RAG, multi-turn agents, and document-heavy chat. It does not save money for one-shot, varied queries.
What Is Prompt Caching?
Prompt caching is a feature where an LLM provider stores the attention state of the unchanging prefix of your prompt — system instructions, retrieved documents, examples — so subsequent calls reuse it instead of paying to recompute it. On Claude 4.7 you pay 10% of the regular input price for cache hits, on GPT-5 you pay 50%, and on Gemini 2.5 you pay 25%. For production apps with stable system prompts, this routinely cuts the LLM bill by 60-90% with zero quality change.
That's the 50-word answer. The rest of this post explains exactly how it works, when it pays back, and the code you need to ship it today.
How Does Prompt Caching Work?
Every LLM call goes through three phases on the provider's GPUs:
- Tokenization — converting your text into integer tokens
- Prefill — running attention over every token in the prompt to build the KV cache
- Decode — generating output tokens one at a time
Prefill is the expensive part. For a 10,000-token prompt, prefill consumes ~95% of the latency and ~98% of the input-side compute cost. The decode phase is comparatively cheap.
Prompt caching exploits a simple fact: if today's prompt starts with the same 8,000 tokens as yesterday's, the prefill work is identical. So providers added the ability to save the KV cache to fast storage (typically host RAM or a distributed cache layer) and replay it on the next matching call, paying only for the new suffix tokens at full price.
The engineering details vary, but the user-facing API contract is consistent across providers: mark a prefix as cacheable, and pay a discounted price when subsequent calls hit it.
The Anthropic prompt caching docs, the OpenAI prompt caching guide, and the Google Gemini context caching docs describe each provider's exact mechanics.
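As a rough mental model (not any provider's actual implementation), the server-side logic amounts to something like the sketch below: key the stored prefill work on the exact prompt prefix, and only pay to prefill the new suffix on a hit. Everything here (the hashing, the token estimate) is illustrative.

```typescript
import { createHash } from 'node:crypto';

// Toy mental model only — real providers key KV caches on exact token prefixes
// server-side; this just illustrates why an identical prefix means reusable work.
const tokensOf = (s: string) => Math.ceil(s.length / 4); // crude token estimate
const kvCache = new Map<string, number>(); // prefix hash -> tokens already prefilled

// Returns how many tokens this call must prefill at full price.
function tokensToPrefill(stablePrefix: string, variableSuffix: string): number {
  const key = createHash('sha256').update(stablePrefix).digest('hex');
  if (kvCache.has(key)) {
    return tokensOf(variableSuffix); // cache hit: only the new suffix is computed
  }
  kvCache.set(key, tokensOf(stablePrefix)); // cache write
  return tokensOf(stablePrefix) + tokensOf(variableSuffix); // miss: pay for everything once
}

const prefix = 'SYSTEM PROMPT + RAG DOCS '.repeat(400); // stand-in for ~8K tokens of stable context
console.log(tokensToPrefill(prefix, 'How do I restart the payments worker?')); // full prefill
console.log(tokensToPrefill(prefix, 'What changed in the last deploy?'));      // suffix only
```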
When Does Prompt Caching Save Money?
The break-even math is simple. Prompt caching saves money when:
(cache_write_price - base_input_price) / (base_input_price - cache_read_price) < expected_cache_reads_per_write
In plain English: the premium you pay to write the cache must be recovered by the discount on subsequent cache reads. On each provider (the sketch after this list runs the numbers):
- Claude 4.7: Write costs 1.25x base, read costs 0.1x base. Break-even lands at roughly 0.3 reads, so the very first cache hit already recovers the write premium; ten hits is dramatic savings.
- GPT-5: Write is implicit (no premium over base), read costs 0.5x base. Every cache hit saves money.
- Gemini 2.5: Write is implicit for auto-caching, explicit caching has a small storage fee. Read costs 0.25x base.
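A minimal sketch of that break-even check, expressed as price multipliers relative to the base input price (the multipliers below are the illustrative ones from this post — plug in your provider's current numbers):

```typescript
// Break-even: the write premium must be recovered by the per-read savings.
// All arguments are per-token price multipliers relative to the base input price.
function breakEvenReads(writeMult: number, readMult: number, baseMult = 1): number {
  return Math.max(0, (writeMult - baseMult) / (baseMult - readMult));
}

// Claude-style pricing (1.25x write, 0.1x read): break-even after ~0.28 reads,
// so the first cache hit already saves money.
console.log(breakEvenReads(1.25, 0.1)); // ~0.28

// GPT-5-style pricing (no write premium, 0.5x read): every hit saves.
console.log(breakEvenReads(1.0, 0.5)); // 0
```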
The workloads where this math overwhelmingly works:
- RAG with stable system prompts — same instructions and same document chunks reused thousands of times per day
- Multi-turn conversational agents — every turn re-sends the entire conversation history, which becomes the next turn's cached prefix
- Customer support copilots — long product manual or knowledge base loaded once, queried thousands of times
- Codebase-aware coding assistants — entire repository (or large chunks) sent as context across many edits
- Few-shot prompting at scale — same 10-20 example prompts reused for every classification or extraction call
When Is Prompt Caching NOT Worth It?
Three scenarios where caching costs you money or has no effect:
- One-shot queries with no shared structure — if every call has a totally unique prompt, there's nothing to cache
- Very short prompts — most providers have a minimum token threshold (Claude requires 1,024 tokens for caching to engage; GPT-5 and Gemini have similar floors). Below that threshold, the cache write overhead doesn't pay back.
- High-variability prefixes — if your "system prompt" changes per user (e.g. you inject the user's full profile at the top), the cache invalidates on every call and you pay write price with no read benefit. Restructure to put the variable bits at the end, not the start, of the prompt (a short sketch below shows the difference).
Rule of thumb: if your shared prefix is under 1,024 tokens or you reuse it fewer than 2 times per cache TTL, skip caching.
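A before/after sketch of that restructuring — the constant and variable names are illustrative, not from any particular SDK:

```typescript
// Illustrative constants — in a real app these come from your config and knowledge base.
const SYSTEM_INSTRUCTIONS = '...several thousand tokens of stable instructions...';
const KNOWLEDGE_BASE = '...several thousand tokens of shared reference docs...';

function buildPrompt(userProfile: string, question: string) {
  // Anti-pattern: the per-user profile leads the prompt, so no two calls share a
  // prefix and every call pays the cache-write price with no reads.
  const uncacheable = `${userProfile}\n\n${SYSTEM_INSTRUCTIONS}\n\n${KNOWLEDGE_BASE}\n\n${question}`;

  // Cache-friendly: stable content first (the cacheable prefix), user-specific content last.
  const cacheable = `${SYSTEM_INSTRUCTIONS}\n\n${KNOWLEDGE_BASE}\n\n${userProfile}\n\n${question}`;

  return { uncacheable, cacheable };
}
```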
Which Providers Offer Prompt Caching in 2026?
| Provider | Mechanism | Cache write price | Cache read price | TTL | Min tokens |
|---|---|---|---|---|---|
| Claude 4.7 (Sonnet) | Explicit cache_control | 1.25x base ($3.75/M) | 0.1x base ($0.30/M) | 5 min default, 1 hour optional | 1,024 |
| Claude 4.7 (Opus) | Explicit cache_control | 1.25x base ($18.75/M) | 0.1x base ($1.50/M) | 5 min / 1 hour | 1,024 |
| GPT-5 | Automatic prefix detection | Same as base ($5/M) | 0.5x base ($2.50/M) | ~5 min idle | 1,024 |
| Gemini 2.5 Pro | Implicit (auto) + explicit context caching | Same as base ($1.25/M) + small storage fee for explicit | 0.25x base ($0.31/M) | 5 min default, configurable | 1,024-4,096 (varies by mode) |
Always check live pricing at the source: Anthropic pricing, OpenAI pricing, Google AI pricing.
The shapes are different but the punchline is the same: every major provider now offers a 50-90% discount on the reused part of your prompt.
How to Implement Prompt Caching
Here's a concrete TypeScript example using the Anthropic SDK with Claude 4.7. The same pattern translates to GPT-5 (which auto-detects reused prefixes) and Gemini (which uses an explicit context-cache object rather than inline markers).
```typescript
import Anthropic from '@anthropic-ai/sdk';
const anthropic = new Anthropic();
// A long, stable system prompt + RAG context. This is the part we want cached.
const SYSTEM_INSTRUCTIONS = `You are a senior backend engineer helping
debug production issues at Acme Corp. You have access to the codebase,
runbooks, and incident history. Always respond with concrete next steps.
[... 4,000 more tokens of instructions ...]`;
const RAG_CONTEXT = `<runbooks>
[... 6,000 tokens of retrieved runbooks for the current incident ...]
</runbooks>`;
async function answerIncidentQuestion(userQuestion: string) {
return await anthropic.messages.create({
model: 'claude-sonnet-4-7-20260506',
max_tokens: 1024,
system: [
{
type: 'text',
text: SYSTEM_INSTRUCTIONS,
cache_control: { type: 'ephemeral' },
},
{
type: 'text',
text: RAG_CONTEXT,
cache_control: { type: 'ephemeral' },
},
],
messages: [
{
role: 'user',
content: userQuestion,
},
],
});
}
```

Two things worth noting:
- The cache_control breakpoints split the prompt into cacheable segments. Anthropic supports up to 4 breakpoints, so you can cache different prefixes that change at different rates (e.g. system prompt that never changes, RAG context that rotates daily, conversation history that grows per turn).
- The user question is *not* cached — it's the variable part. Put variable content at the end of the prompt, not the start.
For the 1-hour cache variant on Claude 4.7, change { type: 'ephemeral' } to { type: 'ephemeral', ttl: '1h' }.
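The same breakpoint mechanism covers the multi-turn agent case from earlier: keep a breakpoint on the stable system prompt and put one on the newest user message, so each turn's history is written once and read back at the discounted price on the next turn. A minimal sketch, reusing the same API shapes as above (the model id is the hypothetical one used throughout this post, and the system prompt is a stand-in):

```typescript
import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic();
const SYSTEM_INSTRUCTIONS = '...the long, stable system prompt from the example above...';

async function continueConversation(history: Anthropic.MessageParam[], newUserMessage: string) {
  const response = await anthropic.messages.create({
    model: 'claude-sonnet-4-7-20260506', // hypothetical model id used throughout this post
    max_tokens: 1024,
    system: [
      {
        type: 'text',
        text: SYSTEM_INSTRUCTIONS,
        cache_control: { type: 'ephemeral' }, // breakpoint 1: never-changing instructions
      },
    ],
    messages: [
      ...history,
      {
        role: 'user',
        content: [
          {
            type: 'text',
            text: newUserMessage,
            // Breakpoint 2: everything up to and including this turn is cached now
            // and becomes the next turn's cheap cache read.
            cache_control: { type: 'ephemeral' },
          },
        ],
      },
    ],
  });
  // Append this user turn and response.content to history before the next call.
  return response;
}
```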
GPT-5 requires no code changes — caching is automatic when you reuse prompt prefixes. The OpenAI prompt caching docs explain how to monitor cache hit rate via the API response.
For Gemini, the explicit context caching API lets you create a cache once and reuse the cache name across calls, which is more deliberate but gives you fine-grained TTL control.
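For completeness, here is roughly what that explicit-cache flow looks like with the @google/genai Node SDK. The call shapes below reflect that SDK at the time of writing — treat the exact parameter names as assumptions and confirm against the current Gemini docs:

```typescript
import { GoogleGenAI } from '@google/genai';

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

async function askWithCachedContext(longContext: string, question: string) {
  // Create the cache once; its name is reused across calls until the TTL expires.
  const cache = await ai.caches.create({
    model: 'gemini-2.5-pro',
    config: {
      contents: [{ role: 'user', parts: [{ text: longContext }] }],
      systemInstruction: 'Answer using only the provided context.',
      ttl: '3600s', // one hour; pick a TTL that matches your traffic pattern
    },
  });
  if (!cache.name) throw new Error('cache creation failed');

  // Subsequent calls reference the cache by name and pay the discounted read price.
  const response = await ai.models.generateContent({
    model: 'gemini-2.5-pro',
    contents: question,
    config: { cachedContent: cache.name },
  });

  return response.text;
}
```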
Worked Example: 1,000 Calls Per Day With 8K Reused Context
Let's calculate the actual savings on a typical production workload:
- 1,000 calls/day
- 8,000 tokens of reused system prompt + RAG context per call
- 200 tokens of unique user content per call
- 500 tokens of output per call
- Cache writes assumed once per 5-minute window (steady traffic keeps the cache warm; sparser traffic means more writes and a higher bill)
Without caching
- Input: 8,200 tokens × 1,000 calls = 8.2M tokens/day
- Output: 500 tokens × 1,000 calls = 0.5M tokens/day
| Provider | Daily input cost | Daily output cost | Daily total |
|---|---|---|---|
| Claude 4.7 Sonnet | $24.60 | $7.50 | $32.10 |
| GPT-5 | $41.00 | $10.00 | $51.00 |
| Gemini 2.5 Pro | $10.25 | $5.00 | $15.25 |
With caching (assume 95% cache hit rate)
For Claude 4.7 Sonnet:
- Cached input (95%): 7,790 tokens × 1,000 × $0.30/M = $2.34
- Uncached input (5% cache writes + variable suffix): 410 tokens × 1,000 × $3.75/M + 200 tokens × 1,000 × $3/M = ~$2.14
- Output: $7.50
- Total: ~$11.98/day (vs $32.10 without — 63% savings)
For GPT-5:
- Cached input (95%): 7,790 tokens × 1,000 × $2.50/M = $19.48
- Uncached input (5% + variable): ~$3.05
- Output: $10.00
- Total: ~$32.53/day (vs $51 without — 36% savings)
For Gemini 2.5 Pro:
- Cached input (95%): 7,790 tokens × 1,000 × $0.31/M = $2.42
- Uncached input: ~$0.76
- Output: $5.00
- Total: ~$8.18/day (vs $15.25 without — 46% savings)
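To reproduce these numbers (or run your own), here is a minimal version of the arithmetic. It models the split slightly more conservatively than the prose above — the variable suffix is never cacheable — so the totals land within a few percent of the figures shown; prices are per million tokens and purely illustrative:

```typescript
interface Pricing {
  input: number;      // $ per 1M input tokens
  output: number;     // $ per 1M output tokens
  cacheWrite: number; // $ per 1M tokens written to cache
  cacheRead: number;  // $ per 1M tokens read from cache
}

function dailyCost(
  p: Pricing,
  callsPerDay: number,
  reusedTokens: number,  // stable prefix per call
  uniqueTokens: number,  // variable suffix per call
  outputTokens: number,
  cacheHitRate: number,  // fraction of reused-prefix tokens served from cache
) {
  const M = 1_000_000;
  const reused = (reusedTokens * callsPerDay) / M;
  const unique = (uniqueTokens * callsPerDay) / M;
  const out = (outputTokens * callsPerDay) / M;

  const withoutCache = (reused + unique) * p.input + out * p.output;
  const withCache =
    reused * cacheHitRate * p.cacheRead +        // cache hits
    reused * (1 - cacheHitRate) * p.cacheWrite + // misses re-write the prefix
    unique * p.input +                           // variable suffix, always full price
    out * p.output;

  return { withoutCache, withCache, savings: 1 - withCache / withoutCache };
}

// Illustrative Claude 4.7 Sonnet prices from the table above
console.log(dailyCost(
  { input: 3, output: 15, cacheWrite: 3.75, cacheRead: 0.3 },
  1_000, 8_000, 200, 500, 0.95,
));
// → ~$32.10/day without caching vs ~$11.90 with caching (≈63% savings)
```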
Use our AI cost calculator to plug in your own numbers and see the break-even point on your workload.
Decision Matrix: Should You Cache This Prompt?
| Scenario | Cache? | Why |
|---|---|---|
| Stable system prompt > 1K tokens, reused 100+ times/day | Yes | Massive savings, near-zero risk |
| RAG with rotating documents but stable instructions | Yes (cache the instructions only) | Use multiple cache_control breakpoints |
| Multi-turn agent with growing history | Yes | Each turn's history becomes the next turn's cache hit |
| One-shot extraction with unique prompts each time | No | Nothing reused, no savings possible |
| Short prompts (< 1K tokens) | No | Below minimum cache token threshold |
| Variable user-specific data injected at top of prompt | Refactor first | Move variable data to end, then cache |
| Real-time streaming chat with unique system per session | Yes (cache per session) | Cache survives across that session's turns |
Common Pitfalls
A few mistakes we see in production code reviews:
- Putting timestamps or request IDs in the system prompt. This invalidates the cache on every call. Move them to a tool-call payload or the user message instead.
- Tool schemas changing between calls. If you reorder or modify tool definitions, the cache breaks. Keep tools stable, or cache only the system prompt without tools.
- Not monitoring cache hit rate. Both Anthropic and OpenAI return cache statistics in the response object. Log them (the sketch after this list shows where the numbers live). A cache hit rate below 70% in production means something upstream is breaking your prefix.
- Caching tiny prompts. If your prefix is 500 tokens, the write overhead may exceed the read savings. Use caching for prefixes ≥ 1,024 tokens.
- Forgetting TTL. A cache that expires before the next call gives you all the cost of the write and none of the read benefit. For sparse traffic, prefer Claude's 1-hour cache or Gemini's explicit context caching with a long TTL.
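A minimal monitoring sketch. The usage field names below follow the Anthropic and OpenAI Node SDKs at the time of writing, and the model ids are the hypothetical ones used throughout this post — verify both against your SDK version before relying on them:

```typescript
import Anthropic from '@anthropic-ai/sdk';
import OpenAI from 'openai';

const anthropic = new Anthropic();
const openai = new OpenAI();

async function logCacheHitRates(prompt: string) {
  // Anthropic reports cache reads and writes separately from uncached input tokens.
  const msg = await anthropic.messages.create({
    model: 'claude-sonnet-4-7-20260506', // hypothetical model id from the example above
    max_tokens: 256,
    messages: [{ role: 'user', content: prompt }],
  });
  const read = msg.usage.cache_read_input_tokens ?? 0;
  const written = msg.usage.cache_creation_input_tokens ?? 0;
  const uncached = msg.usage.input_tokens;
  console.log('anthropic', { read, written, uncached, hitRate: read / (read + written + uncached) });

  // OpenAI reports cached prompt tokens under prompt_tokens_details.
  const completion = await openai.chat.completions.create({
    model: 'gpt-5', // hypothetical model id used throughout this post
    messages: [{ role: 'user', content: prompt }],
  });
  const promptTokens = completion.usage?.prompt_tokens ?? 0;
  const cachedTokens = completion.usage?.prompt_tokens_details?.cached_tokens ?? 0;
  console.log('openai', { promptTokens, cachedTokens, hitRate: promptTokens ? cachedTokens / promptTokens : 0 });
}
```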
If your agent is also leaking context between calls (a related but distinct failure mode), our guide on why your AI agent loses context covers that side of the same problem.
How Prompt Caching Plays With Model Choice
The frontier-model comparison in our Claude 4.7 vs GPT-5 vs Gemini 2.5 Deep Think head-to-head shows that the three models trade off speed, quality, and price differently. Caching changes the price axis dramatically:
- Claude 4.7 Sonnet with caching lands at roughly $1.50/M effective input price for typical production workloads — cheaper than GPT-5 without caching
- Gemini 2.5 Pro with caching lands at roughly $0.50/M effective input price — by far the cheapest frontier option
- GPT-5 with caching lands at roughly $2.75/M effective — competitive but no longer the obvious pick on cost
If you've been defaulting to GPT-5 because of price perception, re-run the math with caching turned on. Many teams find the picture flips entirely.
Bottom Line
Prompt caching is the most effective single optimization available for production LLM apps in 2026. It is supported by every frontier provider, requires minimal code changes, and routinely cuts bills by 60-90% on workloads with stable prefixes.
The decision is binary: if your prompt has a reused prefix of at least 1,024 tokens that's seen at least 2 times within the cache TTL, turn caching on. Otherwise, don't bother.
For more frontier-model decisions, see our Claude 4.7 vs GPT-5 vs Gemini 2.5 head-to-head; for the bigger picture on production LLM cost optimization, see our pillar guide on [the best AI tools for developers in 2026](/blog/best-ai-tools-for-developers-2026).
Key Takeaways
- Prompt caching reduces the cost of reused prompt prefixes to roughly 10% of the regular input price on Claude 4.7, 50% on GPT-5, and 25% on Gemini 2.5
- Cache writes cost slightly more than regular input on Claude (1.25x base); GPT-5 and Gemini's implicit caching add no write premium but do enforce minimum-token thresholds
- Break-even on Claude 4.7 is roughly 2 cache hits per write — anything beyond that is pure savings
- Cache TTL matters more than people think: Claude offers 5-minute ephemeral and 1-hour cache, OpenAI auto-invalidates after ~5 minutes idle, Gemini lets you set it explicitly
- Caching saves the most money on long, stable system prompts and RAG document context — exactly the kind of workload most production apps already have
- It is not free: on providers with a write premium, a cache miss (after TTL expiry or a prefix change) costs more than an uncached call, so structure your prompts to maximize hit rate
- For high-volume production, prompt caching often cuts the LLM bill by 60-90% with zero quality change
Frequently Asked Questions
What is prompt caching in simple terms?
Prompt caching is a feature where you tell an LLM provider to remember the unchanging part of your prompt — usually the system instructions, retrieved documents, or few-shot examples — so the next call that starts with the same prefix doesn't pay full price. Instead of re-tokenizing and re-attending to every word, the provider replays the cached state. You typically pay 10-50% of the regular input price for cached tokens, with the exact discount depending on the provider.
How does prompt caching work technically?
Under the hood, the provider stores the key-value attention state (KV cache) of the prefix on a GPU after your first call. On the next call with the same prefix, instead of recomputing attention from scratch, the provider reuses the stored KV cache and only computes attention for the new tokens. This is dramatically cheaper because attention is the most expensive part of inference, and it is also why cached calls are usually faster — first-token latency drops by 30-80%.
When does prompt caching actually save money?
Prompt caching saves money any time the same prefix is reused at least 2-3 times within the cache TTL. Typical winners: RAG applications with stable system prompts and reused document context, multi-turn agents where every turn re-sends the conversation history, customer support copilots with a long product manual in context, and code assistants with codebase context. It does not save money for genuinely one-shot queries with no shared structure.
When is prompt caching NOT worth it?
Caching is wasted (and slightly more expensive than no cache) in three scenarios: one-shot queries where every call has a different prefix, very short prompts where the cache write overhead dominates, and high-variability workloads where prefix changes invalidate the cache before you get hits. As a rule of thumb, if your prefix is under 1,024 tokens or you reuse it fewer than twice per cache TTL, skip caching.
Which LLM providers support prompt caching in 2026?
All three frontier providers support it. Claude 4.7 offers explicit cache_control breakpoints with a 5-minute ephemeral cache and an optional 1-hour cache, charging 10% of base input price on cache reads. GPT-5 offers automatic prefix caching with a ~5-minute idle TTL and a 50% discount on cached input. Gemini 2.5 supports both implicit (auto, 5-minute) and explicit context caching with a 25% discount on cached input. Bedrock, Vertex AI, and Azure OpenAI all expose the same caching primitives as their underlying providers.
How much can prompt caching cut my LLM bill?
In production workloads with stable system prompts and reused RAG context, prompt caching typically cuts total LLM spend by 60-90% with zero change in output quality. The exact savings depend on three factors: how much of your prompt is reused (the bigger the cached prefix, the bigger the savings), how often you hit the cache (more hits per write = more savings), and which provider you use (Claude's 0.1x cache read price is the most aggressive discount).
Does prompt caching work with streaming and tool use?
Yes on all three providers. Claude 4.7's cache_control works seamlessly with streaming, tool use, and extended thinking. GPT-5's automatic caching applies to chat completions and the Assistants API including tool calls. Gemini 2.5's context caching is compatible with function calling, system instructions, and multimodal inputs. The one constraint is that anything that changes the prefix (including the order of tools or the system prompt) breaks the cache.
What is the difference between Claude 4.7 5-minute and 1-hour cache?
The 5-minute ephemeral cache (the default) costs 1.25x base input price to write and refreshes automatically every time it's read. The 1-hour cache costs 2x base input price to write but holds for a full hour without needing reads. Use the 1-hour variant for low-traffic but high-value caches (overnight batch jobs, sparse RAG queries) and the 5-minute ephemeral cache for high-traffic, sustained workloads where reads will keep refreshing the cache anyway.
About the Author
Aisha Patel
AI Editorial Desk · Web3AIBlog
Aisha Patel is a pen name for our AI editorial desk. Posts under this byline are written and reviewed by our team of contributors with backgrounds in machine learning, large language models, AI infrastructure, and applied research. The desk covers frontier model releases, agent architectures, retrieval-augmented generation, on-device inference, and the engineering tradeoffs that matter when shipping AI in production. Every technical claim is verified against primary sources before publication.