Anthropic vs OpenAI Prompt Caching: When It Pays Off

The Quick Answer

By 2026, prompt caching is on by default across the major providers but the implementations differ in ways that matter for production economics. Anthropic: explicit cache_control breakpoints, 5-minute and 1-hour TTLs, 90% discount on cache hits, 25% premium on cache writes. OpenAI: automatic prefix caching, 50% discount on cache hits over 1024 tokens, no manual control. AWS Bedrock: per-model cache support that mirrors the underlying provider's behavior, with regional caveats. The math: caching pays off for long system prompts (over 1024 tokens), multi-turn agent loops, and RAG with stable retrieval contexts. It doesn't pay off for one-shot calls or workflows where the prefix changes every request.

For the broader picture of what prompt caching does and the underlying mechanics, see LLM prompt caching. This article is the head-to-head provider comparison.

Last updated: April 2026 — verified Anthropic API caching docs, OpenAI's automatic prefix caching behavior, and AWS Bedrock per-model cache support.

Hero Comparison Table

	Anthropic	OpenAI	AWS Bedrock
Activation	Explicit `cache_control` markers	Automatic prefix caching	Mirrors underlying model + Bedrock controls
Hit discount	90% off input	50% off input on hits over 1024 tokens	Per model, typically 50-90%
Write premium	25% premium on first write	None (no premium)	Per model, typically 25%
Cache TTL	5 min default, 1 hour optional	~5-10 min (provider-managed)	Per model + region
Manual breakpoints	Up to 4 per request	None (auto)	Mirrors provider
Min tokens to cache	1,024 (Sonnet/Opus), 2,048 (Haiku)	1,024	Per model
Cross-request sharing	Yes (same org + same prompt)	Yes (same org)	Per model
Best for	Maximum control, agent loops, long system prompts	Set-and-forget, stable prefixes	Multi-model deployments on AWS

Anthropic: Explicit Cache Control

Anthropic's prompt caching is the most explicit of the three. You decide where the cache breakpoints go via cache_control markers in the prompt structure. The cache is keyed by the byte-exact prefix up to the breakpoint; subsequent requests that share that prefix hit the cache.

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"}  # 5-min TTL
        }
    ],
    messages=[
        {"role": "user", "content": "Hello, can you help me?"}
    ],
)

Pricing Math: Anthropic Cache

Operation	Sonnet 4.6 cost	Opus 4.7 cost	Haiku 4.5 cost
Standard input	$3/M	$15/M	$0.80/M
Cache write (first time, 25% premium)	$3.75/M	$18.75/M	$1.00/M
Cache hit (90% off)	$0.30/M	$1.50/M	$0.08/M
Output	$15/M	$75/M	$4/M

When Anthropic's Cache Pays Off

For a long system prompt of 8K tokens, used across 10 requests in 5 minutes:

Without cache: 10 × 8K × $3/M = $0.24 input cost
With cache: 1 × 8K × $3.75/M (write) + 9 × 8K × $0.30/M (hits) = $0.030 + $0.022 = $0.052 input cost
Savings: ~78%

The breakeven for Anthropic's cache: any prefix over 1,024 tokens used at least twice within 5 minutes (or 1 hour with the longer TTL). Below that threshold, the 25% write premium isn't recovered.

The 1-Hour TTL: When to Use It

Anthropic added a 1-hour TTL option in 2025. Pricing: write premium is 100% (vs 25% for 5-min). Hit discount is the same 90%. Math:

Worth it: when the same prefix is used across 60-minute windows with bursty access — e.g., an agent that runs every 15 minutes against a fixed 50K-token codebase context.
Not worth it: for high-frequency continuous traffic — the 5-min TTL covers it without the write premium.

OpenAI: Automatic Prefix Caching

OpenAI's caching is fully automatic. Send a prompt; if the prefix matches something cached recently from your org, you get the discount. No cache_control markers, no manual breakpoints.

Pricing Math: OpenAI Cache (GPT-5.4)

Operation	Cost
Standard input	$2.50/M
Cache hit (50% off)	$1.25/M
Output	$10/M
Cache write	No premium (same as standard input)

The hit discount is meaningfully smaller than Anthropic's (50% vs 90%). The lack of write premium offsets it slightly. The lack of manual control means you can't optimize prompt structure for cache stability — OpenAI decides cache boundaries internally.

When OpenAI's Auto-Cache Wins

Set-and-forget workflows: You don't have to think about cache markers. Sends data, gets discount.
Long stable system prompts: Prefix automatically cached. No work required.
Mixed-prefix traffic: OpenAI's cache covers shared prefixes across many request shapes; Anthropic's only covers exact-match.

When OpenAI's Auto-Cache Loses

Deep agent loops: Anthropic's manual control lets you cache the system prompt + first 30K of conversation, optimizing exactly the prefix that's stable. OpenAI's auto-detection doesn't always pick the same boundary.
RAG with stable retrieved context but variable user query: With Anthropic, you cache up to the retrieved-context boundary explicitly. With OpenAI, the auto-cache may miss the optimal boundary.
Cost-sensitive heavy use: 50% discount on hits is half of Anthropic's 90% — for heavy cache-hit workloads, the difference compounds.

AWS Bedrock: Per-Model Behavior

Bedrock's caching mirrors the underlying provider behavior with some Bedrock-specific quirks:

Anthropic models on Bedrock: Same cache_control mechanism as Anthropic direct. 90% hit discount, 25% write premium. The big caveat: cache is per-region, so US-East-1 cache doesn't apply to US-West-2.
OpenAI models on Bedrock: Available in select regions. Cache behavior matches OpenAI direct (auto, 50% off, 1024-token minimum).
Cohere / Mistral / Meta models on Bedrock: Cache support varies. Check per-model docs. Generally less mature than Anthropic/OpenAI direct.
Cross-region cache miss: A request that would have hit cache in US-East-1 can miss in US-West-2. For multi-region production, this is a real consideration.

When Bedrock Caching Is the Right Pick

Multi-model deployments where you want one billing path: Bedrock consolidates spend across providers.
Compliance requirements (data residency, FedRAMP) that direct provider APIs don't satisfy.
You're already deeply on AWS and the integration tax is paid for elsewhere.

When Bedrock Caching Loses

Provider feature lag: Anthropic's 1-hour TTL took longer to land on Bedrock than direct API. New features generally lag 2-4 weeks behind direct providers.
Per-region cache: Multi-region active-active production doubles cache costs vs a single-provider deployment.
Pricing margin: Bedrock prices are often slightly higher than direct provider rates (5-10%).

Implementation Patterns: Where to Put Cache Markers

Pattern 1: System Prompt Cache (Most Common)

Cache the system prompt. Variable user messages don't break the cache.

system=[
    {
        "type": "text",
        "text": SYSTEM_PROMPT,  # 5K tokens, stable
        "cache_control": {"type": "ephemeral"}
    }
]

Hit rate: 100% on continuous traffic. Cost: 1 cache write + N cheap hits.

Pattern 2: System + Tools Cache

For agent loops with many tools, cache the system prompt and tool definitions together. Tools are typically large (5-20K tokens) and stable across an agent's lifetime.

system=[
    {"type": "text", "text": SYSTEM_PROMPT, "cache_control": {"type": "ephemeral"}}
],
tools=[...],  # automatically included in cache before first cache_control breakpoint

Pattern 3: System + RAG Context

For RAG, cache the system prompt AND the retrieved context. User query changes per request but the rest is stable.

messages=[
    {
        "role": "user",
        "content": [
            {"type": "text", "text": RAG_CONTEXT},  # retrieved chunks, 30K tokens
            {"type": "text", "text": "", "cache_control": {"type": "ephemeral"}},
            {"type": "text", "text": USER_QUERY}  # variable, after the cache marker
        ]
    }
]

Pattern 4: Multi-Turn Agent Cache (Advanced)

For multi-turn agents (Claude Code-style harnesses), cache up to the most recent stable point in the conversation. As the agent makes new tool calls, move the cache marker forward.

# Turn 1: cache after system prompt
# Turn 2: cache after [system, turn 1 user, turn 1 assistant, turn 1 tool result]
# Turn 3: cache after [..., turn 2 result]
# etc.

This is the pattern Claude Code uses. See Claude Code subagents and skills for the harness layer where this matters.

How to Design System Prompts for Cache Stability

Lock the structure: System prompt format should be identical across requests. No date stamps, request IDs, or per-user variables in the prefix.
Vary only at the end: Per-request data (user query, current state, request-specific context) goes after the cache marker.
Avoid trailing whitespace and trivial differences: Even one character difference in the cached prefix breaks the cache. Be careful with template engines that add/remove whitespace.
Group cache breakpoints sensibly: For Anthropic, you have up to 4 breakpoints. Use them: system prompt at one, tools at another, retrieved context at a third, conversation history at the fourth.
Test cache hit rate: Anthropic's response includes cache_creation_input_tokens and cache_read_input_tokens. Monitor the ratio to verify caching is working.

Monitoring Cache Hit Rate

For production deployments, track cache hit rate per endpoint:

def log_cache_metrics(response):
    usage = response.usage
    total_input = (usage.input_tokens
                   + usage.cache_creation_input_tokens
                   + usage.cache_read_input_tokens)
    hit_rate = usage.cache_read_input_tokens / max(total_input, 1)
    print(f"Cache hit rate: {hit_rate:.1%}")

Healthy hit rates by use case:

Set-and-forget chat: 85-95% (system prompt is stable)
Multi-turn agent: 70-90% depending on harness sophistication
RAG with caching: 60-85% depending on retrieval-context stability
One-shot calls: 0% (cache doesn't help)

If you're under 50% on a workload that should be cacheable, the prefix is unstable somewhere. Common culprits: timestamps, request IDs in the system prompt, sub-100-token variation in template rendering.

Decision Matrix

Workload	Best provider	Why
Long system prompt + chat (5K-20K tokens)	Anthropic with explicit cache	90% off, manual control over cache shape
Set-and-forget API integration	OpenAI auto-cache	No code change, automatic discount
RAG with stable retrieved context	Anthropic with cache marker before user query	90% off the retrieved chunks
Multi-turn agent (Claude Code-style)	Anthropic with progressive cache markers	Best discount on the long stable prefix
One-shot calls, varying prefix	Whatever you already use	Cache doesn't help; pick on other axes
Multi-region active-active deployment	Direct Anthropic / OpenAI	Bedrock cache is per-region, doubles cost
FedRAMP / regulated cloud requirements	Bedrock	Compliance posture only AWS provides
Heavy ChatGPT-style integration	OpenAI auto-cache	50% off automatic on stable prefixes

Pro tip: Compute the breakeven point for your workload before adopting caching. If your prefix is below 1,024 tokens or used less than 2-3 times per cache window, the write premium isn't recovered. For one-shot prompts, caching is wasted overhead.

Frequently Asked Questions

What's the best LLM prompt caching provider in 2026?

Depends on workload. Anthropic gives the deepest discount (90% off cache hits) and the most control (explicit breakpoints, 4 markers per request, 5-min and 1-hour TTLs). OpenAI's automatic prefix caching is set-and-forget at 50% off. AWS Bedrock mirrors the underlying provider but with per-region cache and slight pricing markup. For maximum cost optimization on agentic workloads, Anthropic; for set-and-forget integration, OpenAI; for AWS-native compliance, Bedrock.

When does prompt caching pay off?

Prefix over 1,024 tokens (the minimum cacheable size on most providers) used at least 2-3 times within the cache TTL window (5 minutes for the cheap option, 1 hour for the premium). Specifically: long system prompts in chat-style apps, multi-turn agent loops, RAG with stable retrieved context. Doesn't pay off for one-shot calls, varying prefixes, or sub-1024-token prompts.

Why is Anthropic's cache hit discount higher than OpenAI's?

Anthropic structurally prices to encourage caching: 90% off on hits with a 25% write premium that you only pay once. OpenAI's automatic 50% caching is more conservative — automatic, no premium, but a smaller hit discount. The product trade-off: Anthropic gives more control + bigger discount but requires explicit setup; OpenAI gives less savings but zero work.

Should I use Anthropic's 1-hour cache TTL?

Only when you have bursty traffic that hits the same prefix sporadically across a 60-minute window. The 1-hour TTL has a 100% write premium (vs 25% for 5-min). For continuous traffic, 5-min TTL covers it without paying the higher write cost. For agents that wake every 10-30 minutes against a fixed context, 1-hour is the right pick.

Does prompt caching work with streaming responses?

Yes — caching is on the input prefix, not the output. Streaming the output works normally. The response usage data (cache_read_input_tokens, cache_creation_input_tokens) is in the final stream chunk, so you can monitor hit rates even on streamed requests.

Can I cache different prefixes for different users?

Anthropic's cache is byte-exact-match keyed within an org. Different prefixes don't share cache — that's the design. For multi-tenant apps, structure your prompts so the shared part (system prompt, tool definitions) is at the start (cached), and per-user/tenant data is after (not cached). For OpenAI, automatic caching does the same thing implicitly.

Bottom Line

Prompt caching changed LLM economics significantly in 2025-26. Anthropic offers the deepest discount with the most control; OpenAI offers automatic baseline savings with zero engineering work; Bedrock mirrors the underlying provider with regional caveats. For agentic workloads with long stable prefixes, Anthropic's 90% off compounds into real savings. For set-and-forget integrations, OpenAI's automatic 50% is the path of least resistance. The wrong move is not using caching at all on workloads that would benefit — that's leaving 50-90% of input cost on the table.

LLM Prompt Caching: Anthropic vs OpenAI vs Bedrock — When It Pays Off