Skip to content
AI/ML Engineering

LLM Prompt Caching: Anthropic vs OpenAI vs Bedrock — When It Pays Off

Anthropic 90% off with explicit breakpoints, OpenAI 50% auto, Bedrock per-region. Real cost math, when caching pays off, where to put cache markers, and the system-prompt design rules that make it work.

A
Abhishek Patel11 min read

Infrastructure engineer with 10+ years building production systems on AWS, GCP,…

LLM Prompt Caching: Anthropic vs OpenAI vs Bedrock — When It Pays Off
LLM Prompt Caching: Anthropic vs OpenAI vs Bedrock — When It Pays Off

The Quick Answer

By 2026, prompt caching is on by default across the major providers but the implementations differ in ways that matter for production economics. Anthropic: explicit cache_control breakpoints, 5-minute and 1-hour TTLs, 90% discount on cache hits, 25% premium on cache writes. OpenAI: automatic prefix caching, 50% discount on cache hits over 1024 tokens, no manual control. AWS Bedrock: per-model cache support that mirrors the underlying provider's behavior, with regional caveats. The math: caching pays off for long system prompts (over 1024 tokens), multi-turn agent loops, and RAG with stable retrieval contexts. It doesn't pay off for one-shot calls or workflows where the prefix changes every request.

For the broader picture of what prompt caching does and the underlying mechanics, see LLM prompt caching. This article is the head-to-head provider comparison.

Last updated: April 2026 — verified Anthropic API caching docs, OpenAI's automatic prefix caching behavior, and AWS Bedrock per-model cache support.

Hero Comparison Table

AnthropicOpenAIAWS Bedrock
ActivationExplicit cache_control markersAutomatic prefix cachingMirrors underlying model + Bedrock controls
Hit discount90% off input50% off input on hits over 1024 tokensPer model, typically 50-90%
Write premium25% premium on first writeNone (no premium)Per model, typically 25%
Cache TTL5 min default, 1 hour optional~5-10 min (provider-managed)Per model + region
Manual breakpointsUp to 4 per requestNone (auto)Mirrors provider
Min tokens to cache1,024 (Sonnet/Opus), 2,048 (Haiku)1,024Per model
Cross-request sharingYes (same org + same prompt)Yes (same org)Per model
Best forMaximum control, agent loops, long system promptsSet-and-forget, stable prefixesMulti-model deployments on AWS

Anthropic: Explicit Cache Control

Anthropic's prompt caching is the most explicit of the three. You decide where the cache breakpoints go via cache_control markers in the prompt structure. The cache is keyed by the byte-exact prefix up to the breakpoint; subsequent requests that share that prefix hit the cache.

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"}  # 5-min TTL
        }
    ],
    messages=[
        {"role": "user", "content": "Hello, can you help me?"}
    ],
)

Pricing Math: Anthropic Cache

OperationSonnet 4.6 costOpus 4.7 costHaiku 4.5 cost
Standard input$3/M$15/M$0.80/M
Cache write (first time, 25% premium)$3.75/M$18.75/M$1.00/M
Cache hit (90% off)$0.30/M$1.50/M$0.08/M
Output$15/M$75/M$4/M

When Anthropic's Cache Pays Off

For a long system prompt of 8K tokens, used across 10 requests in 5 minutes:

  • Without cache: 10 × 8K × $3/M = $0.24 input cost
  • With cache: 1 × 8K × $3.75/M (write) + 9 × 8K × $0.30/M (hits) = $0.030 + $0.022 = $0.052 input cost
  • Savings: ~78%

The breakeven for Anthropic's cache: any prefix over 1,024 tokens used at least twice within 5 minutes (or 1 hour with the longer TTL). Below that threshold, the 25% write premium isn't recovered.

The 1-Hour TTL: When to Use It

Anthropic added a 1-hour TTL option in 2025. Pricing: write premium is 100% (vs 25% for 5-min). Hit discount is the same 90%. Math:

  • Worth it: when the same prefix is used across 60-minute windows with bursty access — e.g., an agent that runs every 15 minutes against a fixed 50K-token codebase context.
  • Not worth it: for high-frequency continuous traffic — the 5-min TTL covers it without the write premium.

OpenAI: Automatic Prefix Caching

OpenAI's caching is fully automatic. Send a prompt; if the prefix matches something cached recently from your org, you get the discount. No cache_control markers, no manual breakpoints.

Pricing Math: OpenAI Cache (GPT-5.4)

OperationCost
Standard input$2.50/M
Cache hit (50% off)$1.25/M
Output$10/M
Cache writeNo premium (same as standard input)

The hit discount is meaningfully smaller than Anthropic's (50% vs 90%). The lack of write premium offsets it slightly. The lack of manual control means you can't optimize prompt structure for cache stability — OpenAI decides cache boundaries internally.

When OpenAI's Auto-Cache Wins

  • Set-and-forget workflows: You don't have to think about cache markers. Sends data, gets discount.
  • Long stable system prompts: Prefix automatically cached. No work required.
  • Mixed-prefix traffic: OpenAI's cache covers shared prefixes across many request shapes; Anthropic's only covers exact-match.

When OpenAI's Auto-Cache Loses

  • Deep agent loops: Anthropic's manual control lets you cache the system prompt + first 30K of conversation, optimizing exactly the prefix that's stable. OpenAI's auto-detection doesn't always pick the same boundary.
  • RAG with stable retrieved context but variable user query: With Anthropic, you cache up to the retrieved-context boundary explicitly. With OpenAI, the auto-cache may miss the optimal boundary.
  • Cost-sensitive heavy use: 50% discount on hits is half of Anthropic's 90% — for heavy cache-hit workloads, the difference compounds.

AWS Bedrock: Per-Model Behavior

Bedrock's caching mirrors the underlying provider behavior with some Bedrock-specific quirks:

  • Anthropic models on Bedrock: Same cache_control mechanism as Anthropic direct. 90% hit discount, 25% write premium. The big caveat: cache is per-region, so US-East-1 cache doesn't apply to US-West-2.
  • OpenAI models on Bedrock: Available in select regions. Cache behavior matches OpenAI direct (auto, 50% off, 1024-token minimum).
  • Cohere / Mistral / Meta models on Bedrock: Cache support varies. Check per-model docs. Generally less mature than Anthropic/OpenAI direct.
  • Cross-region cache miss: A request that would have hit cache in US-East-1 can miss in US-West-2. For multi-region production, this is a real consideration.

When Bedrock Caching Is the Right Pick

  • Multi-model deployments where you want one billing path: Bedrock consolidates spend across providers.
  • Compliance requirements (data residency, FedRAMP) that direct provider APIs don't satisfy.
  • You're already deeply on AWS and the integration tax is paid for elsewhere.

When Bedrock Caching Loses

  • Provider feature lag: Anthropic's 1-hour TTL took longer to land on Bedrock than direct API. New features generally lag 2-4 weeks behind direct providers.
  • Per-region cache: Multi-region active-active production doubles cache costs vs a single-provider deployment.
  • Pricing margin: Bedrock prices are often slightly higher than direct provider rates (5-10%).

Implementation Patterns: Where to Put Cache Markers

Pattern 1: System Prompt Cache (Most Common)

Cache the system prompt. Variable user messages don't break the cache.

system=[
    {
        "type": "text",
        "text": SYSTEM_PROMPT,  # 5K tokens, stable
        "cache_control": {"type": "ephemeral"}
    }
]

Hit rate: 100% on continuous traffic. Cost: 1 cache write + N cheap hits.

Pattern 2: System + Tools Cache

For agent loops with many tools, cache the system prompt and tool definitions together. Tools are typically large (5-20K tokens) and stable across an agent's lifetime.

system=[
    {"type": "text", "text": SYSTEM_PROMPT, "cache_control": {"type": "ephemeral"}}
],
tools=[...],  # automatically included in cache before first cache_control breakpoint

Pattern 3: System + RAG Context

For RAG, cache the system prompt AND the retrieved context. User query changes per request but the rest is stable.

messages=[
    {
        "role": "user",
        "content": [
            {"type": "text", "text": RAG_CONTEXT},  # retrieved chunks, 30K tokens
            {"type": "text", "text": "", "cache_control": {"type": "ephemeral"}},
            {"type": "text", "text": USER_QUERY}  # variable, after the cache marker
        ]
    }
]

Pattern 4: Multi-Turn Agent Cache (Advanced)

For multi-turn agents (Claude Code-style harnesses), cache up to the most recent stable point in the conversation. As the agent makes new tool calls, move the cache marker forward.

# Turn 1: cache after system prompt
# Turn 2: cache after [system, turn 1 user, turn 1 assistant, turn 1 tool result]
# Turn 3: cache after [..., turn 2 result]
# etc.

This is the pattern Claude Code uses. See Claude Code subagents and skills for the harness layer where this matters.

How to Design System Prompts for Cache Stability

  1. Lock the structure: System prompt format should be identical across requests. No date stamps, request IDs, or per-user variables in the prefix.
  2. Vary only at the end: Per-request data (user query, current state, request-specific context) goes after the cache marker.
  3. Avoid trailing whitespace and trivial differences: Even one character difference in the cached prefix breaks the cache. Be careful with template engines that add/remove whitespace.
  4. Group cache breakpoints sensibly: For Anthropic, you have up to 4 breakpoints. Use them: system prompt at one, tools at another, retrieved context at a third, conversation history at the fourth.
  5. Test cache hit rate: Anthropic's response includes cache_creation_input_tokens and cache_read_input_tokens. Monitor the ratio to verify caching is working.

Monitoring Cache Hit Rate

For production deployments, track cache hit rate per endpoint:

def log_cache_metrics(response):
    usage = response.usage
    total_input = (usage.input_tokens
                   + usage.cache_creation_input_tokens
                   + usage.cache_read_input_tokens)
    hit_rate = usage.cache_read_input_tokens / max(total_input, 1)
    print(f"Cache hit rate: {hit_rate:.1%}")

Healthy hit rates by use case:

  • Set-and-forget chat: 85-95% (system prompt is stable)
  • Multi-turn agent: 70-90% depending on harness sophistication
  • RAG with caching: 60-85% depending on retrieval-context stability
  • One-shot calls: 0% (cache doesn't help)

If you're under 50% on a workload that should be cacheable, the prefix is unstable somewhere. Common culprits: timestamps, request IDs in the system prompt, sub-100-token variation in template rendering.

Decision Matrix

WorkloadBest providerWhy
Long system prompt + chat (5K-20K tokens)Anthropic with explicit cache90% off, manual control over cache shape
Set-and-forget API integrationOpenAI auto-cacheNo code change, automatic discount
RAG with stable retrieved contextAnthropic with cache marker before user query90% off the retrieved chunks
Multi-turn agent (Claude Code-style)Anthropic with progressive cache markersBest discount on the long stable prefix
One-shot calls, varying prefixWhatever you already useCache doesn't help; pick on other axes
Multi-region active-active deploymentDirect Anthropic / OpenAIBedrock cache is per-region, doubles cost
FedRAMP / regulated cloud requirementsBedrockCompliance posture only AWS provides
Heavy ChatGPT-style integrationOpenAI auto-cache50% off automatic on stable prefixes

Pro tip: Compute the breakeven point for your workload before adopting caching. If your prefix is below 1,024 tokens or used less than 2-3 times per cache window, the write premium isn't recovered. For one-shot prompts, caching is wasted overhead.

Frequently Asked Questions

What's the best LLM prompt caching provider in 2026?

Depends on workload. Anthropic gives the deepest discount (90% off cache hits) and the most control (explicit breakpoints, 4 markers per request, 5-min and 1-hour TTLs). OpenAI's automatic prefix caching is set-and-forget at 50% off. AWS Bedrock mirrors the underlying provider but with per-region cache and slight pricing markup. For maximum cost optimization on agentic workloads, Anthropic; for set-and-forget integration, OpenAI; for AWS-native compliance, Bedrock.

When does prompt caching pay off?

Prefix over 1,024 tokens (the minimum cacheable size on most providers) used at least 2-3 times within the cache TTL window (5 minutes for the cheap option, 1 hour for the premium). Specifically: long system prompts in chat-style apps, multi-turn agent loops, RAG with stable retrieved context. Doesn't pay off for one-shot calls, varying prefixes, or sub-1024-token prompts.

Why is Anthropic's cache hit discount higher than OpenAI's?

Anthropic structurally prices to encourage caching: 90% off on hits with a 25% write premium that you only pay once. OpenAI's automatic 50% caching is more conservative — automatic, no premium, but a smaller hit discount. The product trade-off: Anthropic gives more control + bigger discount but requires explicit setup; OpenAI gives less savings but zero work.

Should I use Anthropic's 1-hour cache TTL?

Only when you have bursty traffic that hits the same prefix sporadically across a 60-minute window. The 1-hour TTL has a 100% write premium (vs 25% for 5-min). For continuous traffic, 5-min TTL covers it without paying the higher write cost. For agents that wake every 10-30 minutes against a fixed context, 1-hour is the right pick.

Does prompt caching work with streaming responses?

Yes — caching is on the input prefix, not the output. Streaming the output works normally. The response usage data (cache_read_input_tokens, cache_creation_input_tokens) is in the final stream chunk, so you can monitor hit rates even on streamed requests.

Can I cache different prefixes for different users?

Anthropic's cache is byte-exact-match keyed within an org. Different prefixes don't share cache — that's the design. For multi-tenant apps, structure your prompts so the shared part (system prompt, tool definitions) is at the start (cached), and per-user/tenant data is after (not cached). For OpenAI, automatic caching does the same thing implicitly.

Bottom Line

Prompt caching changed LLM economics significantly in 2025-26. Anthropic offers the deepest discount with the most control; OpenAI offers automatic baseline savings with zero engineering work; Bedrock mirrors the underlying provider with regional caveats. For agentic workloads with long stable prefixes, Anthropic's 90% off compounds into real savings. For set-and-forget integrations, OpenAI's automatic 50% is the path of least resistance. The wrong move is not using caching at all on workloads that would benefit — that's leaving 50-90% of input cost on the table.

A

Written by

Abhishek Patel

Infrastructure engineer with 10+ years building production systems on AWS, GCP, and bare metal. Writes practical guides on cloud architecture, containers, networking, and Linux for developers who want to understand how things actually work under the hood.

Related Articles

Enjoyed this article?

Get more like this in your inbox. No spam, unsubscribe anytime.

Comments

Loading comments...

Leave a comment

Stay in the loop

New articles delivered to your inbox. No spam.