
LLM Prompt Caching: Cut API Costs 90%

Prompt caching cuts LLM API bills 50-90% by reusing the KV cache for stable prefixes. Anthropic, OpenAI, Gemini, and vLLM compared with real pricing, implementation patterns, and four workload simulations.

Abhishek Patel · 15 min read


The 90% Discount Hiding in Your API Bill

Prompt caching is the single biggest lever for cutting LLM API costs in 2026, and most teams are leaving it on the table. If you're sending the same 20k-token system prompt on every request, you're paying full price for tokens the provider already has in memory. Turn caching on and the same call drops from roughly $0.003 per request to $0.0003 — a 10x cost reduction on the static portion of your prompt, with no change to output quality. I've watched a customer-support agent's monthly bill fall from $4,200 to $680 after one afternoon of prompt restructuring.

This piece walks through what server-side prompt caching actually is, how the four dominant providers (Anthropic, OpenAI, Google, vLLM for self-hosted) implement it, the concrete patterns to restructure your prompts for cache hits, and the provider-by-provider cost math you need to run before picking one. The edge cases — cache-miss debugging, multi-turn state, long-context TTL tradeoffs — are what I send to the newsletter; what's below is the production baseline.

Last updated: April 2026 — verified Anthropic 5m/1h TTL pricing multipliers, OpenAI 50% automatic discount with 1024-token minimum, Gemini 2.5/3 explicit cache discount bands (90%/75%), and vLLM APC hash-table semantics.

What Prompt Caching Actually Does (and Doesn't)

Definition: Prompt caching is provider-side storage of the attention-layer key-value (KV) cache for a stable prompt prefix. When a later request arrives with the same prefix, the model skips the prefill computation for those tokens and processes only the new suffix. It's not response caching — every request still produces fresh output — it's compute reuse on the input side.

The confusion is worth untangling up front. Client-side response caching (hitting Redis for a prior model answer to an identical query) is a different tool with different tradeoffs — see caching strategies for the broader taxonomy. Prompt caching lives on the provider's GPU, uses the same mechanism as the serving engine's KV cache, and is tightly coupled to how LLM inference uses tokens, context windows, and the KV cache. The mental model: the provider keeps a warm GPU state for your prefix and charges you a fraction of input cost to read from it.

Two hard rules govern what can be cached:

  • Prefix must be byte-identical. One stray whitespace character in your system prompt and the hash misses. Every byte before your first dynamic token must match the cached version exactly.
  • Caching has a minimum size. OpenAI's threshold is 1,024 tokens; Anthropic's varies per model (1,024 for Haiku, 2,048 for Sonnet/Opus); Gemini's explicit cache has no hard floor but the storage billing makes sub-4k-token caches uneconomical.
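Both rules come down to how you assemble the prompt. A minimal sketch of a prefix-stable builder (the function names are mine, not a provider API), assuming the static block is frozen once at startup:

```typescript
/** Strip the invisible characters that most often break byte-identity. */
function normalizePrefix(text: string): string {
  return text
    .replace(/\u00A0/g, ' ')    // non-breaking spaces
    .replace(/\r\n/g, '\n')     // Windows line endings
    .replace(/[ \t]+$/gm, '')   // trailing whitespace on each line
    .trimEnd();                 // trailing newlines at the end
}

// Freeze the static prefix once at module load, never per request.
const STATIC_PREFIX = normalizePrefix(
  'You are a support agent.\n<tool schemas, few-shot examples, policies>'
);

/** Everything dynamic goes strictly after the cached boundary. */
function buildPrompt(userQuery: string, requestId: string): string {
  return `${STATIC_PREFIX}\n\n[request ${requestId}] ${userQuery}`;
}

const a = buildPrompt('Reset my password', 'r-1');
const b = buildPrompt('Cancel my plan', 'r-2');
// a and b share identical bytes up to the end of STATIC_PREFIX,
// so the provider's prefix hash can match on the second call.
```

The key design choice: dynamic fields like the request ID render below the boundary, so consecutive calls always share the cached bytes.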

How the KV Cache Gets Reused

Under the hood, the transformer computes attention by turning each token into a key and a value vector. During generation, those vectors get reused for every subsequent token — that's the PagedAttention idea vLLM built its serving engine around. Prompt caching extends the same optimization across requests: the provider keeps the K/V tensors for your prefix in GPU memory (or a tiered cache behind it) and maps incoming requests to them by hash.

```mermaid
flowchart LR
  A[Request 1: 20k system + user A] -->|prefill all 20k tokens| B[GPU fills KV cache]
  B -->|store prefix hash| C[(KV cache store)]
  D[Request 2: 20k system + user B] -->|hash match on prefix| C
  C -->|reuse 20k KV tensors| E[GPU prefill only user B]
  E --> F[Generate response B]
```

The wins compound in three places: input token cost (you pay the cache-read rate, typically 10-50% of base), time-to-first-token latency (up to 80% reduction per OpenAI's own numbers), and throughput on self-hosted serving engines where freed GPU cycles serve other requests. Throughput matters less on managed APIs where the provider absorbs it, but it's the primary reason vLLM users turn APC on.

Provider Comparison: Anthropic vs OpenAI vs Google vs vLLM

| Provider | Activation | Cache-write cost | Cache-read cost | Default TTL | Min tokens |
| --- | --- | --- | --- | --- | --- |
| Anthropic | Explicit cache_control block | 1.25x input (5m) / 2x input (1h) | 0.1x input (90% off) | 5 min (paid 1h) | 1,024-2,048 |
| OpenAI | Automatic, opt-out | No premium | 0.5x input (50% off) | ~5-10 min opportunistic | 1,024 |
| Google Gemini | Explicit cachedContents API | 1x input + storage/hr | 0.1x on 2.5 (90%), 0.25x on 3 Pro (75%) | 60 min (user-set) | ~4k recommended |
| vLLM (self-hosted) | Flag --enable-prefix-caching | Free (GPU cycles) | Free (GPU cycles) | Until LRU eviction | Block size (16 tokens default) |

Anthropic: explicit control, highest discount

Anthropic's model requires you to mark where the cache ends with a cache_control block. You can place up to four breakpoints in a single request, which lets you cache a tools-block, a system prompt, and a long RAG context independently. Cache writes cost 1.25x the base input rate for 5-minute TTL and 2x for the 1-hour TTL tier; reads are 0.1x regardless. The break-even is one cache read for 5m, two for 1h — easy to hit for any repeat workload.
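The break-even arithmetic can be checked directly. A quick sketch (the function name is mine; the multipliers are Anthropic's published ones): a cache write costs writeMult × base input, each read costs readMult × base, and each read replaces an uncached call that would cost 1 × base.

```typescript
// Cached total after n reads: writeMult + n·readMult; uncached: 1 + n.
function breakEvenReads(writeMult: number, readMult: number): number {
  // Smallest integer n with writeMult + n·readMult < 1 + n
  const x = (writeMult - 1) / (1 - readMult);
  const n = Math.ceil(x);
  return Math.max(1, x === n ? n + 1 : n);
}

const fiveMinuteTtl = breakEvenReads(1.25, 0.1); // 1 read pays for the write
const oneHourTtl = breakEvenReads(2.0, 0.1);     // 2 reads pay for the write
```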

As of Q1 2026 there's a live gotcha: Anthropic silently regressed the default TTL to 5 minutes for some Claude Code sessions in early March 2026, causing surprise cost inflation for teams that assumed 1h. Always pass an explicit "ttl": "1h" if you need the longer window. For Claude Opus 4.7 — and in any Opus 4.7 vs GPT-5.4 comparison — the cache multipliers are the same 1.25x/2x; only the base price changes.

```typescript
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

const response = await client.messages.create({
  model: 'claude-opus-4-7-20260115',
  max_tokens: 1024,
  system: [
    {
      type: 'text',
      text: LARGE_STATIC_SYSTEM_PROMPT,
      cache_control: { type: 'ephemeral', ttl: '1h' },
    },
  ],
  messages: [{ role: 'user', content: userQuery }],
});

// response.usage gives you:
// cache_creation_input_tokens: paid at 2x (first call only)
// cache_read_input_tokens: paid at 0.1x (every subsequent hit)
```

OpenAI: automatic, opt-out, 50% off

OpenAI turned on automatic prefix caching in late 2024 and the mechanism has only gotten more aggressive since. You do nothing — any prompt above 1,024 tokens, sent to one of the latest GPT-4o/4.1/5.x or o-series snapshots, routes to a server that recently processed the same prefix. When it does, the cached input tokens are billed at half the base rate. Cache hits grow in 128-token increments beyond the 1,024-token minimum, so a 1,200-token prefix caches at most 1,152 tokens and the remaining 48 are billed at the full input rate.

The usage field on the response returns cached_tokens so you can measure hit rate in production. The tradeoff: no control over TTL (it's opportunistic, usually 5-10 minutes), no way to pin a prefix, and no 1-hour tier. The flat 50% discount is also less aggressive than Anthropic's 90%, but the zero-config ergonomics are genuinely nice.
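Measuring that hit rate is a few lines of arithmetic over the usage payloads. A sketch (the field shape mirrors OpenAI's usage object as described above; the aggregation helper is mine):

```typescript
interface OpenAIUsage {
  prompt_tokens: number;
  prompt_tokens_details?: { cached_tokens?: number };
}

/** Fraction of input tokens served from cache on a single call. */
function hitRate(u: OpenAIUsage): number {
  const cached = u.prompt_tokens_details?.cached_tokens ?? 0;
  return u.prompt_tokens > 0 ? cached / u.prompt_tokens : 0;
}

/** Rolling aggregate across calls; alert when this drifts below ~0.7. */
function aggregateHitRate(usages: OpenAIUsage[]): number {
  let cached = 0;
  let total = 0;
  for (const u of usages) {
    cached += u.prompt_tokens_details?.cached_tokens ?? 0;
    total += u.prompt_tokens;
  }
  return total > 0 ? cached / total : 0;
}
```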

Google Gemini: explicit with storage pricing

Gemini's cachedContents API takes a different shape: you create a named cache object with a TTL, get back a cache ID, and reference it in subsequent requests. Read discount on Gemini 2.5 is 90% (matching Anthropic); on Gemini 3 Pro it's 75%. You also pay for the cache to exist — Pro storage runs $4.50/M tokens/hour, Flash runs $1.00/M tokens/hour. A 100k-token cache kept alive for 24 hours costs roughly $10.80 (Pro) or $2.40 (Flash) just in storage, before any reads.

That storage-billed model inverts the math: Gemini caching pays off on high-volume workloads (hundreds of reads per hour over a stable prefix) and loses money on long-tail patterns where you cache, read twice, and let it expire. Run the break-even: storage_cost_per_hour / (per_read_savings) = reads-per-hour needed to justify the cache. Below that, you're better off not caching on Gemini.
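That break-even is worth encoding once. A sketch (function name and example prices are illustrative; plug in current list prices):

```typescript
// A Gemini explicit cache pays off only when hourly read savings
// exceed the hourly storage rent.
function geminiBreakEvenReadsPerHour(
  inputPricePerMTok: number,  // base input price, $/M tokens
  storagePerMTokHour: number, // e.g. $4.50/M tokens/hour (Pro), $1.00 (Flash)
  readDiscount: number        // 0.9 on Gemini 2.5, 0.75 on 3 Pro
): number {
  // Both storage rent and per-read savings scale with cache size,
  // so the cache token count cancels out of the ratio.
  return storagePerMTokHour / (inputPricePerMTok * readDiscount);
}

// Hypothetical: $3/M input, Pro storage, 90% discount gives
// ~1.7 reads/hour needed before the cache earns its keep.
const threshold = geminiBreakEvenReadsPerHour(3, 4.5, 0.9);
```

Below the threshold, skip the cache and pay full input price; above it, the storage rent amortizes away.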

vLLM: free on your own GPUs

For self-hosted inference, vLLM ships with automatic prefix caching behind the --enable-prefix-caching flag (default on in recent versions). vLLM hashes KV-cache blocks of size 16 tokens, keeps a global hash table of physical blocks, and routes matching-prefix requests to the same physical pages. Eviction is LRU. The cost is GPU memory — cached blocks occupy KV cache capacity that could otherwise batch fresh requests — so tune --kv-cache-memory against your concurrency profile. If you're weighing self-hosted vs managed, how Ollama, vLLM, and llama.cpp compare for local inference is the broader frame.
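The hash-table semantics can be modeled in a few lines. This is a conceptual sketch of the chained block-hash lookup only; the real implementation is Python and hashes KV-cache blocks on the GPU:

```typescript
const BLOCK_SIZE = 16; // vLLM's default prefix-caching block size

/** Chained per-block identifiers; a string join stands in for a real hash. */
function blockHashes(tokenIds: number[]): string[] {
  const hashes: string[] = [];
  let prev = 'root';
  for (let i = 0; i + BLOCK_SIZE <= tokenIds.length; i += BLOCK_SIZE) {
    prev = `${prev}|${tokenIds.slice(i, i + BLOCK_SIZE).join(',')}`;
    hashes.push(prev);
  }
  return hashes;
}

/** How many leading tokens of a new request can reuse cached blocks. */
function cachedPrefixTokens(cache: Set<string>, tokenIds: number[]): number {
  let reused = 0;
  for (const h of blockHashes(tokenIds)) {
    if (!cache.has(h)) break;
    reused += BLOCK_SIZE;
  }
  return reused;
}

// Warm the cache with one 48-token request, then send a request that
// diverges at token 32: only the first two blocks are reusable.
const warm = Array.from({ length: 48 }, (_, i) => i);
const cache = new Set(blockHashes(warm));
const diverging = warm.slice(0, 32).concat(Array.from({ length: 16 }, () => 999));
```

Because each block's hash chains the previous one, an edit anywhere in the prompt invalidates every block after it: the self-hosted analogue of the byte-identical-prefix rule above.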

Pricing Math: Four Real Workloads

Theoretical discounts mean nothing without a workload. These four are drawn from production systems I've deployed or audited — numbers are monthly, assume $3/$15 per million input/output tokens (Claude Sonnet 4 baseline) or $2/$8 (GPT-5 baseline), and exclude batch discounts.

| Workload | Prefix size | Calls/month | No cache | Anthropic 1h | OpenAI auto |
| --- | --- | --- | --- | --- | --- |
| RAG support agent | 20k tokens | 200,000 | $12,000 | $1,200 (90% off) | $6,000 (50% off) |
| Code-editing agent (full repo) | 80k tokens | 50,000 | $12,000 | $1,200 | $6,000 |
| Chatbot (5k sys prompt) | 5k tokens | 1,000,000 | $15,000 | $1,500 | $7,500 |
| Document Q&A (150k context) | 150k tokens | 10,000 | $4,500 | $450 | $2,250 |
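The table reduces to one formula; a quick sketch (the function name is mine, prices and discounts as stated above, assuming a warm cache on every call as the table does):

```typescript
// Monthly input cost = prefix tokens × calls × price per token,
// with the cache-read discount applied.
function monthlyCost(
  prefixTokens: number,
  callsPerMonth: number,
  pricePerMTok: number,
  discount: number // 0 = no cache, 0.9 = Anthropic read, 0.5 = OpenAI
): number {
  const mTok = (prefixTokens * callsPerMonth) / 1_000_000;
  return mTok * pricePerMTok * (1 - discount);
}

// RAG support agent row: 20k-token prefix, 200k calls/month at $3/M input
const noCache = monthlyCost(20_000, 200_000, 3, 0);     // $12,000
const anthropic = monthlyCost(20_000, 200_000, 3, 0.9); // ~$1,200
const openai = monthlyCost(20_000, 200_000, 3, 0.5);    // $6,000
```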

The code-editing agent case is the most dramatic. We migrated a production agent at a 12-engineer shop from uncached GPT-4.1 to Claude Sonnet 4 with 1h caching in November 2025; the monthly bill fell from $8,900 to $1,350 and the p50 latency dropped 40% because prefill was skipped. The first time we deployed the change without the cache_control block we burned $1,200 in a day rebuilding the cache on every request — the lesson: monitor cache_creation_input_tokens the first week and alert if it's not trending toward zero.

How to Structure Prompts for Cache Hits

The rule is one line long, and the implementation is where nine out of ten teams fail: static content first, dynamic content last, byte-identical across requests. Every dynamic piece — user IDs, timestamps, retrieval chunks — belongs downstream of the cache boundary.

  1. Audit your prompt template. Print the full prompt for 10 consecutive production calls and diff them. Every byte that differs between the start and the first dynamic field is a cache-miss source. Common culprits: timestamps in system prompts, ISO-formatted dates, session IDs, user names.
  2. Pin the stable block. Move everything stable (role definition, tool schemas, few-shot examples, static retrieval context) to the top. Insert the cache breakpoint (Anthropic) or rely on prefix matching (OpenAI/vLLM) right after it.
  3. Use a cache-friendly RAG ordering. Counterintuitively, cache the retrieval block too when it's stable for a session — put retrieved docs above the user query, not interleaved with it. See how RAG actually ships to production for the full picture; caching is one of the steps that turns a demo into real economics.
  4. Normalize whitespace. Strip trailing newlines, tabs, and smart quotes before building the prompt. One invisible U+00A0 non-breaking space is enough to miss.
  5. Measure hit rate per request. Log cache_read_input_tokens / (cache_read_input_tokens + input_tokens) per call. Target >85% on any mature deployment; below 70% means a structural problem, not a tuning one.
  6. Respect TTL. If your call frequency is under one-per-5-minutes, 5m cache is burning cache-write premiums for no gain. Either batch requests, switch to 1h TTL, or accept you don't qualify for caching.
  7. Add cache-miss alerting. A sudden spike in cache_creation_input_tokens on Anthropic or a drop in cached_tokens on OpenAI is your canary for a bug or deploy-induced prompt drift. Wire it into your AI agent framework observability.
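Step 5's metric and thresholds can be wired up directly. A sketch using Anthropic's usage field names (the bucket names and classifier are mine):

```typescript
interface AnthropicUsage {
  input_tokens: number;                // uncached input, billed at 1x
  cache_read_input_tokens: number;     // billed at 0.1x
  cache_creation_input_tokens: number; // billed at 1.25x-2x
}

function classifyHitRate(u: AnthropicUsage): 'healthy' | 'tune' | 'structural' {
  const denom = u.cache_read_input_tokens + u.input_tokens;
  const rate = denom > 0 ? u.cache_read_input_tokens / denom : 0;
  if (rate > 0.85) return 'healthy';
  if (rate >= 0.7) return 'tune';      // tuning problem
  return 'structural';                 // restructure the prompt instead
}
```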
Monitor cache hit rate per endpoint — anything below 70% signals a prompt-drift bug or an unsuitable workload.

Debugging Cache Misses in Production

Caching is invisible when it works and maddening when it doesn't. The four bugs I've seen most often:

  • Template substitution drift. Jinja or string templates that include timestamps, UUIDs, or OS-specific line endings. Fix: freeze the prefix at deploy time, generate it once, inject dynamic fields below the cache boundary.
  • Model version mismatch. Caches are per-model-snapshot. Upgrading from gpt-4o-2024-08-06 to gpt-4o-2024-11-20 invalidates every warm cache in your account. Schedule upgrades with expected cost spikes for the first hour.
  • Tool-block reordering. On Anthropic, tools appear before system in the hash order. If your SDK wrapper sorts tool definitions alphabetically on one deploy and by registration order on another, every call misses.
  • Region drift. Cache is regional. A request routed to us-east-1 won't hit a cache warmed in us-west-2. Pin the region if you care about hit rate, especially on LLM API pricing-sensitive workloads.

Watch out: Cache metrics lag by 10-30 seconds on most providers. If you deploy a prompt change and immediately check hit rate, you'll see a false drop. Wait two minutes before panicking — and keep a rollback plan ready.

The subtlest miss I debugged cost a startup a $3,100 weekend: they added a trace-ID field to their OpenTelemetry-instrumented prompt builder, injected at position three inside the system message. Hit rate dropped from 94% to 11% in under an hour. The fix took six lines — move the trace ID to a metadata field rendered below the user query — but the postmortem took two days because nobody thought to diff the byte sequences. The cost accounting rule: any instrumentation that touches the prompt body needs a cache-regression test in CI. Run a golden-prompt build, hash it, compare against the last-known-good hash; fail the build on drift.
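The CI guard described above, sketched: hash the prompt prefix built by the same code path production uses and compare against a committed last-known-good hash. buildStaticPrefix and the inline golden value are placeholders for your real builder and stored hash.

```typescript
import { createHash } from 'node:crypto';

function buildStaticPrefix(): string {
  // Must call the real production builder; a test-only copy defeats the point.
  return 'You are a support agent.\n<tools>\n<policies>';
}

function sha256(text: string): string {
  return createHash('sha256').update(text, 'utf8').digest('hex');
}

// In CI this is read from a committed golden file, not recomputed.
const GOLDEN_HASH = sha256(buildStaticPrefix());

function assertNoPromptDrift(): void {
  const current = sha256(buildStaticPrefix());
  if (current !== GOLDEN_HASH) {
    throw new Error(
      `Prompt prefix drift: ${current} != ${GOLDEN_HASH}; ` +
      'every warm cache will miss after this deploy.'
    );
  }
}

assertNoPromptDrift(); // fails the build on drift
```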

When to Pick Which Provider

  • Pick Anthropic if: You're running repeat-prefix workloads (agents, RAG, code editors) at scale and want the deepest 90% discount. The 1h TTL tier makes the math work below once-per-5-minutes cadence. Explicit control is worth the SDK complexity.
  • Pick OpenAI if: You want zero-config ergonomics, your workloads have naturally high repetition at 5-10 minute cadence, and 50% off is enough savings to ship. Also the right pick if you haven't adopted caching yet — it happens automatically the moment you cross 1,024 tokens.
  • Pick Google Gemini if: You're doing high-volume, stable-prefix workloads (>100 reads/hour) where storage cost amortizes cheaply, or you need the 1M+ token context window. Skip Gemini caching for long-tail patterns — the storage billing eats you alive.
  • Pick vLLM self-hosted if: You're already running your own GPUs, your workload saturates the hardware, and caching wins come from throughput (more requests/sec) rather than cost (which is sunk). Pair with a good GPU cloud if you're not on-prem.

Frequently Asked Questions

How much does prompt caching actually save in practice?

For workloads with stable 10k+ token prefixes, real bills fall 70-90% on Anthropic, 35-50% on OpenAI, and 60-85% on Gemini 2.5. The variance comes from hit rate — a 95% hit rate on Anthropic caching realizes the full 90% discount; 60% hit rate realizes roughly 55%. Measure before modeling: instrument cache_read_input_tokens on every call and compute savings against your pre-cache baseline.

Does prompt caching work with streaming responses?

Yes on all four providers — caching affects prefill, not output generation, so streaming is orthogonal. The time-to-first-token improvement is actually most visible in streaming UIs because users see the first token arrive 40-80% faster on cache hits. No code change needed; just stream as usual and the cache applies to the input side.

Can I cache parts of a conversation that change each turn?

Yes, but only the stable prefix up to the first change. Anthropic supports up to 4 cache breakpoints, so you can cache system prompt, tool definitions, and retrieved context independently — each turn invalidates only the cache after the first mutated block. Structure: tools (cached) → system (cached) → retrieved docs (cached if stable) → conversation history (not cached beyond the first dynamic message) → new user message.
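For concreteness, that ordering expressed as a request payload. A sketch with placeholder contents; the shape follows the Messages API pattern from the earlier example, with one cache_control breakpoint per stable layer:

```typescript
const STATIC_SYSTEM_PROMPT = '<role definition, policies>';
const RETRIEVED_CONTEXT = '<docs stable for this session>';
const userQuery = 'How do I rotate my API key?';

const request = {
  model: 'claude-opus-4-7-20260115',
  max_tokens: 1024,
  tools: [
    // The last tool carries the breakpoint; everything above it caches too.
    { name: 'search_docs', description: 'Search the doc corpus',
      input_schema: { type: 'object' as const, properties: {} },
      cache_control: { type: 'ephemeral' as const } }, // breakpoint 1
  ],
  system: [
    { type: 'text' as const, text: STATIC_SYSTEM_PROMPT,
      cache_control: { type: 'ephemeral' as const } }, // breakpoint 2
  ],
  messages: [
    { role: 'user' as const, content: [
      { type: 'text' as const, text: RETRIEVED_CONTEXT,
        cache_control: { type: 'ephemeral' as const } }, // breakpoint 3
      { type: 'text' as const, text: userQuery },        // dynamic tail
    ] },
  ],
};

// Anthropic allows at most four breakpoints per request; this uses three.
const breakpoints =
  request.tools.filter(t => t.cache_control).length +
  request.system.filter(s => s.cache_control).length +
  request.messages.flatMap(m => m.content).filter(c => 'cache_control' in c).length;
```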

Is prompt caching the same as RAG caching?

No — different layer entirely. Prompt caching is provider-side KV-cache reuse for identical prompt prefixes. RAG caching is your application caching retrieved documents (vector-DB results) to avoid re-retrieving for similar queries. They compose: cache retrievals in Redis client-side, then use prompt caching on the provider to cache the system prompt. Combining both on a well-tuned agent cuts cost 94% vs baseline.

What is the minimum prompt size for caching to work?

OpenAI requires 1,024 tokens minimum with 128-token hit increments. Anthropic's minimum varies — 1,024 tokens for Haiku, 2,048 for Sonnet and Opus. Gemini has no hard floor on the API but storage billing makes anything under ~4,000 tokens economically dubious. vLLM APC works at any size because its block is 16 tokens, though effective wins only appear when prefixes exceed a few hundred tokens.

Does prompt caching affect model output quality?

No — zero quality impact. The cached KV tensors are mathematically identical to what the model would compute fresh. Every output token still goes through the full generation pipeline; caching only avoids recomputing the prefill stage. This is verifiable: run the same prompt with and without cache_control and the outputs will match for deterministic (temperature=0) requests.

How do I monitor prompt cache hit rate in production?

All providers return per-call cache metrics in the usage payload. Anthropic: cache_creation_input_tokens (write) and cache_read_input_tokens (hit). OpenAI: cached_tokens inside prompt_tokens_details. Gemini: cachedContentTokenCount. Log these to Prometheus or Datadog, compute the ratio hit/(hit+full), and alert on drops below 70% or sudden spikes in cache writes. Dashboarding this in week one pays for itself the first time prompt drift happens.

The Bottom Line

Prompt caching is mature, widely supported, and the highest-ROI optimization available to anyone shipping LLM-backed systems in 2026 — a straight 50-90% cut on the largest line item in most AI bills. The work is 90% prompt hygiene (static-first ordering, byte-identical prefixes, dynamic content pushed to the end) and 10% provider-specific SDK glue. Do the audit, pin the breakpoints, monitor the hit rate, and you'll recover more budget than any other API-level change you can make this quarter. Skip it and you're subsidizing compute the provider is happy to give you back.


Written by

Abhishek Patel

Infrastructure engineer with 10+ years building production systems on AWS, GCP, and bare metal. Writes practical guides on cloud architecture, containers, networking, and Linux for developers who want to understand how things actually work under the hood.
