TechPlained — Tech Insights & Tutorials

The Monday Morning the Bill Arrived

It's 9:47 on a Monday in April 2026. Your Anthropic invoice lands in finance's inbox: $84,312. Last month was $31,900. Your traffic is up maybe 20%. Your CFO forwards it to you with a single line: "Explain." You open the usage dashboard and find the answer in three minutes and four columns -- extended-thinking tokens on Sonnet 4.5, cache writes you thought were cache reads, a cron job that accidentally migrated from Haiku to Opus during a deploy last Tuesday, and 14,000 retries triggered by a rate-limit tier you never upgraded. Each one, individually, looked like noise. Together they tripled the bill.

This is the real shape of LLM cost in 2026. Per-token headline prices are a lie of omission. The bill you actually pay is determined by five orthogonal levers -- input/output ratio, cache hit rate, batch eligibility, reasoning-token overhead, and retry behavior -- and any of them can independently 3-5x your spend. This guide is the map I wish I had the first time I was asked to explain an invoice like that one. Every major provider, every lever, four realistic workload simulations, and the specific pricing gotchas that catch production teams.

The Five Pricing Levers Nobody Advertises

Before the per-token tables, internalize the levers. A provider comparison that skips these is useless.

Lever	What it does to your bill	Typical swing	Who it hurts most
Input/output ratio	Output tokens cost 3-5x input. A 10:1 input:output job and a 1:10 job at the same total tokens differ by 2-3x in price.	2-3x	Code generation, long-form writing, agentic loops
Prompt caching	Cache reads are 10-50% of base input price. Cache writes can be 125% of base. Miss your cache and you pay the write premium every time.	5-10x	RAG with rotating system prompts, per-user personalization
Batch API eligibility	50% off input and output for 24h-tolerant workloads. Zero code change beyond the endpoint.	2x	Teams that ignore async-eligible workloads
Reasoning/thinking tokens	Billed as output but never rendered. 5-20K per call on o3, Opus 4 extended thinking, DeepSeek R1.	3-10x	Anyone who swapped to a reasoning model "to see if it was smarter"
Rate-limit tier	Low tiers = queuing + 429 retries. Retries burn tokens twice. Upgrading the tier often has a $1K-$5K monthly minimum.	1.5-3x	Fast-growing products that outgrow Tier 1-2

A useful mental model: every real LLM invoice is (base_tokens) x (cache_multiplier) x (batch_multiplier) x (reasoning_multiplier) x (retry_multiplier). Multiplicative, not additive. This is why two teams running "the same" workload on the same model can land 4x apart on cost -- and why per-token tables alone never answer the question of what a provider will actually charge you.

Per-Token Pricing: Flagship Models Compared

These are the frontier-class models you would reach for when accuracy and reasoning quality matter most. Prices are per million tokens as of April 2026.

Provider / Model	Input (per 1M)	Output (per 1M)	Context Window	Notes
OpenAI GPT-4.1	$2.00	$8.00	1M	Long-context flagship
OpenAI o3	$10.00	$40.00	200K	Reasoning model; thinking tokens billed as output
OpenAI o4-mini	$1.10	$4.40	200K	Compact reasoning model
Anthropic Claude Opus 4	$15.00	$75.00	200K	Highest capability; extended thinking extra
Anthropic Claude Sonnet 4	$3.00	$15.00	200K	Best quality-to-cost ratio in class
Anthropic Claude Haiku 3.5	$0.80	$4.00	200K	Speed-optimized
Google Gemini 2.5 Pro	$1.25	$10.00	1M	Free tier available (rate-limited)
Google Gemini 2.5 Flash	$0.15	$0.60	1M	Thinking tokens: $0.70/1M output
Mistral Large	$2.00	$6.00	128K	Strong multilingual support
DeepSeek V3	$0.27	$1.10	128K	Cache hits: $0.07/1M input
DeepSeek R1	$0.55	$2.19	128K	Reasoning model; open-weights

Watch out: Reasoning models (o3, o4-mini, DeepSeek R1) generate internal "thinking" tokens that are billed as output tokens but never shown to the user. A single reasoning call can produce 5,000-20,000 thinking tokens before the visible response begins. This makes reasoning models 3-10x more expensive per request than their headline output price suggests.

Inference Hosting Platforms: Open-Source Model Pricing

If you want to run open-weight models (Llama 3.1, Mistral, DeepSeek, Qwen) without managing GPUs, several inference platforms offer per-token pricing that undercuts first-party APIs significantly.

Platform	Model Example	Input (per 1M)	Output (per 1M)	Rate Limits
Together AI	Llama 3.1 405B	$3.50	$3.50	600 RPM default
Together AI	Llama 3.1 70B	$0.88	$0.88	600 RPM default
Fireworks AI	Llama 3.1 405B	$3.00	$3.00	600 RPM, 10M TPM
Fireworks AI	Llama 3.1 70B	$0.90	$0.90	600 RPM, 10M TPM
Groq	Llama 3.1 70B	$0.59	$0.79	30 RPM free, 1000 RPM paid
Groq	Llama 3.3 70B	$0.59	$0.79	Ultra-low latency (LPU)
AWS Bedrock	Claude Sonnet 4	$3.00	$15.00	Provisioned throughput available
GCP Vertex AI	Claude Sonnet 4	$3.00	$15.00	Committed use discounts
AWS Bedrock	Llama 3.1 70B	$0.72	$0.72	On-demand or provisioned

A few things jump out. Open-weight models on third-party platforms cost 60-80% less than frontier proprietary models for comparable quality tiers. Groq's LPU-based inference delivers the lowest latency in the market, but their free-tier rate limits (30 RPM) make them impractical for production without a paid plan. Bedrock and Vertex charge the same per-token rates as the first-party APIs but offer provisioned throughput options that guarantee capacity -- critical for production workloads that can't tolerate 429 errors.

Hidden Costs That Change the Math

Prompt Caching Credits

Both OpenAI and Anthropic offer prompt caching -- if your requests share a common prefix (system prompt, few-shot examples, or large document context), the cached portion is billed at a reduced rate. OpenAI's cached input tokens cost 50% of standard input price. Anthropic charges 90% less for cache reads but adds a 25% surcharge for cache writes (the first request that populates the cache). DeepSeek offers an automatic cache with 90% discount on cache hits.

For a RAG application sending a 4,000-token system prompt on every request, caching can reduce system-prompt costs by 50-90%. But if your prompts vary significantly between requests and cache hit rates stay below 30%, the write surcharge on Anthropic can actually increase your costs.

Batch API Discounts

OpenAI's Batch API offers 50% off input and output tokens for requests that can tolerate 24-hour turnaround. Anthropic's Message Batches API similarly provides 50% discounts. If your workload is offline processing -- document classification, bulk summarization, content moderation at scale -- batch APIs cut costs in half with zero code changes beyond swapping the endpoint.

Fine-Tuning Hosting Costs

Fine-tuning is not just a training cost. OpenAI charges $25/1M training tokens for GPT-4o fine-tuning, but the ongoing hosting cost is the real expense -- fine-tuned GPT-4o inference costs $3.75/1M input and $15.00/1M output, a 50% premium over the base model. Anthropic does not offer public fine-tuning. Google's Gemini fine-tuning is priced per compute-hour during training, with no inference surcharge on tuned models.

Rate Limits and Throughput Tiers

Rate limits are a stealth cost. OpenAI's free tier caps at 500 RPM and 30,000 TPM. Tier 5 (requires $1,000+ cumulative spend) provides 10,000 RPM and 12M TPM. If you need guaranteed throughput above published limits, you're looking at provisioned throughput contracts -- Anthropic's start at 1-month commitments, and AWS Bedrock offers per-model-unit provisioning starting around $50/hour for Claude Sonnet.

Four Workloads Modeled: Real Monthly Costs

Headline per-token rates are meaningless without workload context. Here are four common production patterns with estimated monthly costs across providers.

Workload 1: Customer Support Chatbot

Specs: 50,000 conversations/month, avg 4 turns each, 1,500 input tokens per turn (system prompt + history + retrieval), 400 output tokens per turn. System prompt cacheable.

Provider / Model	Input Cost	Output Cost	Cache Savings	Monthly Total
OpenAI GPT-4.1	$600	$640	-$180 (cache)	$1,060
Anthropic Sonnet 4	$900	$1,200	-$405 (cache)	$1,695
Anthropic Haiku 3.5	$240	$320	-$108 (cache)	$452
Gemini 2.5 Flash	$45	$48	-$20 (cache)	$73
Llama 3.1 70B (Together)	$264	$70	N/A	$334

Gemini 2.5 Flash dominates on cost, but quality matters for customer-facing interactions. Many teams run Haiku 3.5 or GPT-4.1 for quality, with Flash as a fallback for simple queries -- a model-routing strategy that can cut costs by 40-60% without sacrificing quality on hard questions.

Workload 2: Code Generation Pipeline

Specs: 10,000 requests/day, 2,000 input tokens (context + instructions), 1,500 output tokens (generated code). No caching benefit (varied inputs).

Provider / Model	Input Cost	Output Cost	Monthly Total
OpenAI GPT-4.1	$1,200	$3,600	$4,800
Anthropic Sonnet 4	$1,800	$6,750	$8,550
Anthropic Opus 4	$9,000	$33,750	$42,750
Gemini 2.5 Pro	$750	$4,500	$5,250
DeepSeek V3	$162	$495	$657

Output-heavy workloads expose the asymmetry between input and output pricing. DeepSeek V3 is an order of magnitude cheaper, but latency and reliability may be concerns for latency-sensitive pipelines. GPT-4.1 offers the best price-to-capability balance for production code generation among frontier models.

Workload 3: Document Summarization (Batch)

Specs: 100,000 documents/month, avg 8,000 input tokens each, 500 output tokens per summary. Batch API eligible.

Provider / Model	Standard Cost	Batch Cost (50% off)	Monthly Savings
OpenAI GPT-4.1	$2,000	$1,000	$1,000
Anthropic Sonnet 4	$3,150	$1,575	$1,575
Anthropic Haiku 3.5	$840	$420	$420
Gemini 2.5 Flash	$150	N/A	N/A

Batch APIs make a significant difference at scale. Haiku 3.5 in batch mode costs half of what Gemini Flash costs at standard rates. If turnaround time is flexible, always evaluate batch pricing first.

Workload 4: Agentic Workflow (Multi-Step Reasoning)

Specs: 5,000 tasks/month, avg 8 LLM calls per task (tool use, chain-of-thought), 3,000 input tokens and 2,000 output tokens per call. Reasoning model with thinking tokens.

Provider / Model	Visible Token Cost	Thinking Token Cost	Monthly Total
OpenAI o3	$4,400	~$12,800	$17,200
OpenAI o4-mini	$1,760	~$4,400	$6,160
Anthropic Sonnet 4 (ext. thinking)	$2,520	~$6,000	$8,520
DeepSeek R1	$792	~$2,200	$2,992
Gemini 2.5 Flash (thinking)	$120	~$560	$680

Agentic workloads multiply costs because every tool call or chain step is a separate inference call, and reasoning models add hidden thinking-token overhead. Gemini 2.5 Flash with thinking mode is the most cost-effective option for autonomous agent loops, though o3 and Sonnet 4 remain the quality leaders for complex multi-step reasoning tasks.

Cost Optimization Strategies

Prompt Caching

If your system prompt exceeds 1,024 tokens and you're sending more than 100 requests per minute, caching should be your first optimization. Anthropic's cache requires a minimum prefix of 1,024 tokens for Haiku and 2,048 for Sonnet/Opus. OpenAI's automatic caching activates on prompts over 1,024 tokens with no code changes needed. At 50,000 daily requests, caching a 3,000-token system prompt saves $150-$450/month depending on the model.

Batch Processing

Move any workload that can tolerate latency to batch APIs. Document processing, content moderation, data extraction, and evaluation pipelines are natural candidates. The 50% discount is the single biggest cost lever available. Structure your pipeline to queue requests and submit batches on a 1-6 hour cadence for near-real-time results at batch pricing.

Model Routing

Route requests to different models based on complexity. A classifier (or even a regex-based heuristic) can identify simple queries that a cheap model handles well versus complex ones that need a frontier model. A typical routing split is 60% Flash/Haiku, 30% Sonnet/GPT-4.1, and 10% Opus/o3. This alone can reduce costs by 50-70% compared to routing everything through a single frontier model.

Self-Hosting Break-Even Analysis

Self-hosting open-weight models makes economic sense at scale. The break-even point depends on GPU costs and utilization. Running Llama 3.1 70B on a single A100 80GB (roughly $2/hour on cloud GPU providers) supports about 15-30 requests per second at short context. At 1 million requests per month, self-hosting costs approximately $1,440/month versus $880 on Together AI for the same model. Self-hosting wins only when you exceed 2-3 million requests per month, maintain >70% GPU utilization, and have the engineering capacity to operate inference infrastructure. Below that threshold, managed APIs are cheaper when you factor in engineering time.

Failure Modes: How Teams Actually Overspend

Cache-write accounting surprise. Anthropic's cache writes cost 1.25x the base input rate. If your system prompt changes on every request -- even by a timestamp -- you pay the 1.25x premium every time and never hit the 0.1x read rate. Teams that include Date.now() or a request ID in the cached prefix effectively opt into a permanent 25% input-cost surcharge and believe they are saving money.

Retry loops billed twice. OpenAI returns 429 with a Retry-After header when you exceed TPM. Many SDK wrappers retry automatically with exponential backoff. Every retry that eventually succeeds costs double. I have seen a misconfigured batch ingestion job retry 9,400 times in an afternoon against a Tier 2 account -- most of those retries were themselves over-tier and 429'd, so the account burned ~$380 on failed inference. Monitor retries as a first-class cost metric.

Hidden reasoning-token blow-up. Switching from GPT-4.1 to o3 "to see if it helps" looks like a 5x price jump ($2/$8 -> $10/$40). It is actually closer to 15-20x because o3 emits 5-20K thinking tokens per call even for short answers. One team I advised moved their summarization workflow from 4.1 to o3 and saw a 17x bill increase in 24 hours.

Agent loops without a token budget. An autonomous agent that reads 30 tool outputs before deciding will happily burn $3-$5 per task on context alone. Without a hard max_tokens_per_task ceiling, a runaway agent can empty a $10K/month budget in a weekend.

# Instrumenting a real cost ceiling in Python
# Wrap every call so you can kill a task before it drains the budget.
import os, time
from anthropic import Anthropic

client = Anthropic()

INPUT_PRICE = {"claude-sonnet-4-5": 3.0 / 1_000_000}   # $/token
OUTPUT_PRICE = {"claude-sonnet-4-5": 15.0 / 1_000_000}
CACHE_READ = {"claude-sonnet-4-5": 0.30 / 1_000_000}
CACHE_WRITE = {"claude-sonnet-4-5": 3.75 / 1_000_000}

def cost_of(usage, model):
    return (
        usage.input_tokens * INPUT_PRICE[model]
        + usage.output_tokens * OUTPUT_PRICE[model]
        + getattr(usage, "cache_read_input_tokens", 0) * CACHE_READ[model]
        + getattr(usage, "cache_creation_input_tokens", 0) * CACHE_WRITE[model]
    )

def call_with_budget(messages, model, task_budget_usd=0.25, spent_so_far=0.0):
    if spent_so_far >= task_budget_usd:
        raise BudgetExceeded(
            f"Task spent USD {spent_so_far:.2f} of {task_budget_usd:.2f} budget"
        )
    resp = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=messages,
    )
    spend = cost_of(resp.usage, model)
    return resp, spent_so_far + spend

Prompt Caching: The One Optimization That Pays Back in Hours

Of every lever in this guide, prompt caching has the best payback curve. It is a 30-minute code change for a 40-80% cut on input costs in most RAG and chatbot workloads. The trick is structuring your prompt so the cacheable prefix is stable.

// Anthropic caching -- mark the cacheable prefix explicitly.
// Everything before the cache_control breakpoint is cached; everything after is live.
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const SYSTEM_PROMPT = readFileSync("./system_prompt.md", "utf8"); // ~4,500 tokens, stable
const TOOL_DEFS = JSON.stringify(tools);                           // ~2,100 tokens, stable

const resp = await client.messages.create({
  model: "claude-sonnet-4-5",
  max_tokens: 1024,
  system: [
    { type: "text", text: SYSTEM_PROMPT, cache_control: { type: "ephemeral" } },
    { type: "text", text: TOOL_DEFS,      cache_control: { type: "ephemeral" } },
  ],
  messages: [
    { role: "user", content: userQuery }, // varies per request -- not cached
  ],
});

// Inspect the usage to confirm you are hitting the cache.
console.log({
  cache_creation: resp.usage.cache_creation_input_tokens, // one-time, 1.25x price
  cache_read:     resp.usage.cache_read_input_tokens,     // 0.1x price -- this is the win
  input:          resp.usage.input_tokens,                 // only the non-cached tail
});

The first request populates the cache at 1.25x. Every request within the 5-minute TTL hits at 0.1x. At 200 requests per minute across a 6,600-token prefix, the savings run $28-$35 per hour versus naive request construction. Over a production month that is $20K-$25K back in your pocket for a single afternoon of work.

Frequently Asked Questions

Which LLM API is cheapest for production use in 2026?

It depends on the workload. For high-volume, quality-flexible tasks, Gemini 2.5 Flash offers the lowest per-token rates among production-grade models at $0.15/$0.60 per million input/output tokens. For tasks requiring frontier reasoning quality, OpenAI GPT-4.1 at $2/$8 offers the best cost-to-capability ratio. DeepSeek V3 is cheapest in absolute terms but has higher latency and availability concerns outside of China-region endpoints.

How do reasoning model costs compare to standard models?

Reasoning models (OpenAI o3/o4-mini, DeepSeek R1, Gemini with thinking) generate internal thinking tokens billed as output. A typical reasoning call produces 5,000-20,000 thinking tokens in addition to the visible response. This makes the effective per-request cost 3-10x higher than standard models. Use reasoning models selectively for tasks that genuinely benefit from multi-step logic -- math, complex analysis, planning -- and route simpler tasks to standard models.

Is prompt caching worth implementing?

Yes, if you meet two conditions: your system prompt or shared context exceeds the minimum cacheable length (1,024-2,048 tokens depending on provider), and you send at least 50-100 requests per minute to maintain cache residency. At high request volumes with a 4,000-token system prompt, caching saves 50-90% on input token costs for the cached portion. For low-volume or highly variable prompts, cache hit rates drop below 20% and the benefit disappears.

Should I use Bedrock or Vertex instead of direct APIs?

Per-token pricing is identical on Bedrock and Vertex compared to direct Anthropic or Google APIs. The value proposition is operational: unified billing through your existing cloud account, VPC endpoints for network isolation, provisioned throughput guarantees, compliance certifications (HIPAA, SOC 2) inherited from the cloud provider, and consolidated IAM. If you're already on AWS or GCP and need enterprise controls, the platform markup is zero on per-token rates -- the cost is purely in the cloud infrastructure (VPC endpoints, NAT gateways) surrounding it.

When does self-hosting LLMs become cheaper than API access?

The break-even point for self-hosting a 70B parameter model is roughly 2-3 million requests per month, assuming you achieve 70%+ GPU utilization and have SRE capacity to maintain the inference stack. Below that volume, managed APIs are almost always cheaper when you account for GPU idle time, engineering overhead, and the operational cost of model updates. Fine-tuned models shift this calculation -- if you need a custom model that no provider hosts, self-hosting is your only option regardless of cost.

How much do rate limits actually cost in practice?

Rate limits impose indirect costs through queuing delays, dropped requests, and over-provisioning. If your application needs 5,000 RPM but your tier caps at 3,500 RPM, you either queue (adding latency), distribute across multiple API keys (adding complexity), or upgrade your tier (often requiring minimum monthly commits of $1,000-$5,000). Some teams run parallel accounts across providers as a rate-limit arbitrage strategy -- routing overflow traffic to a secondary provider during peak hours.

What is the most cost-effective approach for agentic AI workloads?

Model routing combined with caching is the most effective strategy. Use a cheap, fast model (Gemini Flash, Haiku 3.5) for tool-use orchestration and simple decisions within the agent loop, and escalate to a frontier reasoning model (o3, Sonnet 4) only for steps that require complex analysis. Cache the agent's system prompt and tool definitions aggressively. This hybrid approach typically costs 60-75% less than routing all agent steps through a single frontier model, with minimal quality degradation on the orchestration steps.

The Bottom Line on LLM API Costs

Per-token pricing is the starting point, not the answer. Your actual LLM spend is determined by the interaction of model choice, caching behavior, output-to-input ratio, batch eligibility, and rate-limit tier. The providers with the lowest headline rates (DeepSeek, Gemini Flash) are not always cheapest in practice -- latency constraints, reliability requirements, and quality thresholds push many production workloads toward mid-tier pricing.

Start by profiling your workload: measure your input/output token ratio, cache hit potential, batch eligibility percentage, and peak request rate. Then model costs across 3-4 providers using the tables above. The teams that treat LLM cost optimization as an ongoing practice -- not a one-time provider selection -- consistently spend 50-70% less than those that pick a model and never revisit the decision.

LLM API Pricing Compared (2026): OpenAI vs Anthropic vs Google vs Open Source