Self-Hosted LLM Cost: Hardware vs Cloud vs API (2026)

Self-Hosted LLM Cost: The Crossover Points That Decide Build vs Rent vs API

Three options compete for any LLM-backed workload: own the hardware, rent cloud GPU, or pay an API per token. The math isn't subtle. Below 3 million tokens per day, the API wins on TCO every time — your hardware sits idle, your cloud-rental dollars buy nothing useful. Between 3M and 30M daily tokens, cloud GPU rental wins because spot pricing is cheap and you don't carry obsolescence risk. Above 30M daily tokens sustained for 12+ months, owned hardware pays back within 18-24 months versus equivalent cloud rental. These crossovers shift slightly with the model size and provider mix, but the structure is robust.

Daily token volume	Best TCO	Why	12-month cost (Qwen 32B-class)
Under 1M	API (Claude Haiku, GPT mini, Kimi)	Hardware idle 95%+ of the time	$200-1,200
1M - 3M	API (DeepSeek, GLM-5.1, Qwen via Together)	Cheap-tier APIs still beat capex	$1,000-4,500
3M - 10M	Cloud GPU spot (RunPod / Vast)	Spot rentals beat sustained API	$3,500-12,000
10M - 30M	Cloud GPU reserved (Lambda / AWS p4d)	Predictable workloads = reserved discount	$8,000-30,000
30M - 100M	Owned hardware (RTX 6000 Ada / dual 4090)	Hardware paid back in 18-24 mo	$8,000 capex + $1,200/yr power
100M+ sustained	Datacenter cluster (8x H100 / H200)	Co-locate, amortize over 3-4 years	$80,000-200,000 capex
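
As a rough sketch, the volume bands in the table above can be written as a lookup. The function name and tier labels are illustrative, and treating each upper bound as exclusive is an assumption, since the table does not say which band a boundary value belongs to.

```python
# Volume bands copied from the table above. Upper bounds are treated as
# exclusive, which is an assumption: the table leaves boundaries ambiguous.
TIERS = [
    (1_000_000, "API (premium small models)"),
    (3_000_000, "API (cheap open-model tier)"),
    (10_000_000, "Cloud GPU spot"),
    (30_000_000, "Cloud GPU reserved"),
    (100_000_000, "Owned hardware"),
]


def best_tco_tier(tokens_per_day: int) -> str:
    """Return the table's suggested option for a sustained daily volume."""
    for upper_bound, tier in TIERS:
        if tokens_per_day < upper_bound:
            return tier
    return "Datacenter cluster"


for volume in (500_000, 8_000_000, 50_000_000):
    print(f"{volume:>12,} tokens/day -> {best_tco_tier(volume)}")
```

The three sample volumes are the ones used in the workload scenarios later in this guide.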

Last updated: April 2026 — verified consumer GPU street prices on NVIDIA channels, RunPod / Vast.ai / Lambda 2026 spot rates, AWS p4d / p5 on-demand, Anthropic / OpenAI / DeepSeek / Zhipu / Together API pricing pages.

The Real Cost of Owning Hardware

The sticker price on a GPU is the smallest part of the TCO. The four real costs:

1. Hardware capex

Build tier	GPU(s)	Total system cost	Comfortable model
Solo developer	1x RTX 4090 24 GB	$2,500-3,500	Qwen 3.5 14B Q5_K_M / 32B Q4_K_M
Small team	1x RTX 5090 32 GB	$3,500-4,500	Qwen 3.5 32B Q5_K_M (NVFP4)
Workstation	2x RTX 4090 / 1x RTX 6000 Ada	$6,000-9,500	Qwen 3.5 72B Q4_K_M / 32B FP16
Apple unified	M3 Ultra 128 GB Mac Studio	$5,000-6,500	Qwen 3.5 72B Q4 / 122B-A10B MoE
Datacenter-grade	4x H100 80GB SXM cluster	$80,000-130,000	DeepSeek V4, GLM-5.1, frontier models

2. Power costs

An RTX 4090 runs at 350-450W under sustained inference; an RTX 5090 at 575W; an H100 at 700W per card. At $0.15/kWh (US average) and 50% duty cycle (12 hours/day at full tilt), that's $230-380/year per consumer GPU. Datacenter GPUs are dramatically worse on raw power, but co-location facilities typically include power in the rack rate.
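
The power figures above follow from watts, duty cycle, and electricity rate. A minimal sketch of that arithmetic (the function name is illustrative, and the defaults are the $0.15/kWh and 50% duty cycle quoted above):

```python
def annual_power_cost(watts: float, duty_cycle: float = 0.5,
                      usd_per_kwh: float = 0.15) -> float:
    """Yearly electricity cost for one GPU drawing `watts` while active."""
    active_hours = 24 * 365 * duty_cycle
    return watts / 1000 * active_hours * usd_per_kwh


# Wattages are the ones quoted in the paragraph above.
for name, watts in [("RTX 4090 (low)", 350), ("RTX 4090 (high)", 450),
                    ("RTX 5090", 575), ("H100", 700)]:
    print(f"{name:<16} ${annual_power_cost(watts):,.0f}/year")
```

This counts GPU draw only. CPU, fans, PSU losses, and idle draw during the other 12 hours are left out, so real bills will run somewhat higher.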

3. Opportunity cost of capital

$8,000 in workstation hardware sitting in a closet is $8,000 not earning the 4-5% you'd get in a savings account. Over 3 years that's $1,000-1,200 of foregone interest. Worth modeling explicitly when you're comparing capex vs operational rental.
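
To model this explicitly, a small sketch like the following may help. The function name is illustrative, and annual compounding is an assumption; simple interest gives slightly lower numbers.

```python
def foregone_interest(capex: float, annual_rate: float, years: int,
                      compound: bool = True) -> float:
    """Interest the capex would have earned if left in savings instead."""
    if compound:
        return capex * ((1 + annual_rate) ** years - 1)
    return capex * annual_rate * years


for rate in (0.04, 0.05):
    simple = foregone_interest(8_000, rate, 3, compound=False)
    compounded = foregone_interest(8_000, rate, 3)
    print(f"{rate:.0%}: simple ${simple:,.0f}, compounded ${compounded:,.0f}")
```

Both methods land close to the $1,000-1,200 range quoted above for $8,000 over 3 years.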

4. Obsolescence

This is the silent killer. Consumer GPUs depreciate 30-40% in the first 18 months as new generations ship. RTX 5090 today; RTX 6090 at higher VRAM and new quant formats in 2027; RTX 7090 with whatever architectural shift in 2028. If you're running a single workstation for personal use you can ignore this — the GPU stays useful. For team-scale build-outs, plan for a 3-year hardware refresh cycle.

Watch out: The 24 GB consumer card sweet spot is unstable. NVIDIA has hinted that future generations may push consumer VRAM down (it's a margin lever), and frontier model sizes keep growing. The hardware that runs Qwen 3.5 32B today may not run Qwen 4.5's equivalent in 2027. Capex math should assume 18-24 month software obsolescence, not 5-year hardware lifespan.

Cloud GPU Rental: The Middle-Volume Sweet Spot

For 3-30M tokens/day, neither owned hardware nor a commodity API is the right answer. The RunPod vs Vast.ai comparison covers the spot-marketplace landscape; here are the rates that drive the math:

Provider	GPU	Spot $/hr	On-demand $/hr	Reserved (yearly) $/hr
RunPod	RTX 4090 24 GB	$0.34	$0.69	$0.45
RunPod	A100 80 GB	$1.19	$1.89	$1.25
RunPod	H100 80 GB	$1.99	$2.99	$2.20
Vast.ai	RTX 4090 24 GB	$0.22-0.40	—	—
Vast.ai	H100 80 GB	$1.50-2.30	—	—
Lambda Labs	H100 80 GB	—	$2.49	$1.99 (reserved)
AWS p4d.24xlarge (8x A100)	per instance	$8.20 (spot)	$32.77	$19.66 (1yr Savings Plan)
AWS p5.48xlarge (8x H100)	per instance	—	$98.32	$58.99 (1yr Savings Plan)
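
The AWS rows are priced per 8-GPU instance, so they need dividing by GPU count before comparing with the per-GPU marketplace rates. A rough sketch using only figures from the table above (helper names are illustrative):

```python
# Rates copied from the table above (USD per hour).
RUNPOD_SPOT = {"A100": 1.19, "H100": 1.99}
AWS_ON_DEMAND = {
    "A100": ("p4d.24xlarge", 32.77, 8),  # (instance, $/hr, GPU count)
    "H100": ("p5.48xlarge", 98.32, 8),
}


def per_gpu_rate(instance_usd_per_hour: float, gpu_count: int) -> float:
    """Normalize an instance-level hourly price to a per-GPU price."""
    return instance_usd_per_hour / gpu_count


def discount_vs(reference: float, candidate: float) -> float:
    """Fraction by which `candidate` undercuts `reference`."""
    return 1 - candidate / reference


for gpu, (instance, hourly, count) in AWS_ON_DEMAND.items():
    aws = per_gpu_rate(hourly, count)
    saving = discount_vs(aws, RUNPOD_SPOT[gpu])
    print(f"{gpu}: {instance} ${aws:.2f}/GPU-hr vs RunPod spot "
          f"${RUNPOD_SPOT[gpu]:.2f} ({saving:.0%} cheaper)")
```

The comparison ignores what the AWS instance bundles in (CPU, RAM, networking, local NVMe), so treat it as a raw GPU-hour comparison only.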

The "always-on" trap

The cheapest way to lose money on cloud GPU is to rent on-demand and forget to turn it off. A single RTX 4090 on RunPod on-demand running 24/7 is $497/month — well over $5,900/year. The same workload on Vast.ai spot at $0.30/hr with 50% utilization is $1,314/year. Always rent spot for non-prod, always set auto-stop policies for prod, always use reserved tiers if your workload is genuinely 24/7.
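
The gap between always-on and duty-cycled rental is simple multiplication. A minimal sketch using the rates quoted above (the function name is illustrative):

```python
def rental_cost(usd_per_hour: float, hours_per_day: float, days: int) -> float:
    """Total rental spend for a GPU billed by the hour."""
    return usd_per_hour * hours_per_day * days


always_on_month = rental_cost(0.69, 24, 30)   # RunPod on-demand, never stopped
always_on_year = rental_cost(0.69, 24, 365)
spot_half_year = rental_cost(0.30, 12, 365)   # Vast.ai spot, 50% utilization

print(f"On-demand 24/7: ${always_on_month:,.0f}/month, ${always_on_year:,.0f}/year")
print(f"Spot at 50%:    ${spot_half_year:,.0f}/year")
```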

API Pricing: Where the Per-Token Math Pays Off

The 2026 API landscape is dramatically cheaper than 2024. The Qwen 3.5 32B equivalent through Together AI costs $0.30 input / $0.90 output per 1M tokens, pricing that was unthinkable two years ago. The full LLM API pricing comparison covers the broader provider matrix; here are the cost-effective tier picks for self-hosted-equivalent quality:

Model	Provider	Input $/1M	Output $/1M	Cache discount
Qwen 3.5 32B Instruct	Together AI	$0.30	$0.90	—
Qwen 3.5 72B Instruct	Together AI	$0.90	$2.70	—
DeepSeek V4	DeepSeek direct	$0.27	$1.10	75%
GLM-5.1	Zhipu AI	$0.50	$1.50	50%
Kimi K2.6	Moonshot	$0.30	$1.00	—
Claude Haiku 4.5 (ref)	Anthropic	$1.00	$5.00	90%
Claude Sonnet 4.6 (ref)	Anthropic	$3.00	$15.00	90%
GPT-5.4 (ref)	OpenAI	$2.50	$10.00	50%

The directional message: open-model API pricing has converged on roughly $0.30 input and $0.90-1.10 output per 1M tokens for Qwen-32B-class quality (GLM-5.1 sits a little higher at $0.50/$1.50). On output tokens, that's roughly 5-15x cheaper than Anthropic/OpenAI for similar capability on most tasks. For self-hosting to pay back, you need either substantially higher volume than the API breakeven point, or a hard requirement (data sovereignty, sub-10ms latency, custom fine-tunes) that APIs can't satisfy.

Real Workload Math: Three Scenarios

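Calculations below assume an even split between input and output tokens and price every token at the average of the two rates, for simplicity (real workloads run anywhere from 30% to 70% input depending on task). For DeepSeek V4 that blended rate is ($0.27 + $1.10) / 2 ≈ $0.69 per 1M tokens. Numbers are 12-month TCO.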
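
The $0.69/1M rate used in the scenarios corresponds to an even input/output mix at DeepSeek V4's listed prices. A small sketch shows how much the blended rate moves across the 30-70% input range (the function name is illustrative):

```python
def blended_price(input_usd_per_m: float, output_usd_per_m: float,
                  input_share: float = 0.5) -> float:
    """Average price per 1M tokens for a given input/output token mix."""
    return input_share * input_usd_per_m + (1 - input_share) * output_usd_per_m


# DeepSeek V4 rates from the API pricing table: $0.27 input, $1.10 output.
for share in (0.3, 0.5, 0.7):
    rate = blended_price(0.27, 1.10, share)
    print(f"{share:.0%} input -> ${rate:.3f} per 1M tokens")
```

Output-heavy workloads (long generations from short prompts) sit at the expensive end of that range, so the API totals below shift accordingly.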

Scenario A: Solo developer, 500K tokens/day

API (DeepSeek V4): 500K × 365 = 182M tokens/year × $0.69/1M average = $126/year
Cloud GPU: 1x RTX 4090 spot at $0.30/hr × 4 hrs/day actual use × 365 = $438/year
Owned hardware: $3,000 RTX 4090 build + $230 power = $3,230 year 1
Verdict: API wins by about 25x over owned hardware and about 3.5x over cloud spot. Don't even consider hardware at this volume.

Scenario B: Small team, 8M tokens/day

API (DeepSeek V4): 8M × 365 = 2.92B tokens/year × $0.69/1M = $2,015/year
Cloud GPU spot: 1x RTX 4090 at $0.30/hr × 12 hrs/day × 365 = $1,314/year
Owned hardware: $3,000 RTX 4090 build + $230 power = $3,230 year 1, $230 year 2
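Verdict: Cloud GPU spot wins year 1 by ~$700 over the API. If volume holds, owned hardware overtakes the API on cumulative cost during year 2 ($3,460 vs $4,030 after 24 months) and overtakes cloud spot late in year 3 ($3,690 vs $3,942 after 36 months).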

Scenario C: Growth-stage startup, 50M tokens/day

API (DeepSeek V4): 50M × 365 = 18.25B tokens/year × $0.69/1M = $12,593/year
Cloud GPU reserved: 1x A100 at $1.25/hr × 24 hrs × 365 = $10,950/year
Owned hardware: $9,500 RTX 6000 Ada workstation + $400 power = $9,900 year 1, $400 year 2-3
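Verdict: On these numbers owned hardware is already the cheapest option in year 1 ($9,900 vs $10,950 reserved vs $12,593 API), and the gap widens from year 2, when it costs only $400/yr to run. Cloud reserved is a reasonable year-1 choice if you want to avoid capex; API is fine until you scale to higher concurrency.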
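
The three scenarios can be compared on cumulative cost with a short sketch. It uses the annual figures from the scenarios above and assumes flat pricing and volume, with no hardware refresh or resale value. The function and key names are illustrative.

```python
from typing import Optional

# Annual figures from the three scenarios above (USD).
SCENARIOS = {
    "A (500K/day)": {"api": 126, "cloud": 438, "capex": 3_000, "power": 230},
    "B (8M/day)": {"api": 2_015, "cloud": 1_314, "capex": 3_000, "power": 230},
    "C (50M/day)": {"api": 12_593, "cloud": 10_950, "capex": 9_500, "power": 400},
}


def cumulative_costs(s: dict, years: int) -> dict:
    """Total spend per option after `years`, assuming flat prices and volume."""
    return {
        "api": s["api"] * years,
        "cloud": s["cloud"] * years,
        "owned": s["capex"] + s["power"] * years,
    }


def breakeven_years(capex: float, power: float,
                    alternative_annual: float) -> Optional[float]:
    """Years until owned hardware is cheaper than an alternative; None if never."""
    annual_saving = alternative_annual - power
    return capex / annual_saving if annual_saving > 0 else None


for name, s in SCENARIOS.items():
    vs_api = breakeven_years(s["capex"], s["power"], s["api"])
    vs_cloud = breakeven_years(s["capex"], s["power"], s["cloud"])
    print(name, cumulative_costs(s, 3), "breakeven vs API:", vs_api,
          "vs cloud:", vs_cloud)
```

A `None` breakeven means the alternative costs less per year than the hardware's power bill alone, which is the case for the API in Scenario A.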

Pro tip: The TCO math gets dramatically better when you can use a smaller model for 80% of work. A Qwen 3.5 9B on a $1,500 RTX 4060 Ti 16GB build handles routine completions, and you only invoke the API for the hard 20%. This "tiered" architecture often beats both pure-API and pure-hardware on cost — covered in the best GPU for LLMs analysis.
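
A back-of-the-envelope sketch of the tiered setup, under stated assumptions: the $108/year power figure assumes a roughly 165 W card at 50% duty and $0.15/kWh, the 8M tokens/day volume is borrowed from Scenario B, and the local card is assumed to have enough throughput for its 80% share. None of these are measured values.

```python
def tiered_annual_cost(tokens_per_day: float, api_share: float,
                       api_usd_per_m: float, local_power_usd: float) -> float:
    """Yearly running cost when only `api_share` of tokens go to the API."""
    api_tokens_m = tokens_per_day * 365 * api_share / 1_000_000
    return api_tokens_m * api_usd_per_m + local_power_usd


LOCAL_CAPEX = 1_500   # RTX 4060 Ti 16GB build, from the tip above
LOCAL_POWER = 108     # assumption: ~165 W at 50% duty, $0.15/kWh

running = tiered_annual_cost(8_000_000, 0.2, 0.69, LOCAL_POWER)
pure_api = tiered_annual_cost(8_000_000, 1.0, 0.69, 0)

print(f"Tiered year 1: ${LOCAL_CAPEX + running:,.0f}, later years: ${running:,.0f}")
print(f"Pure API:      ${pure_api:,.0f} every year")
```

Under these assumptions the tiered setup roughly matches pure API in year 1 and costs about a quarter as much afterwards.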

Hidden Costs Nobody Mentions

API hidden costs

Rate limits: Hitting the per-minute or per-day cap forces you to either pay for a higher tier or queue your work. Anthropic and OpenAI tiers escalate aggressively with usage.
Cold-start latency: Some providers have meaningful TTFT variance (200ms-2s). For interactive UX this requires caching or fallback paths you have to build and maintain.
Vendor risk: APIs can be deprecated, repriced, or rate-limited. Anthropic deprecated Claude 3 Sonnet in 2025; OpenAI repeatedly changes per-minute caps. Migration costs aren't zero.

Cloud GPU hidden costs

Spot interruption: Spot instances can disappear with 30-second notice. For batch work this is fine; for interactive, you need fallback infrastructure.
Data egress: AWS charges $0.09/GB outbound; pulling a 100GB model fine-tune dataset out of S3 costs $9 every time. Co-locate compute and storage in the same region.
Setup time: Each rental session needs container pull, model load, warm-up. For sub-1-hour sessions this is meaningful overhead.

Owned hardware hidden costs

Operational overhead: Drivers, llama.cpp updates, CUDA versions, kernel updates that break your inference setup. Realistically 2-4 hours/month of engineer time just to keep the rig healthy.
Cooling and physical space: An RTX 4090 puts out 1500 BTU/hr under load. In a small office without dedicated cooling, that's a problem.
Single-point-of-failure risk: Workstation goes down, your inference pipeline goes down. Need a fallback API key for outages.

The Build-or-Buy Decision Framework

Run through these questions in order. The answer to the first one that triggers is your call.

1. Is your daily token volume under 3M and likely to stay there for 12+ months? → API. Don't overthink it.
2. Do you have a hard data-sovereignty requirement (regulated industry, EU GDPR strict, India DPDP Act, China)? → Owned hardware or self-hosted on rented hardware in the right jurisdiction. Use the India self-hosting guide for India-specific options.
3. Is your workload bursty (spike to high volume, then quiet)? → Cloud GPU spot. Don't pay for capacity you don't use.
4. Is your workload steady high volume (sustained 24/7)? → Cloud GPU reserved (year 1) or owned hardware (year 2+). Crossover is roughly 20-30M tokens/day.
5. Are you doing custom fine-tuning at scale? → Owned hardware or rented GPU clusters. Fine-tuning needs full-VRAM access that APIs don't provide.
6. Default for everything else: API. The math has shifted heavily toward APIs in 2026; the bar for self-hosting is higher than it was in 2024.
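
The questions above can be encoded as a first-match rule chain. This is a sketch only: the `Workload` fields and the `recommend` function are illustrative names, and the boolean flags stand in for judgment calls a real assessment would need.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    tokens_per_day: int
    stable_12_months: bool = True
    data_sovereignty: bool = False
    bursty: bool = False
    steady_24x7: bool = False
    fine_tuning_at_scale: bool = False


def recommend(w: Workload) -> str:
    """Return the answer to the first question that triggers, in order."""
    if w.tokens_per_day < 3_000_000 and w.stable_12_months:
        return "API"
    if w.data_sovereignty:
        return "Owned or rented hardware in the right jurisdiction"
    if w.bursty:
        return "Cloud GPU spot"
    if w.steady_24x7:
        return "Cloud GPU reserved (year 1) or owned hardware (year 2+)"
    if w.fine_tuning_at_scale:
        return "Owned hardware or rented GPU clusters"
    return "API"


print(recommend(Workload(tokens_per_day=500_000)))
print(recommend(Workload(tokens_per_day=40_000_000, steady_24x7=True)))
```

Because the first match wins, a low-volume workload with a sovereignty requirement still returns "API" here. If sovereignty is a hard constraint for you, check it first.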

Pro tip: If you're on the fence, run a 30-day API pilot with full token logging before committing to hardware. Real production volume is almost always lower than estimates, and the pilot data tells you exactly which scenario you're in. I send the advanced cost-modeling spreadsheets I use for client engagements to newsletter subscribers.

Frequently Asked Questions

When does self-hosting LLMs pay off?

Self-hosting on owned hardware pays back versus cloud rental at roughly 30M tokens/day sustained for 18-24 months. Below that volume, cloud GPU rental or API consumption is cheaper. Self-hosting also wins when you have hard data-sovereignty requirements, sub-10ms latency needs, or custom fine-tunes — regardless of volume.

How much does it cost to host a private LLM?

Solo developer hardware: $2,500-4,500 capex (RTX 4090 / 5090 build) plus $230-380/year power. Workstation tier: $6,000-9,500 capex. Datacenter cluster: $80,000+. Cloud rental for equivalent capacity: $1,300-12,000/year depending on duty cycle and reserved-vs-spot mix. API equivalent: $200-12,000/year depending on volume.

Is RunPod cheaper than AWS for AI?

Yes, dramatically. Going by the rates in this guide, RunPod spot pricing is roughly 70-85% cheaper per GPU than AWS on-demand. RTX 4090 on RunPod is $0.34/hr spot, with no comparable AWS consumer-tier instance; A100 on RunPod is $1.19/hr spot vs AWS p4d.24xlarge at $32.77/hr on-demand (~$4.10/hr per A100); H100 on RunPod is $1.99/hr spot vs AWS p5.48xlarge at $98.32/hr on-demand (~$12.30/hr per H100). AWS wins on enterprise integration; RunPod and Vast.ai win on raw cost.

What's the cheapest way to run a 32B LLM?

For sporadic use: API access (Together AI Qwen 3.5 32B at $0.30/$0.90 per 1M tokens). For sustained moderate use: 1x RTX 4090 spot rental at $0.30-0.50/hr. For sustained heavy use: owned RTX 4090 or RTX 5090 build at $2,500-4,500 capex amortized. The crossover from API to cloud rental is roughly 3M tokens/day, and from cloud rental to owned hardware roughly 30M tokens/day sustained.

How much VRAM do I need to self-host LLMs?

Practical minimums by model size at Q4_K_M quantization with 8K context: 7B needs 6 GB, 9B needs 8 GB, 14B needs 12 GB, 32B needs 24 GB, 72B needs 48 GB. The Qwen 3.5 VRAM matrix has the full breakdown including KV cache scaling. For frontier-class quality on a single consumer GPU, plan for 24 GB (RTX 4090 or 3090).
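
For a quick sanity check on those minimums, a rough estimator is sketched below. Both constants are assumptions, not measured values: about 4.85 bits per weight as an approximate average for Q4_K_M, and a flat 1.5 GB allowance for KV cache at 8K context plus runtime buffers. Real KV cache size depends on the model's architecture.

```python
Q4_K_M_BITS_PER_WEIGHT = 4.85  # assumption: approximate average for Q4_K_M
OVERHEAD_GB = 1.5              # assumption: KV cache (8K context) + buffers


def estimate_vram_gb(params_billions: float,
                     bits_per_weight: float = Q4_K_M_BITS_PER_WEIGHT,
                     overhead_gb: float = OVERHEAD_GB) -> float:
    """Crude VRAM estimate: quantized weights plus a flat overhead."""
    weights_gb = params_billions * bits_per_weight / 8
    return weights_gb + overhead_gb


# Practical minimums quoted above, as (model size in B, VRAM in GB).
QUOTED_MINIMUMS = [(7, 6), (9, 8), (14, 12), (32, 24), (72, 48)]

for size, quoted in QUOTED_MINIMUMS:
    print(f"{size}B: estimated {estimate_vram_gb(size):.1f} GB, "
          f"quoted minimum {quoted} GB")
```

The estimates land a little under each quoted minimum, which is consistent with those minimums being rounded up to real card sizes.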

Should I buy an RTX 4090 or RTX 5090 for LLMs?

RTX 5090 (32 GB, NVFP4 native) is a meaningful upgrade if you can find one at MSRP — it runs Qwen 3.5 32B at Q5_K_M comfortably where 4090 needs Q4_K_M. RTX 4090 (24 GB) is still the cost-per-VRAM sweet spot in mid-2026 at second-hand prices. Buy 5090 if money is no object; buy used 4090 if budget matters.

What's the API tipping point for switching to self-hosted?

For Qwen 3.5 32B-class quality, the API-to-cloud-rental crossover is roughly 3M tokens/day sustained. The cloud-rental-to-owned-hardware crossover is roughly 30M tokens/day for 18+ months. Below 3M tokens/day, never self-host. Above 100M tokens/day sustained, the math heavily favors owned hardware or reserved cloud GPU.

Run an API Pilot, Then Decide

The single biggest mistake I see with self-hosted LLM cost decisions is running the math on hypothetical volume rather than measured volume. Build with the API for 30 days, log every token, measure what you actually consume. Then check the volume against the crossover points in this guide. Most teams discover their real volume sits 30-60% below their forecast — meaning the API is the right answer where they assumed hardware was. The teams that genuinely need self-hosted infrastructure usually find out by hitting API rate limits, not by spreadsheet projection.
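
A minimal sketch of the pilot analysis: sum logged tokens per day, average them, and compare against the 3M and 30M crossover points from this guide. The record fields (`day`, `input_tokens`, `output_tokens`) and the sample data are hypothetical; map them to whatever usage fields your provider's responses expose.

```python
from collections import defaultdict
from datetime import date


def daily_totals(records: list) -> dict:
    """Sum input + output tokens per calendar day."""
    totals = defaultdict(int)
    for r in records:
        totals[r["day"]] += r["input_tokens"] + r["output_tokens"]
    return dict(totals)


def average_daily_tokens(records: list) -> float:
    """Mean tokens/day across days that appear in the log."""
    totals = daily_totals(records)
    return sum(totals.values()) / len(totals) if totals else 0.0


def scenario_for(avg_tokens_per_day: float) -> str:
    """Map measured volume onto this guide's crossover points."""
    if avg_tokens_per_day < 3_000_000:
        return "API"
    if avg_tokens_per_day < 30_000_000:
        return "Cloud GPU rental"
    return "Owned hardware (if sustained 12+ months)"


# Hypothetical sample log; a real pilot would have 30 days of entries.
records = [
    {"day": date(2026, 4, 1), "input_tokens": 1_200_000, "output_tokens": 800_000},
    {"day": date(2026, 4, 1), "input_tokens": 300_000, "output_tokens": 200_000},
    {"day": date(2026, 4, 2), "input_tokens": 900_000, "output_tokens": 600_000},
]
avg = average_daily_tokens(records)
print(f"Average: {avg:,.0f} tokens/day -> {scenario_for(avg)}")
```

Days with no traffic do not appear in the log, so they are not counted in the average. Divide by the pilot length instead if idle days should pull the number down.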
