Self-Hosted LLM Cost: Hardware vs Cloud GPU vs API (2026)
Below 3M tokens/day, the API wins. 3-30M, cloud GPU wins. Above 30M sustained, hardware pays back in 18-24 months. Real 2026 numbers.

Self-Hosted LLM Cost: The Crossover Points That Decide Build vs Rent vs API
Three options compete for any LLM-backed workload: own the hardware, rent cloud GPU, or pay an API per token. The math isn't subtle. Below 3 million tokens per day, the API wins on TCO every time — your hardware sits idle, your cloud-rental dollars buy nothing useful. Between 3M and 30M daily tokens, cloud GPU rental wins because spot pricing is cheap and you don't carry obsolescence risk. Above 30M daily tokens sustained for 12+ months, owned hardware pays back within 18-24 months versus equivalent cloud rental. These crossovers shift slightly with the model size and provider mix, but the structure is robust.
| Daily token volume | Best TCO | Why | 12-month cost (Qwen 32B-class) |
|---|---|---|---|
| Under 1M | API (Claude Haiku, GPT mini, Kimi) | Hardware idle 95%+ of the time | $200-1,200 |
| 1M - 3M | API (DeepSeek, GLM-5.1, Qwen via Together) | Cheap-tier APIs still beat capex | $1,000-4,500 |
| 3M - 10M | Cloud GPU spot (RunPod / Vast) | Spot rentals beat sustained API | $3,500-12,000 |
| 10M - 30M | Cloud GPU reserved (Lambda / AWS p4d) | Predictable workloads = reserved discount | $8,000-30,000 |
| 30M - 100M | Owned hardware (RTX 6000 Ada / dual 4090) | Hardware paid back in 18-24 mo | $8,000 capex + $1,200/yr power |
| 100M+ sustained | Datacenter cluster (8x H100 / H200) | Co-locate, amortize over 3-4 years | $80,000-200,000 capex |
Last updated: April 2026 — verified consumer GPU street prices on NVIDIA channels, RunPod / Vast.ai / Lambda 2026 spot rates, AWS p4d / p5 on-demand, Anthropic / OpenAI / DeepSeek / Zhipu / Together API pricing pages.
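The crossover table can be read as a simple lookup. The sketch below encodes its volume bands; treating each band as lower-inclusive is an assumption, and `recommend_tier` is a hypothetical helper name rather than anything from a real library.

```python
# Upper bounds (tokens/day) and labels copied from the crossover table.
# Assumption: a volume exactly on a boundary falls into the higher band.
TIERS = [
    (1_000_000, "API (Claude Haiku, GPT mini, Kimi)"),
    (3_000_000, "API (DeepSeek, GLM-5.1, Qwen via Together)"),
    (10_000_000, "Cloud GPU spot"),
    (30_000_000, "Cloud GPU reserved"),
    (100_000_000, "Owned hardware"),
]


def recommend_tier(tokens_per_day: int) -> str:
    """Return the table's best-TCO option for a sustained daily volume."""
    for upper_bound, tier in TIERS:
        if tokens_per_day < upper_bound:
            return tier
    return "Datacenter cluster"


for volume in (500_000, 8_000_000, 50_000_000, 250_000_000):
    print(f"{volume:>11,} tokens/day -> {recommend_tier(volume)}")
```

The thresholds are starting points only; model size and provider mix move them, as noted above.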
The Real Cost of Owning Hardware
The sticker price on a GPU is the smallest part of the TCO. The four real costs:
1. Hardware capex
| Build tier | GPU(s) | Total system cost | Comfortable model |
|---|---|---|---|
| Solo developer | 1x RTX 4090 24 GB | $2,500-3,500 | Qwen 3.5 14B Q5_K_M / 32B Q4_K_M |
| Small team | 1x RTX 5090 32 GB | $3,500-4,500 | Qwen 3.5 32B Q5_K_M (NVFP4) |
| Workstation | 2x RTX 4090 / 1x RTX 6000 Ada | $6,000-9,500 | Qwen 3.5 72B Q4_K_M / 32B FP16 |
| Apple unified | M3 Ultra 128 GB Mac Studio | $5,000-6,500 | Qwen 3.5 72B Q4 / 122B-A10B MoE |
| Datacenter-grade | 4x H100 80GB SXM cluster | $80,000-130,000 | DeepSeek V4, GLM-5.1, frontier models |
2. Power costs
An RTX 4090 runs at 350-450W under sustained inference; an RTX 5090 at 575W; an H100 at 700W per card. At $0.15/kWh (US average) and 50% duty cycle (12 hours/day at full tilt), that's $230-380/year per consumer GPU. Datacenter GPUs are dramatically worse on raw power, but co-location facilities typically include power in the rack rate.
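The power figures can be reproduced with a few lines of arithmetic. This is a minimal sketch using the wattages, the $0.15/kWh rate, and the 50% duty cycle quoted in the paragraph; `annual_power_cost` is a hypothetical helper name.

```python
HOURS_PER_YEAR = 24 * 365


def annual_power_cost(watts: float, duty_cycle: float = 0.5,
                      usd_per_kwh: float = 0.15) -> float:
    """Yearly electricity cost for a GPU drawing `watts` while active."""
    kwh_per_year = watts / 1000 * HOURS_PER_YEAR * duty_cycle
    return kwh_per_year * usd_per_kwh


for name, watts in [("RTX 4090 (low)", 350), ("RTX 4090 (high)", 450),
                    ("RTX 5090", 575), ("H100", 700)]:
    print(f"{name:<16} {watts} W -> ${annual_power_cost(watts):,.0f}/year")
```

The $230 end of the range corresponds to a 4090 at 350 W and the $380 end to a 5090 at 575 W. Swap in your local electricity rate and measured duty cycle before relying on the result.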
3. Opportunity cost of capital
$8,000 in workstation hardware sitting in a closet is $8,000 not earning the 4-5% you'd get in a savings account. Over 3 years that's $1,000-1,200 of foregone interest. Worth modeling explicitly when you're comparing capex vs operational rental.
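To model this explicitly, a small function is enough. The sketch below uses the $8,000 capex and 4-5% rates from the paragraph; the choice between simple and compound interest is an assumption you should match to your own alternative investment.

```python
def foregone_interest(capex: float, annual_rate: float, years: int,
                      compound: bool = True) -> float:
    """Interest the capex would have earned if left in savings."""
    if compound:
        return capex * ((1 + annual_rate) ** years - 1)
    return capex * annual_rate * years


for rate in (0.04, 0.05):
    simple = foregone_interest(8_000, rate, 3, compound=False)
    compounded = foregone_interest(8_000, rate, 3)
    print(f"{rate:.0%}: simple ${simple:,.0f}, compounded ${compounded:,.0f}")
```

At 4% compounded the result lands near $1,000, and at 5% simple interest it is $1,200, which brackets the range quoted above.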
4. Obsolescence
This is the silent killer. Consumer GPUs depreciate 30-40% in the first 18 months as new generations ship. RTX 5090 today; RTX 6090 at higher VRAM and new quant formats in 2027; RTX 7090 with whatever architectural shift in 2028. If you're running a single workstation for personal use you can ignore this — the GPU stays useful. For team-scale build-outs, plan for a 3-year hardware refresh cycle.
Watch out: The 24 GB consumer card sweet spot is unstable. NVIDIA has hinted that future generations may push consumer VRAM down (it's a margin lever), and frontier model sizes keep growing. The hardware that runs Qwen 3.5 32B today may not run Qwen 4.5's equivalent in 2027. Capex math should assume 18-24 month software obsolescence, not 5-year hardware lifespan.
Cloud GPU Rental: The Middle-Volume Sweet Spot
For 3-30M tokens/day, neither owned hardware nor commodity APIs is the right answer. The RunPod vs Vast.ai comparison covers the spot-marketplace landscape; here are the rates that drive the math:
| Provider | GPU | Spot $/hr | On-demand $/hr | Reserved (yearly) $/hr |
|---|---|---|---|---|
| RunPod | RTX 4090 24 GB | $0.34 | $0.69 | $0.45 |
| RunPod | A100 80 GB | $1.19 | $1.89 | $1.25 |
| RunPod | H100 80 GB | $1.99 | $2.99 | $2.20 |
| Vast.ai | RTX 4090 24 GB | $0.22-0.40 | — | — |
| Vast.ai | H100 80 GB | $1.50-2.30 | — | — |
| Lambda Labs | H100 80 GB | — | $2.49 | $1.99 |
| AWS | 8x A100 (p4d.24xlarge, per instance) | $8.20 | $32.77 | $19.66 (1-yr Savings Plan) |
| AWS | 8x H100 (p5.48xlarge, per instance) | — | $98.32 | $58.99 (1-yr Savings Plan) |
The "always-on" trap
The cheapest way to lose money on cloud GPU is to rent on-demand and forget to turn it off. A single RTX 4090 on RunPod on-demand ($0.69/hr) running 24/7 is $497/month, or about $6,040/year. The same card on Vast.ai spot at $0.30/hr, running only the 12 hours a day the workload needs it (50% utilization), is $1,314/year. Always rent spot for non-prod, always set auto-stop policies for prod, and always use reserved tiers if your workload is genuinely 24/7.
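The gap between always-on and right-sized rental is plain multiplication. A minimal sketch with the hourly rates quoted in this section; `rental_cost` is a hypothetical helper name.

```python
def rental_cost(usd_per_hour: float, hours_per_day: float = 24.0,
                days: int = 365) -> float:
    """Total rental bill for a GPU billed by the hour."""
    return usd_per_hour * hours_per_day * days


always_on_month = rental_cost(0.69, hours_per_day=24, days=30)
always_on_year = rental_cost(0.69)
spot_half_duty = rental_cost(0.30, hours_per_day=12)

print(f"RunPod on-demand, 24/7: ${always_on_month:,.0f}/month, "
      f"${always_on_year:,.0f}/year")
print(f"Vast.ai spot, 12 h/day: ${spot_half_duty:,.0f}/year")
```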
API Pricing: Where the Per-Token Math Pays Off
The 2026 API landscape is dramatically cheaper than 2024. The Qwen 3.5 32B equivalent through Together AI costs $0.30 input / $0.90 output per 1M tokens, pricing that was unthinkable two years ago. The full LLM API pricing comparison covers the broader provider matrix; here are the cost-effective tier picks for self-hosted-equivalent quality:
| Model | Provider | Input $/1M | Output $/1M | Cache discount |
|---|---|---|---|---|
| Qwen 3.5 32B Instruct | Together AI | $0.30 | $0.90 | — |
| Qwen 3.5 72B Instruct | Together AI | $0.90 | $2.70 | — |
| DeepSeek V4 | DeepSeek direct | $0.27 | $1.10 | 75% |
| GLM-5.1 | Zhipu AI | $0.50 | $1.50 | 50% |
| Kimi K2.6 | Moonshot | $0.30 | $1.00 | — |
| Claude Haiku 4.5 (ref) | Anthropic | $1.00 | $5.00 | 90% |
| Claude Sonnet 4.6 (ref) | Anthropic | $3.00 | $15.00 | 90% |
| GPT-5.4 (ref) | OpenAI | $2.50 | $10.00 | 50% |
The directional message: open-model API pricing has converged on roughly $0.30-0.50 input and $0.90-1.50 output per 1M tokens for Qwen-32B-class quality. On output tokens that's roughly 5-15x cheaper than Anthropic/OpenAI for similar capability on most tasks. For self-hosting to pay back, you need either substantially higher volume than the API breakeven point, or a hard requirement (data sovereignty, sub-10ms latency, custom fine-tunes) that APIs can't satisfy.
Real Workload Math: Three Scenarios
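Calculations below assume a 50/50 split between input and output tokens for simplicity, which gives DeepSeek V4 a blended rate of about $0.69 per 1M tokens (($0.27 + $1.10) / 2). Real workloads run anywhere from 30% to 70% input depending on task. Numbers are 12-month TCO.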
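The blended rate used in the scenarios is easy to recompute for a different input/output mix. A minimal sketch, using the DeepSeek V4 prices from the API table; both function names are hypothetical.

```python
def blended_rate(input_usd_per_m: float, output_usd_per_m: float,
                 input_share: float = 0.5) -> float:
    """Average price per 1M tokens for a given input/output mix."""
    if not 0.0 <= input_share <= 1.0:
        raise ValueError("input_share must be between 0 and 1")
    return input_share * input_usd_per_m + (1 - input_share) * output_usd_per_m


def annual_api_cost(tokens_per_day: float, usd_per_m_tokens: float,
                    days: int = 365) -> float:
    """Yearly API bill at a flat blended price per 1M tokens."""
    return tokens_per_day * days / 1_000_000 * usd_per_m_tokens


for share in (0.3, 0.5, 0.7):
    rate = blended_rate(0.27, 1.10, input_share=share)
    print(f"{share:.0%} input -> ${rate:.3f} per 1M tokens")

print(f"500K tokens/day at $0.69: ${annual_api_cost(500_000, 0.69):,.0f}/year")
```

Output-heavy workloads (30% input) cost noticeably more per token than input-heavy ones, so measuring your real mix matters before picking a scenario.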
Scenario A: Solo developer, 500K tokens/day
- API (DeepSeek V4): 500K × 365 = 182M tokens/year × $0.69/1M average = $126/year
- Cloud GPU: 1x RTX 4090 spot at $0.30/hr × 4 hrs/day actual use × 365 = $438/year
- Owned hardware: $3,000 RTX 4090 build + $230 power = $3,230 year 1
- Verdict: API wins, by about 3.5x over cloud GPU and about 25x over owned hardware in year 1. Don't even consider hardware at this volume.
Scenario B: Small team, 8M tokens/day
- API (DeepSeek V4): 8M × 365 = 2.92B tokens/year × $0.69/1M = $2,015/year
- Cloud GPU spot: 1x RTX 4090 at $0.30/hr × 12 hrs/day × 365 = $1,314/year
- Owned hardware: $3,000 RTX 4090 build + $230 power = $3,230 year 1, $230 year 2
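- Verdict: Cloud GPU spot wins year 1 by ~$700 over the API. If volume holds, owned hardware overtakes the API on cumulative cost in year 2 (about 20 months) but doesn't overtake cloud spot until year 3 (about 33 months).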
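Payback time for Scenario B follows from capex divided by the yearly savings against each alternative. The sketch below uses the scenario's own figures; `payback_months` is a hypothetical helper and ignores resale value and interest.

```python
import math


def payback_months(capex: float, owned_annual_opex: float,
                   alternative_annual_cost: float) -> float:
    """Months until owned hardware's cumulative cost drops below the alternative."""
    annual_savings = alternative_annual_cost - owned_annual_opex
    if annual_savings <= 0:
        return math.inf  # the alternative is cheaper to run; capex never pays back
    return capex / annual_savings * 12


vs_api = payback_months(3_000, 230, 2_015)
vs_spot = payback_months(3_000, 230, 1_314)
print(f"Payback vs API:        {vs_api:.1f} months")
print(f"Payback vs cloud spot: {vs_spot:.1f} months")
```

Which alternative you compare against changes the answer substantially, so state the baseline whenever you quote a payback period.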
Scenario C: Growth-stage startup, 50M tokens/day
- API (DeepSeek V4): 50M × 365 = 18.25B tokens/year × $0.69/1M = $12,593/year
- Cloud GPU reserved: 1x A100 at $1.25/hr × 24 hrs × 365 = $10,950/year
- Owned hardware: $9,500 RTX 6000 Ada workstation + $400 power = $9,900 year 1, $400 year 2-3
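- Verdict: On these numbers owned hardware is already cheapest in year 1 ($9,900 vs $10,950 reserved vs $12,593 API), and by year 2 the cumulative gap is wide ($10,300 vs $21,900 vs $25,186). Cloud reserved is a fine year-1 choice if you aren't yet sure the volume will hold; the API is fine until you scale to higher concurrency.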
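Laying out cumulative cost by year makes the Scenario C comparison concrete. A minimal sketch using the scenario's figures; it assumes flat volume and prices and ignores hardware resale value.

```python
def cumulative_cost(years: int, annual_cost: float, capex: float = 0.0) -> float:
    """Total spend after `years`, with capex paid up front."""
    return capex + annual_cost * years


# Annual costs and capex copied from Scenario C.
OPTIONS = {
    "API (DeepSeek V4)": {"annual_cost": 12_593},
    "Cloud reserved (A100)": {"annual_cost": 10_950},
    "Owned (RTX 6000 Ada)": {"annual_cost": 400, "capex": 9_500},
}

for years in (1, 2, 3):
    row = ", ".join(
        f"{name}: ${cumulative_cost(years, **params):,.0f}"
        for name, params in OPTIONS.items()
    )
    print(f"Year {years}: {row}")
```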
Pro tip: The TCO math gets dramatically better when you can use a smaller model for 80% of work. A Qwen 3.5 9B on a $1,500 RTX 4060 Ti 16GB build handles routine completions, and you only invoke the API for the hard 20%. This "tiered" architecture often beats both pure-API and pure-hardware on cost — covered in the best GPU for LLMs analysis.
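A rough way to size the tiered approach is to price only the API share of traffic and add the local rig's yearly cost. Everything in this sketch is illustrative: the 8M tokens/day volume, the choice of Claude Sonnet 4.6 at a 50/50 blended $9 per 1M tokens for the hard 20%, and the roughly $110/year power estimate for the local card are assumptions, not figures from the article.

```python
def tiered_annual_cost(tokens_per_day: float, local_share: float,
                       api_usd_per_m: float, local_annual_cost: float,
                       days: int = 365) -> float:
    """Yearly cost when `local_share` of tokens run locally and the rest hit an API."""
    api_million_tokens = tokens_per_day * (1 - local_share) * days / 1_000_000
    return api_million_tokens * api_usd_per_m + local_annual_cost


TOKENS_PER_DAY = 8_000_000           # assumed volume
API_RATE = (3.00 + 15.00) / 2        # assumed 50/50 blend of Sonnet 4.6 prices
LOCAL_YEAR_1 = 1_500 + 110           # $1,500 build plus assumed ~$110 power

pure_api = tiered_annual_cost(TOKENS_PER_DAY, 0.0, API_RATE, 0.0)
tiered = tiered_annual_cost(TOKENS_PER_DAY, 0.8, API_RATE, LOCAL_YEAR_1)
print(f"Pure API: ${pure_api:,.0f}/year")
print(f"Tiered (80% local): ${tiered:,.0f} in year 1")
```

The saving depends heavily on which API the hard requests go to; against a cheap open-model API the gap is much smaller.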
Hidden Costs Nobody Mentions
API hidden costs
- Rate limits: Hitting the per-minute or per-day cap forces you to pay for higher tier OR queue your work. Anthropic and OpenAI tiers escalate aggressively with usage.
- Cold-start latency: Some providers have meaningful TTFT variance (200ms-2s). For interactive UX this requires caching or fallback paths you have to build and maintain.
- Vendor risk: APIs can be deprecated, repriced, or rate-limited. Anthropic deprecated Claude 3 Sonnet in 2025; OpenAI repeatedly changes per-minute caps. Migration costs aren't zero.
Cloud GPU hidden costs
- Spot interruption: Spot instances can disappear on very short notice (AWS gives a two-minute warning; some providers give 30 seconds or less). For batch work this is fine; for interactive, you need fallback infrastructure.
- Data egress: AWS charges $0.09/GB outbound; pulling a 100GB model fine-tune dataset out of S3 costs $9 every time. Co-locate compute and storage in the same region.
- Setup time: Each rental session needs container pull, model load, warm-up. For sub-1-hour sessions this is meaningful overhead.
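Two of these hidden costs are easy to quantify up front. The sketch below uses the $0.09/GB egress rate and the 100 GB dataset from the list; the 10-minute setup and 45-minute session are made-up illustrative values.

```python
def egress_cost(gigabytes: float, usd_per_gb: float = 0.09) -> float:
    """Cost of moving data out of the cloud provider."""
    return gigabytes * usd_per_gb


def setup_overhead_share(setup_minutes: float, session_minutes: float) -> float:
    """Fraction of a paid session spent on container pull, model load, warm-up."""
    if session_minutes <= 0:
        raise ValueError("session_minutes must be positive")
    return setup_minutes / session_minutes


print(f"100 GB pull: ${egress_cost(100):.2f} each time")
print(f"10 min setup in a 45 min session: {setup_overhead_share(10, 45):.0%} overhead")
```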
Owned hardware hidden costs
- Operational overhead: Drivers, llama.cpp updates, CUDA versions, kernel updates that break your inference setup. Realistically 2-4 hours/month of engineer time just to keep the rig healthy.
- Cooling and physical space: An RTX 4090 puts out 1500 BTU/hr under load. In a small office without dedicated cooling, that's a problem.
- Single-point-of-failure risk: Workstation goes down, your inference pipeline goes down. Need a fallback API key for outages.
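The fallback mentioned in the last bullet can be a thin wrapper around two backends. This is a sketch only: `local_rig` and `hosted_api` are hypothetical stand-ins for your own client functions, and a production version would catch narrower exception types and add timeouts.

```python
from typing import Callable


def generate_with_fallback(prompt: str,
                           primary: Callable[[str], str],
                           fallback: Callable[[str], str]) -> tuple[str, str]:
    """Try the primary backend; on any error, answer from the fallback."""
    try:
        return primary(prompt), "primary"
    except Exception as exc:  # broad on purpose for the sketch
        print(f"primary backend failed ({exc!r}); using fallback")
        return fallback(prompt), "fallback"


def local_rig(prompt: str) -> str:
    """Hypothetical local inference call that simulates an outage."""
    raise ConnectionError("workstation unreachable")


def hosted_api(prompt: str) -> str:
    """Hypothetical hosted-API call."""
    return f"[api] {prompt}"


text, source = generate_with_fallback("summarize this log", local_rig, hosted_api)
print(source, "->", text)
```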
The Build-or-Buy Decision Framework
Run through these questions in order. The answer to the first one that triggers is your call.
- Is your daily token volume under 3M and likely to stay there for 12+ months? → API. Don't overthink it.
- Do you have a hard data-sovereignty requirement (regulated industry, strict EU GDPR interpretation, India's DPDP Act, China's data-localization rules)? → Owned hardware or self-hosted on rented hardware in the right jurisdiction. Use the India self-hosting guide for India-specific options.
- Is your workload bursty (spike to high volume, then quiet)? → Cloud GPU spot. Don't pay for capacity you don't use.
- Is your workload steady high volume (sustained 24/7)? → Cloud GPU reserved (year 1) or owned hardware (year 2+). Crossover is roughly 20-30M tokens/day.
- Are you doing custom fine-tuning at scale? → Owned hardware or rented GPU clusters. Fine-tuning needs full-VRAM access that APIs don't provide.
- Default for everything else: API. The math has shifted heavily toward APIs in 2026; the bar for self-hosting is higher than it was in 2024.
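Because the first matching question decides, the framework maps directly onto an ordered chain of checks. A minimal sketch; the `Workload` fields are hypothetical names for the answers to each question.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    tokens_per_day: int
    stable_for_12_months: bool = True
    data_sovereignty: bool = False
    bursty: bool = False
    steady_24_7: bool = False
    fine_tuning_at_scale: bool = False


def decide(w: Workload) -> str:
    """Walk the questions in order and return the first answer that triggers."""
    if w.tokens_per_day < 3_000_000 and w.stable_for_12_months:
        return "API"
    if w.data_sovereignty:
        return "Owned or rented hardware in the right jurisdiction"
    if w.bursty:
        return "Cloud GPU spot"
    if w.steady_24_7:
        return "Cloud GPU reserved (year 1) or owned hardware (year 2+)"
    if w.fine_tuning_at_scale:
        return "Owned hardware or rented GPU clusters"
    return "API"


print(decide(Workload(tokens_per_day=500_000)))
print(decide(Workload(tokens_per_day=40_000_000, steady_24_7=True)))
```

Order matters here: a low-volume workload with a sovereignty requirement still returns "API" from the first check, exactly as the list is written, so adjust the order if compliance should override volume for you.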
Pro tip: If you're on the fence, run a 30-day API pilot with full token logging before committing to hardware. Real production volume is almost always lower than estimates, and the pilot data tells you exactly which scenario you're in. The advanced cost-modeling spreadsheets I use for client engagements go out to newsletter subscribers.
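Pilot logging only needs a date and two token counts per request. The sketch below aggregates such records into the two numbers the cost math depends on: average daily tokens and input share. The record layout and the sample `pilot_log` values are hypothetical.

```python
from collections import defaultdict
from datetime import date

# Each record: (day, input_tokens, output_tokens), one per API call.
pilot_log = [
    (date(2026, 4, 1), 1_200, 300),
    (date(2026, 4, 1), 800, 700),
    (date(2026, 4, 2), 2_000, 1_000),
]


def average_daily_tokens(records) -> float:
    """Mean total tokens per logged day."""
    totals = defaultdict(int)
    for day, input_tokens, output_tokens in records:
        totals[day] += input_tokens + output_tokens
    if not totals:
        raise ValueError("no records logged")
    return sum(totals.values()) / len(totals)


def input_share(records) -> float:
    """Fraction of all logged tokens that were input tokens."""
    total_in = sum(r[1] for r in records)
    total_out = sum(r[2] for r in records)
    if total_in + total_out == 0:
        raise ValueError("no tokens logged")
    return total_in / (total_in + total_out)


print(f"Average daily tokens: {average_daily_tokens(pilot_log):,.0f}")
print(f"Input share: {input_share(pilot_log):.0%}")
```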
Frequently Asked Questions
When does self-hosting LLMs pay off?
Self-hosting on owned hardware pays back versus cloud rental at roughly 30M tokens/day sustained for 18-24 months. Below that volume, cloud GPU rental or API consumption is cheaper. Self-hosting also wins when you have hard data-sovereignty requirements, sub-10ms latency needs, or custom fine-tunes — regardless of volume.
How much does it cost to host a private LLM?
Solo developer hardware: $2,500-4,500 capex (RTX 4090 / 5090 build) plus $230-380/year power. Workstation tier: $6,000-9,500 capex. Datacenter cluster: $80,000+. Cloud rental for equivalent capacity: $1,300-12,000/year depending on duty cycle and reserved-vs-spot mix. API equivalent: $200-12,000/year depending on volume.
Is RunPod cheaper than AWS for AI?
Yes, dramatically. On the rates above, RunPod spot pricing is roughly 70-85% cheaper than AWS on-demand per GPU. There is no comparable AWS consumer-tier instance for the RTX 4090, which is $0.34/hr spot on RunPod; an H100 on RunPod is $1.99/hr spot vs AWS p5.48xlarge at $98.32/hr on-demand (~$12.30/hr per H100). AWS wins on enterprise integration; RunPod and Vast.ai win on raw cost.
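AWS prices whole 8-GPU instances, so the comparison needs a per-GPU normalization first. A minimal sketch using the rates from the rental table; note that it compares list prices only, and that the p4d's A100s are the 40 GB variant while the RunPod row is the 80 GB card.

```python
def per_gpu_rate(instance_usd_per_hour: float, gpus: int) -> float:
    """Hourly price of one GPU inside a multi-GPU instance."""
    return instance_usd_per_hour / gpus


def percent_cheaper(rate: float, baseline: float) -> float:
    """How much cheaper `rate` is than `baseline`, in percent."""
    return (1 - rate / baseline) * 100


aws_h100 = per_gpu_rate(98.32, 8)   # p5.48xlarge on-demand
aws_a100 = per_gpu_rate(32.77, 8)   # p4d.24xlarge on-demand

print(f"AWS H100: ${aws_h100:.2f}/hr, RunPod spot is "
      f"{percent_cheaper(1.99, aws_h100):.0f}% cheaper")
print(f"AWS A100: ${aws_a100:.2f}/hr, RunPod spot is "
      f"{percent_cheaper(1.19, aws_a100):.0f}% cheaper")
```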
What's the cheapest way to run a 32B LLM?
For sporadic use: API access (Together AI Qwen 3.5 32B at $0.30/$0.90 per 1M tokens). For sustained moderate use: 1x RTX 4090 spot rental at $0.30-0.50/hr. For sustained heavy use: owned RTX 4090 or RTX 5090 build at $2,500-4,500 capex amortized. The crossover from rented GPU to owned hardware is roughly 30M tokens/day sustained.
How much VRAM do I need to self-host LLMs?
Practical minimums by model size at Q4_K_M quantization with 8K context: 7B needs 6 GB, 9B needs 8 GB, 14B needs 12 GB, 32B needs 24 GB, 72B needs 48 GB. The Qwen 3.5 VRAM matrix has the full breakdown including KV cache scaling. For frontier-class quality on a single consumer GPU, plan for 24 GB (RTX 4090 or 3090).
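A back-of-envelope estimate shows where these minimums come from: quantized weights plus KV cache. The sketch below assumes about 4.85 bits per weight for Q4_K_M (an approximation) and an illustrative 32B-class layout of 64 layers, 8 KV heads, and head dimension 128 with an FP16 cache; none of these are confirmed Qwen 3.5 specifications, and runtime overhead is not included.

```python
def weights_gb(params_billions: float, bits_per_weight: float = 4.85) -> float:
    """Approximate size of quantized weights in GB (10^9 bytes)."""
    return params_billions * bits_per_weight / 8


def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """KV cache size: keys and values for every layer, KV head, and token."""
    total_bytes = (2 * layers * kv_heads * head_dim
                   * context_tokens * bytes_per_value)
    return total_bytes / 1e9


# Illustrative 32B-class layout (assumed, not a published spec).
w = weights_gb(32)
kv = kv_cache_gb(layers=64, kv_heads=8, head_dim=128, context_tokens=8192)
print(f"Weights: {w:.1f} GB, KV cache at 8K context: {kv:.1f} GB, "
      f"total: {w + kv:.1f} GB")
```

Under these assumptions a 32B model lands a little over 21 GB before runtime overhead, which is why 24 GB is a practical floor and why longer contexts push it past a single consumer card.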
Should I buy an RTX 4090 or RTX 5090 for LLMs?
RTX 5090 (32 GB, NVFP4 native) is a meaningful upgrade if you can find one at MSRP — it runs Qwen 3.5 32B at Q5_K_M comfortably where 4090 needs Q4_K_M. RTX 4090 (24 GB) is still the cost-per-VRAM sweet spot in mid-2026 at second-hand prices. Buy 5090 if money is no object; buy used 4090 if budget matters.
What's the API tipping point for switching to self-hosted?
For Qwen 3.5 32B-class quality, the API-to-cloud-rental crossover is roughly 3M tokens/day sustained. The cloud-rental-to-owned-hardware crossover is roughly 30M tokens/day for 18+ months. Below 3M tokens/day, never self-host. Above 100M tokens/day sustained, the math heavily favors owned hardware or reserved cloud GPU.
Run an API Pilot, Then Decide
The single biggest mistake I see with self-hosted LLM cost decisions is running the math on hypothetical volume rather than measured volume. Build with the API for 30 days, log every token, measure what you actually consume. Then check the volume against the crossover points in this guide. Most teams discover their real volume sits 30-60% below their forecast — meaning the API is the right answer where they assumed hardware was. The teams that genuinely need self-hosted infrastructure usually find out by hitting API rate limits, not by spreadsheet projection.
Written by
Abhishek Patel
Infrastructure engineer with 10+ years building production systems on AWS, GCP, and bare metal. Writes practical guides on cloud architecture, containers, networking, and Linux for developers who want to understand how things actually work under the hood.