LLM Latency: TTFT, ITL, and Why End-User Latency Isn't What You Think
LLM latency decomposes into TTFT (time to first token, 300-1500ms), ITL (inter-token latency, 10-30ms per token), and total time. Each has different causes and fixes. This article covers why streaming dominates UX, when Cerebras/Groq beat Claude on speed, and the optimization playbook.

What "Latency" Actually Decomposes Into
For LLM-backed applications, "the API is slow" is not a useful diagnosis. The latency a user sees decomposes into three distinct components, each with different causes and different fixes:
- TTFT (Time To First Token): how long from "user clicks send" to "first character appears." Typical: 300-1500ms. Dominated by prompt-processing time, model load, and queue depth.
- ITL (Inter-Token Latency): how long between successive tokens. Typical: 10-30ms per token. Determined by model size, decoding speed, and KV cache locality.
- Total time: TTFT + (ITL × output_length). For a 500-token response, total is roughly TTFT + 5-15 seconds.
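The total-time formula above is easy to sanity-check in a few lines. This is a minimal sketch; the function name `total_latency_s` is made up for illustration, and the inputs are simply the low and high ends of the typical ranges quoted above.

```python
def total_latency_s(ttft_ms: float, itl_ms: float, output_tokens: int) -> float:
    """Total time = TTFT + ITL x output_length, returned in seconds."""
    return (ttft_ms + itl_ms * output_tokens) / 1000.0


# A 500-token response at the low and high ends of the typical ranges.
fast = total_latency_s(ttft_ms=300, itl_ms=10, output_tokens=500)
slow = total_latency_s(ttft_ms=1500, itl_ms=30, output_tokens=500)
print(f"best case: {fast:.1f}s, worst case: {slow:.1f}s")
```

Even in the best case the decode phase (5 seconds) is far larger than TTFT (0.3 seconds), which is why what the user sees during those seconds matters so much.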
Streaming changes perceived latency dramatically even when total time is identical. Users react to the first token; once tokens start arriving, the brain interprets it as "the AI is responding" rather than "the AI is still thinking." Two responses with identical 12-second total time feel completely different if one shows the first token at 300ms and the other shows nothing for 12 seconds. Streaming is the single biggest perceived-latency optimization for LLM apps, and most apps that don't stream are leaving real UX value on the table.
For deeper LLM inference mechanics see LLM inference: tokens, context, and KV cache. This article is the practical playbook for measuring and optimizing TTFT and ITL specifically.
What Drives TTFT
Prompt-Processing Time (the dominant factor)
Before the model can produce the first output token, it processes the entire input prompt — building the KV cache that subsequent decoding will use. Time scales with: prompt length × model size. A 100K-token prompt processed by Opus 4.7 takes roughly 600-900ms before first token; the same prompt on Haiku 4.5 takes 200-300ms.
This is why LLM prompt caching is so impactful for TTFT — a cached prefix skips most of the prompt-processing time. Anthropic's 90% cache discount isn't just a cost saving; it's also a TTFT optimization that often reduces TTFT by 60-80% on cached requests.
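As a rough sketch of what opting in looks like with Anthropic's Messages API, the long, stable part of the prompt goes in a system block marked with `cache_control`. The helper name `build_cached_request` and the prompt text are placeholders, the model id is the one used in this article's timing example, and very short prompts may fall below the provider's minimum cacheable length, so treat this as a starting point rather than a guarantee of a cache hit.

```python
def build_cached_request(system_prompt: str, user_message: str) -> dict:
    """Build Messages API kwargs with the system prompt marked as cacheable."""
    return {
        "model": "claude-sonnet-4-6",  # model id reused from this article's timing example
        "max_tokens": 500,
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                # Marks the prompt up to and including this block as a cacheable prefix.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": user_message}],
    }


request = build_cached_request("You are a support assistant. <long policy text>", "Hello")
# With the SDK installed and an API key set, this would be sent as:
#   Anthropic().messages.create(**request)
print(request["system"][0]["cache_control"])
```

The cache only helps if the prefix is byte-for-byte stable across requests, so keep per-user or per-request content out of the cached block.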
Model Load and Cold Start
For self-hosted inference, the first request to a freshly loaded model pays a cold-start penalty (200-2000ms) for KV cache initialization, weight paging, and GPU memory layout. Hosted APIs (Anthropic, OpenAI, Gemini) hide this by keeping models warm, but for self-hosted vLLM / TGI / Triton deployments, cold start is a real concern. Either keep models always warm or accept the latency on the first request after idle.
Queue Depth (the variable factor)
Hosted APIs have queues. If the provider's serving infrastructure is busy, your request waits. This is the source of high TTFT variance: p50 might be 400ms while p99 reaches 3000ms during peak hours. The provider's batch-decoding scheduler interleaves requests; if the requests already in flight are decoding long outputs, yours waits longer for a slot.
For latency-sensitive UX, route to multiple providers and pick the fastest live response. For cost-sensitive workloads, accept the variance.
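One way to sketch the "race several providers" idea is with `asyncio`: start every call, take whichever finishes first, and cancel the rest. The `fake_provider` coroutine below is a stand-in for a real client call, and its delays are invented. A production version would more likely race on first-token arrival of a streaming call than on full completion.

```python
import asyncio


async def race(calls):
    """Run all provider calls concurrently; return the first result, cancel the rest."""
    tasks = [asyncio.create_task(c) for c in calls]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()
    # Wait for the cancellations to settle so no task is left dangling.
    await asyncio.gather(*pending, return_exceptions=True)
    # Note: if the first task to finish failed, .result() re-raises its exception.
    return next(iter(done)).result()


async def fake_provider(name: str, delay_s: float) -> str:
    # Hypothetical stand-in: delay_s simulates queueing plus TTFT.
    await asyncio.sleep(delay_s)
    return name


winner = asyncio.run(race([fake_provider("provider_a", 0.2),
                           fake_provider("provider_b", 0.01)]))
print(winner)
```

Every raced request is billed even when its response is discarded, which is why this pattern suits latency-sensitive paths rather than cost-sensitive ones.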
What Drives ITL
Model Size
Bigger models decode slower. Each token requires a forward pass through the entire model. Approximate ITL by model size on H100-class hardware:
- Haiku 4.5 / Sonnet 4.6: 8-12ms ITL
- Opus 4.7: 12-15ms ITL
- GPT-5 / GPT-5.4: 10-15ms ITL
- Gemini 3.1 Pro: 12-18ms ITL (varies more by region)
- DeepSeek V4 (1T MoE, 32B active): 20-30ms ITL — the routing overhead between experts adds compute
- Qwen 3.5 32B (self-hosted on H100): 8-12ms ITL
KV Cache Locality
For long sequences, KV cache fits in GPU memory but accesses are bandwidth-limited. As context grows past ~32K tokens, memory bandwidth becomes the bottleneck (not compute), and ITL grows. A 1M-token context decodes meaningfully slower than a 1K-token context — typically 2-4x slower per token.
For Gemini 3.1 Pro at 1.5M context, ITL stretches from ~12ms (short context) to ~40ms (max context). For interactive use, keep context under 200-400K tokens to maintain reasonable ITL.
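A back-of-the-envelope calculation shows why decode becomes bandwidth-bound as context grows: every new token has to read the whole KV cache. The model dimensions below are hypothetical and not taken from any model named in this article. The 3350 GB/s figure is the published HBM bandwidth of an H100 SXM, and the result is only a floor, since it ignores reading the weights themselves.

```python
def kv_cache_bytes(context_tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Keys plus values stored for every layer and every token in context."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * context_tokens


def kv_read_floor_ms(context_tokens, bandwidth_gb_s, **dims):
    """Lower bound on per-token decode time from reading the full KV cache once."""
    return kv_cache_bytes(context_tokens, **dims) / (bandwidth_gb_s * 1e9) * 1000


# Hypothetical dense model with grouped-query attention and 16-bit cache values.
dims = dict(layers=64, kv_heads=8, head_dim=128)
for ctx in (1_000, 32_000, 200_000):
    floor_ms = kv_read_floor_ms(ctx, bandwidth_gb_s=3350, **dims)
    print(f"{ctx:>7} tokens: at least {floor_ms:.2f} ms per token for KV reads")
```

The floor grows linearly with context length, so it is negligible at 1K tokens and comparable to the whole ITL budget at a few hundred thousand.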
Decoding Strategy
- Greedy / temp 0: fastest decoding (single token per pass)
- Beam search: 2-4x slower (multiple parallel candidates) — rarely used in 2026
- Speculative decoding: a small "draft" model proposes tokens, the big model verifies. 2-3x ITL improvement on average. Many providers use it transparently. See "How to Optimize ITL" below.
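To get a feel for where the speculative-decoding speedup comes from, a common simplified model assumes each drafted token is accepted independently with some probability. Under that assumption the expected number of tokens per big-model pass has a closed form. The function name and the acceptance rates below are illustrative, not measurements.

```python
def expected_tokens_per_pass(acceptance_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per big-model verification pass.

    Assumes each drafted token is accepted independently with probability
    acceptance_rate, and that the verifier always contributes one token of
    its own (a correction or a bonus token), so the result is at least 1.
    """
    a, k = acceptance_rate, draft_len
    if a >= 1.0:
        return float(k + 1)
    return (1 - a ** (k + 1)) / (1 - a)


for a in (0.6, 0.8, 0.9):
    print(f"acceptance {a:.1f}: {expected_tokens_per_pass(a, draft_len=4):.2f} tokens per pass")
```

The realized ITL gain is smaller than these numbers because the draft model's own time is not free, which is consistent with the 2-3x figure quoted above.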
Provider Differences That Matter
| Provider | Typical TTFT | Typical ITL | Notes |
|---|---|---|---|
| Anthropic Sonnet 4.6 | ~280ms | ~10ms | Lowest TTFT for typical Sonnet workload in 2026 |
| Anthropic Opus 4.7 | ~450ms | ~14ms | Bigger model, slightly higher TTFT and ITL |
| OpenAI GPT-5.4 | ~380ms | ~12ms | Comparable to Sonnet |
| Gemini 3.1 Pro (Vertex AI) | ~520ms | ~14ms | Slightly higher TTFT due to Google's auth overhead |
| Cerebras (Llama 4 405B) | ~150ms | ~2-3ms | Wafer-scale chip, dramatically lower ITL |
| Groq (Llama 4) | ~140ms | ~3-4ms | Custom LPU hardware, very low ITL |
| Self-hosted vLLM Qwen 3.5 32B (1× H100) | ~250ms | ~10ms | Comparable to hosted, depends on tuning |
Cerebras and Groq stand out because they're purpose-built inference hardware (CS-2 wafer-scale, LPU respectively) optimized specifically for low ITL. The trade-off: model selection is limited (mostly Llama family open-weights), and quality-per-token is below frontier-tier Claude / GPT / Gemini. For real-time UX where ITL is the bottleneck and "good enough" beats "best," they're the right pick. For frontier-quality work, Anthropic / OpenAI / Gemini.
How to Measure TTFT and ITL
vLLM /metrics Endpoint
```python
import requests

# Fetch the Prometheus-format metrics exposed by a local vLLM server.
metrics = requests.get('http://localhost:8000/metrics').text

# Look for these Prometheus metrics:
#   vllm:time_to_first_token_seconds    (histogram)
#   vllm:time_per_output_token_seconds  (histogram)
#   vllm:request_prompt_tokens          (histogram)
#   vllm:e2e_request_latency_seconds    (histogram)
```
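Once the metrics text is fetched, a mean can be pulled out of any of those histograms by dividing its `_sum` series by its `_count` series. The parser below is a minimal sketch that assumes the standard Prometheus text format without trailing timestamps. The `sample` text is fabricated for the demo, and `histogram_mean` is a hypothetical helper, not part of vLLM.

```python
from typing import Optional


def histogram_mean(metrics_text: str, metric: str) -> Optional[float]:
    """Mean of a Prometheus histogram: sum / count, added up across label sets."""
    total, count = 0.0, 0.0
    for line in metrics_text.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip HELP/TYPE comments and blank lines
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        if name == f"{metric}_sum":
            total += float(value)
        elif name == f"{metric}_count":
            count += float(value)
    return total / count if count else None


# Fabricated sample in the same shape as a vLLM /metrics response.
sample = '''# TYPE vllm:time_to_first_token_seconds histogram
vllm:time_to_first_token_seconds_sum{model_name="demo"} 12.5
vllm:time_to_first_token_seconds_count{model_name="demo"} 50.0
'''
mean_ttft = histogram_mean(sample, "vllm:time_to_first_token_seconds")
print(mean_ttft)
```

A mean hides tail behavior. For p99, use the `_bucket` series with a Prometheus `histogram_quantile` query instead.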
OpenAI / Anthropic SDK Timing
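```python
import time

from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

ttft = None
last_time = None
token_times = []  # gaps between streamed text chunks; a chunk may hold more than one token

start = time.perf_counter()
with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=500,
    messages=[{"role": "user", "content": "Hello"}],
) as stream:
    for text in stream.text_stream:
        now = time.perf_counter()
        if ttft is None:
            ttft = now - start  # time until the first streamed text arrives
        else:
            token_times.append(now - last_time)
        last_time = now

if ttft is None:
    print("No text was streamed")
else:
    # A single-chunk response has no gaps to take a median of.
    itl_p50 = sorted(token_times)[len(token_times) // 2] if token_times else 0.0
    print(f"TTFT: {ttft * 1000:.0f}ms, ITL p50: {itl_p50 * 1000:.0f}ms")
```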
Custom Logger Middleware (for OpenAI-compatible APIs)
Wrap the OpenAI SDK with a streaming-aware middleware that records first-token-arrival and subsequent inter-token deltas. Many production stacks use this with Datadog or Honeycomb tracing. For broader observability patterns see AI observability.
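A provider-agnostic version of that middleware can be sketched as a generator that passes chunks through untouched while timing them. Everything here is illustrative: `timed_stream` and the `on_done` callback are hypothetical names, and the callback is where a real stack would emit a span or metric to its tracing backend.

```python
import time
from typing import Callable, Iterable, Iterator


def timed_stream(chunks: Iterable[str],
                 on_done: Callable[[dict], None],
                 clock: Callable[[], float] = time.perf_counter) -> Iterator[str]:
    """Pass chunks through unchanged while recording TTFT and inter-chunk gaps."""
    start = clock()  # starts when the consumer begins iterating
    last = None
    ttft = None
    gaps = []
    try:
        for chunk in chunks:
            now = clock()
            if last is None:
                ttft = now - start
            else:
                gaps.append(now - last)
            last = now
            yield chunk
    finally:
        # Runs on normal completion and also if the consumer stops early.
        on_done({"ttft_s": ttft, "gaps_s": gaps})


records = []
text = "".join(timed_stream(iter(["Hel", "lo", "!"]), records.append))
print(text, len(records[0]["gaps_s"]))
```

Because the clock starts on first iteration, wrap the stream as close to request dispatch as possible, or the measured TTFT will understate what the user saw.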
How to Optimize TTFT
- Aggressive prompt caching: 60-80% TTFT improvement on cached requests. The single highest-leverage optimization. See Anthropic vs OpenAI prompt caching for provider details.
- Smaller model for TTFT-bound calls: if your workload is "many short responses where TTFT matters more than depth," route to Sonnet / Haiku instead of Opus. The TTFT difference is often 100-200ms.
- Geographic colocation: API calls from the same region as the provider have ~40-80ms lower TTFT than cross-region. For India see India cloud latency.
- Connection reuse and HTTP/2: avoid TLS handshake on every request. Use HTTP/2 or HTTP/3 for multiplexing. The SDK handles this; verify if you've written your own client.
- Pre-warmed sessions: for self-hosted vLLM, keep the GPU warm with periodic dummy requests during off-hours. Cold-start penalty is real.
How to Optimize ITL
- Smaller model for ITL-bound use cases: a Sonnet 4.6 response at 10ms/token completes a 500-token output in 5 seconds. An Opus 4.7 response at 14ms/token takes 7 seconds. For real-time UX where ITL matters more than depth, prefer smaller.
- Speculative decoding: a small draft model proposes 4-8 tokens, the big model verifies. If the verification confirms, you get 4-8 tokens for the price of one. 2-3x ITL improvement. Most hosted providers use it transparently; self-hosted setups need explicit configuration in vLLM 0.6+.
- Quantization: Q8 → Q4 → Q3 reduces model size, which reduces memory-bandwidth bottleneck, which reduces ITL. For self-hosted: see Qwen 3.5 GGUF quantization and KV cache quantization.
- Custom inference hardware: Cerebras / Groq for ~3x ITL improvement on supported models. Not for every workload but transformative for real-time UX.
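- Batch decoding: serving multiple requests in parallel through a shared GPU raises aggregate throughput and lowers cost per token (better hardware utilization), but it does not lower per-request ITL; as the batch grows, each request's ITL typically stays flat or rises slightly. Self-hosted vLLM does this automatically with continuous batching, and hosted APIs do too. For ITL-sensitive self-hosted workloads, cap the batch size (max_num_seqs in vLLM) rather than maximizing it.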
The UX Pattern: Streaming + Early Filler
Sub-200ms TTFT is hard. A widely-used UX pattern: while waiting for the LLM's first token, show a loading state or "thinking..." text immediately, then swap to streamed output as soon as it arrives. This makes 800ms TTFT feel like 0ms because the user gets immediate visual feedback.
Some apps go further: pre-generate the first 1-2 sentences with a smaller model (50-100ms TTFT) and stream those while the larger model produces the full response in the background. The first chunk feels instant; the full response is high-quality. The trick is preserving coherence across the swap.
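The basic placeholder-then-swap pattern can be sketched as a generator of UI events. The event names (`show_placeholder`, `replace_placeholder`, `append`, `clear_placeholder`) are invented for this example. A real front end would map them onto whatever its rendering layer expects.

```python
from typing import Iterable, Iterator, Tuple


def with_placeholder(stream: Iterable[str],
                     placeholder: str = "Thinking...") -> Iterator[Tuple[str, str]]:
    """Yield UI events: a placeholder immediately, then a swap to streamed text."""
    # Emitted before the stream is touched, so it does not wait on the model.
    yield ("show_placeholder", placeholder)
    first = True
    for chunk in stream:
        if first:
            yield ("replace_placeholder", chunk)
            first = False
        else:
            yield ("append", chunk)
    if first:
        # The stream ended without producing text; do not leave the placeholder up.
        yield ("clear_placeholder", "")


events = list(with_placeholder(iter(["Sure", ", here", " it is."])))
for event in events:
    print(event)
```

The empty-stream branch matters in practice: a request that errors or returns nothing should not leave "Thinking..." on screen indefinitely.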
Real-World Latency Budgets
| Use case | TTFT budget | Total time budget | Implication |
|---|---|---|---|
| Real-time chat (typing indicator) | under 300ms | under 8s for 500 tokens | Stream + Sonnet/Haiku |
| IDE inline suggestions | under 200ms | under 1s for 100 tokens | Tiny model + speculative decoding |
| Code review on save | under 1s | under 15s for 1500 tokens | Sonnet / Opus, streaming optional |
| Long-form drafting | under 2s | 30-60s acceptable | Opus + streaming hides total time |
| Background batch (CI fixers) | any | any (cost matters) | Cheapest model that works |
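Each row with both a TTFT and a total-time budget implies a ceiling on average ITL, which is worth computing before picking a model. The sketch below takes the budgets straight from the table and treats the whole TTFT budget as spent. `max_itl_ms` is a made-up helper name.

```python
def max_itl_ms(ttft_budget_s: float, total_budget_s: float, output_tokens: int) -> float:
    """Largest average ITL that still meets the total budget after spending the TTFT budget."""
    return (total_budget_s - ttft_budget_s) / output_tokens * 1000


# (TTFT budget s, total budget s, output tokens), copied from the table above.
budgets = {
    "real-time chat": (0.3, 8.0, 500),
    "IDE inline suggestions": (0.2, 1.0, 100),
    "code review on save": (1.0, 15.0, 1500),
}
for use_case, (ttft, total, tokens) in budgets.items():
    print(f"{use_case}: ITL must average under {max_itl_ms(ttft, total, tokens):.1f}ms")
```

The implied ceilings are roughly 15ms, 8ms, and 9ms respectively, so compare them against measured ITL for the model you plan to use rather than against the typical figures alone.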
Decision Framework
- If TTFT matters most: cached system prompt, smaller model, geographic colocation, streaming UI.
- If ITL matters most: smaller model, speculative decoding, quantization, custom inference hardware (Cerebras / Groq) for supported models.
- If total time matters and tail is the issue: route to multiple providers, return first-arriving response. Adds cost but eliminates p99 spikes.
- If cost matters more than latency: pick the most capable model that fits the budget and treat latency optimizations as secondary.
Frequently Asked Questions
What is TTFT in LLM apps?
Time To First Token — the duration from "user clicks send" to "first character of response appears." Typical: 300-1500ms depending on model size, prompt length, and queue depth. Dominated by prompt-processing time (the model building its KV cache before decoding) and any provider-side queueing. Prompt caching cuts TTFT 60-80% on cached requests.
What is ITL in LLM apps?
Inter-Token Latency: the time between successive tokens during decoding. Typical: 10-30ms per token. Determined by model size (bigger = slower), KV cache locality (long context = slower decode), and decoding strategy (greedy is the fastest plain strategy; speculative decoding, where available, is faster still). For a 500-token response, decode time is ITL × 500, and total time is that plus TTFT.
Why does streaming make LLM apps feel faster?
Users react to the first token, not total time. A streaming response that starts at 300ms and completes at 12 seconds feels dramatically faster than a non-streaming response that shows nothing for 12 seconds and then delivers everything at once. The brain interprets streaming as "the AI is responding"; once tokens start, the wait feels productive. Streaming is the single biggest perceived-latency optimization for LLM UX.
Why are Cerebras and Groq so much faster than Claude or GPT?
Custom inference hardware purpose-built for LLM serving. Cerebras has a wafer-scale chip; Groq has Language Processing Units (LPUs). Both achieve ITL of 2-4ms vs 10-15ms on standard GPUs. Trade-off: model selection is limited (mostly Llama family open-weights), and per-token quality is below frontier Claude / GPT / Gemini. For latency-sensitive UX with "good enough" quality, they're the right pick. For frontier-quality, stick with Claude / OpenAI / Gemini.
How can I reduce LLM latency in production?
Five biggest levers: (1) Aggressive prompt caching cuts TTFT 60-80% on cached requests. (2) Smaller model on ITL-bound paths (Sonnet/Haiku instead of Opus). (3) Streaming UI hides TTFT and total time. (4) Speculative decoding (transparent on hosted APIs, configurable in vLLM). (5) Custom inference hardware (Cerebras/Groq) for real-time UX. Combined, these typically take p50 latency from 8-15s on a long response to under 4s.
What's the difference between TTFT and total latency?
TTFT is the time before the first character appears. Total latency is TTFT + (ITL × output length). For user perception, TTFT dominates — a fast TTFT with longer total time often feels better than a slow TTFT with the same total time, because streaming provides immediate feedback. For backend cost / throughput, total latency matters more (it's what the GPU was busy for). Optimize TTFT for UX, total latency for backend efficiency.
Bottom Line
"LLM latency" is three different metrics: TTFT, ITL, and total time. Each has different causes (prompt processing, model size, decoding strategy) and different fixes (caching, smaller models, speculative decoding, custom hardware). For real-time UX, optimize TTFT and use streaming aggressively — first-token timing dominates user perception. For background workloads, optimize total time per request and accept higher TTFT. Measure both before optimizing — most "the LLM is slow" complaints turn out to be specific TTFT issues, ITL issues, or queueing issues, each with a different fix.
Written by
Abhishek Patel
Infrastructure engineer with 10+ years building production systems on AWS, GCP, and bare metal. Writes practical guides on cloud architecture, containers, networking, and Linux for developers who want to understand how things actually work under the hood.