LLM Latency Explained: TTFT, ITL, Streaming UX (2026)

What "Latency" Actually Decomposes Into

For LLM-backed applications, "the API is slow" is not a useful metric. The latency a user sees decomposes into three distinct components, each with different causes and different fixes:

TTFT (Time To First Token): how long from "user clicks send" to "first character appears." Typical: 300-1500ms. Dominated by prompt-processing time, model load, and queue depth.
ITL (Inter-Token Latency): how long between successive tokens. Typical: 10-30ms per token. Determined by model size, decoding speed, and KV cache locality.
Total time: TTFT + (ITL × output_length). For a 500-token response, total is roughly TTFT + 5-15 seconds.
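The total-time formula above can be sketched as a small calculator. The function name and the sample inputs are illustrative choices taken from the typical ranges quoted here, not measurements:

```python
def total_time_s(ttft_ms: float, itl_ms: float, output_tokens: int) -> float:
    """Estimate end-to-end latency in seconds as TTFT + ITL x output_length."""
    return (ttft_ms + itl_ms * output_tokens) / 1000.0


# Best and worst case for a 500-token response, using the typical ranges above.
fast = total_time_s(ttft_ms=300, itl_ms=10, output_tokens=500)
slow = total_time_s(ttft_ms=1500, itl_ms=30, output_tokens=500)
print(f"500 tokens: {fast:.1f}s to {slow:.1f}s")
```

Even in the fast case, decoding accounts for most of the total; TTFT mainly governs how quickly the user sees something.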

Streaming changes the perceived latency dramatically even when total time is identical. Humans react to the first token; once tokens start arriving, the brain interprets it as "the AI is responding" rather than "the AI is still thinking." Two responses with an identical 12-second total time feel completely different if one shows the first token at 300ms and the other shows nothing for 12 seconds. Streaming is the single biggest perceived-latency optimization for LLM apps, and most apps that don't stream are leaving real UX value on the table.

For deeper LLM inference mechanics see LLM inference: tokens, context, and KV cache. This article is the practical playbook for measuring and optimizing TTFT and ITL specifically.

What Drives TTFT

Prompt-Processing Time (the dominant factor)

Before the model can produce the first output token, it processes the entire input prompt — building the KV cache that subsequent decoding will use. Time scales with: prompt length × model size. A 100K-token prompt processed by Opus 4.7 takes roughly 600-900ms before first token; the same prompt on Haiku 4.5 takes 200-300ms.

This is why LLM prompt caching is so impactful for TTFT — a cached prefix skips most of the prompt-processing time. Anthropic's 90% cache discount isn't just a cost saving; it's also a TTFT optimization that often reduces TTFT by 60-80% on cached requests.
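As a minimal sketch of what marking a prefix as cacheable can look like with Anthropic's Messages API, the helper below builds request arguments with a `cache_control` marker on the system prompt. The helper name and prompt text are hypothetical, the model ID is the one used elsewhere in this article, and providers enforce a minimum cacheable prefix length, so check the current documentation before relying on this shape:

```python
def build_cached_request(system_prompt: str, user_message: str) -> dict:
    """Build Messages API kwargs whose system prompt is marked as a cacheable prefix."""
    return {
        "model": "claude-sonnet-4-6",
        "max_tokens": 500,
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                # Marks everything up to and including this block as cacheable.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": user_message}],
    }


request = build_cached_request(
    system_prompt="You are a support assistant. <long, stable policy text goes here>",
    user_message="Where is my order?",
)
# With the SDK installed and an API key configured, this would be sent as:
#   Anthropic().messages.create(**request)
```

The design point is to keep the long, stable part of the prompt first and the per-request part last, so repeated requests share the same cached prefix.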

Model Load and Cold Start

For self-hosted inference, the first request to a freshly loaded model pays a cold-start penalty (200-2000ms) for KV cache initialization, weight paging, and GPU memory layout. Hosted APIs (Anthropic, OpenAI, Gemini) hide this by keeping models warm, but for self-hosted vLLM / TGI / Triton deployments, cold start is a real concern. Either keep models always-warm or accept the latency on the first request after idle.

Queue Depth (the variable factor)

Hosted APIs have queues. If the provider's serving infrastructure is busy, your request waits. This is the source of high TTFT variance — p50 might be 400ms but p99 reaches 3000ms during peak hours. The provider's batch-decoding scheduler interleaves requests; if they're all decoding long outputs, your request waits longer for a slot.

For latency-sensitive UX, route to multiple providers and pick the fastest live response. For cost-sensitive workloads, accept the variance.
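Racing providers can be sketched with `asyncio`. The provider coroutines here are simulated with sleeps and the names are hypothetical; a real version would wrap each provider's client call and would also need to handle a first-finisher that fails:

```python
import asyncio


async def fake_provider(name: str, delay_s: float) -> str:
    """Stand-in for a real provider call; sleeps to simulate latency."""
    await asyncio.sleep(delay_s)
    return name


async def race(calls):
    """Return the result of whichever call finishes first and cancel the rest."""
    tasks = [asyncio.create_task(c) for c in calls]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()
    # Let cancellations settle so no task is left running in the background.
    await asyncio.gather(*pending, return_exceptions=True)
    # Note: if the first finisher raised, .result() re-raises that exception.
    return next(iter(done)).result()


winner = asyncio.run(
    race([fake_provider("provider_a", 0.05), fake_provider("provider_b", 0.01)])
)
print(winner)
```

Every raced request is billed even when its response is discarded, which is the cost side of this trade-off.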

What Drives ITL

Model Size

Bigger models decode slower. Each token requires a forward pass through the entire model. Approximate ITL by model size on H100-class hardware:

Haiku 4.5 / Sonnet 4.6: 8-12ms ITL
Opus 4.7: 12-15ms ITL
GPT-5 / GPT-5.4: 10-15ms ITL
Gemini 3.1 Pro: 12-18ms ITL (varies more by region)
DeepSeek V4 (1T MoE, 32B active): 20-30ms ITL — the routing overhead between experts adds compute
Qwen 3.5 32B (self-hosted on H100): 8-12ms ITL

KV Cache Locality

For long sequences, KV cache fits in GPU memory but accesses are bandwidth-limited. As context grows past ~32K tokens, memory bandwidth becomes the bottleneck (not compute), and ITL grows. A 1M-token context decodes meaningfully slower than a 1K-token context — typically 2-4x slower per token.

For Gemini 3.1 Pro at 1.5M context, ITL stretches from ~12ms (short context) to ~40ms (max context). For interactive use, keep context under 200-400K tokens to maintain reasonable ITL.
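To see why long context becomes a bandwidth problem, the standard KV cache size estimate can be worked through. The layer count, KV head count, and head dimension below are a hypothetical configuration, not the published architecture of any model named in this article:

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """KV cache size for one sequence: keys and values for every layer and position."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical model: 64 layers, 8 KV heads (grouped-query attention), head_dim 128, FP16.
CONFIG = dict(num_layers=64, num_kv_heads=8, head_dim=128)

for seq_len in (1_000, 32_000, 1_000_000):
    gib = kv_cache_bytes(seq_len=seq_len, **CONFIG) / 2**30
    print(f"{seq_len:>9,} tokens -> {gib:8.2f} GiB of KV cache")
```

Each decode step reads the entire cache for that sequence, so the bytes moved per token grow linearly with context length while the compute per token stays roughly constant.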

Decoding Strategy

Greedy / temp 0: fastest decoding (single token per pass)
Beam search: 2-4x slower (multiple parallel candidates) — rarely used in 2026
Speculative decoding: a small "draft" model proposes tokens, the big model verifies. 2-3x ITL improvement on average. Many providers use it transparently. See "speculative decoding" below.
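The propose-and-verify loop of speculative decoding can be illustrated with a toy, greedy-only sketch. The two "models" are hypothetical functions over integer tokens, and the verification loop calls the target once per position for readability, whereas a real system scores all proposed positions in a single forward pass:

```python
from typing import Callable, List

Model = Callable[[List[int]], int]  # maps a token prefix to the next token (greedy)


def speculative_step(prefix: List[int], draft: Model, target: Model, k: int) -> List[int]:
    """Run one propose/verify round and return the tokens accepted this round."""
    # 1. The draft model proposes k tokens autoregressively.
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The target model verifies; keep the longest matching prefix.
    accepted, ctx = [], list(prefix)
    for tok in proposed:
        expected = target(ctx)
        if expected != tok:
            accepted.append(expected)  # first mismatch: take the target's token, stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target(ctx))  # all accepted: the target adds one bonus token
    return accepted


def target(ctx: List[int]) -> int:
    """Toy target model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def draft(ctx: List[int]) -> int:
    """Toy draft model: agrees with the target except right after token 2."""
    return 0 if ctx[-1] == 2 else (ctx[-1] + 1) % 10


print(speculative_step([0], draft, target, k=4))
```

In this toy run the round yields three tokens for one verification round, and the output is identical to what the target alone would have produced, which is the property that makes the technique safe to apply transparently.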

Provider Differences That Matter

Provider	Typical TTFT	Typical ITL	Notes
Anthropic Sonnet 4.6	~280ms	~10ms	Lowest TTFT for typical Sonnet workload in 2026
Anthropic Opus 4.7	~450ms	~14ms	Bigger model, slightly higher TTFT and ITL
OpenAI GPT-5.4	~380ms	~12ms	Comparable to Sonnet
Gemini 3.1 Pro (Vertex AI)	~520ms	~14ms	Slightly higher TTFT due to Google's auth overhead
Cerebras (Llama 4 405B)	~150ms	~2-3ms	Wafer-scale chip, dramatically lower ITL
Groq (Llama 4)	~140ms	~3-4ms	Custom LPU hardware, very low ITL
Self-hosted vLLM Qwen 3.5 32B (1× H100)	~250ms	~10ms	Comparable to hosted, depends on tuning

Cerebras and Groq stand out because they're purpose-built inference hardware (CS-2 wafer-scale, LPU respectively) optimized specifically for low ITL. The trade-off: model selection is limited (mostly Llama family open-weights), and quality-per-token is below frontier-tier Claude / GPT / Gemini. For real-time UX where ITL is the bottleneck and "good enough" beats "best," they're the right pick. For frontier-quality work, use Anthropic / OpenAI / Gemini.

How to Measure TTFT and ITL

vLLM /metrics Endpoint

import requests
metrics = requests.get('http://localhost:8000/metrics').text

# Look for these Prometheus metrics:
# vllm:time_to_first_token_seconds (histogram)
# vllm:time_per_output_token_seconds (histogram)
# vllm:request_prompt_tokens (histogram)
# vllm:e2e_request_latency_seconds (histogram)
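Prometheus histograms expose `_sum` and `_count` series alongside their buckets, so a mean can be read straight from the text returned above. The following is a rough sketch: the sample payload and the `model_name` label value are made up for illustration, and the parser assumes lines carry no trailing timestamp:

```python
def mean_from_histogram(metrics_text: str, metric: str) -> float:
    """Mean of a Prometheus histogram, computed as sum / count across all label sets."""
    total = 0.0
    count = 0.0
    for line in metrics_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        if name == f"{metric}_sum":
            total += float(value)
        elif name == f"{metric}_count":
            count += float(value)
    return total / count if count else float("nan")


# Illustrative payload in Prometheus text exposition format (values are invented).
SAMPLE = """# TYPE vllm:time_to_first_token_seconds histogram
vllm:time_to_first_token_seconds_sum{model_name="demo"} 12.5
vllm:time_to_first_token_seconds_count{model_name="demo"} 50.0
"""

mean_ttft = mean_from_histogram(SAMPLE, "vllm:time_to_first_token_seconds")
print(f"mean TTFT: {mean_ttft * 1000:.0f}ms")
```

A mean hides tail behavior; for p50 and p99, the `_bucket` series queried through Prometheus itself is the better tool.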

OpenAI / Anthropic SDK Timing

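import time
from anthropic import Anthropic

client = Anthropic()
start = time.perf_counter()  # monotonic clock; time.time() can jump

ttft = None
last_time = None
gaps = []  # seconds between successive streamed text chunks

with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=500,
    messages=[{"role": "user", "content": "Hello"}],
) as stream:
    # text_stream yields text chunks, which may hold more than one token,
    # so these gaps approximate ITL rather than measure it exactly.
    for text in stream.text_stream:
        now = time.perf_counter()
        if ttft is None:
            ttft = now - start
        else:
            gaps.append(now - last_time)
        last_time = now

if ttft is None:
    print("No text received")
else:
    # Guard against single-chunk responses, which have no gaps to measure.
    itl_p50 = sorted(gaps)[len(gaps) // 2] if gaps else float("nan")
    print(f"TTFT: {ttft * 1000:.0f}ms, ITL p50: {itl_p50 * 1000:.0f}ms")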

Custom Logger Middleware (for OpenAI-compatible APIs)

Wrap the OpenAI SDK with a streaming-aware middleware that records first-token-arrival and subsequent inter-token deltas. Many production stacks use this with Datadog or Honeycomb tracing. For broader observability patterns see AI observability.
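One way to structure such a middleware is a provider-agnostic wrapper around any iterator of streamed chunks. This is a sketch under assumptions: the `StreamTimer` class is hypothetical, the clock is injectable so it can be tested without a network call, and exporting the numbers to a tracing backend is left out:

```python
import time
from typing import Callable, Iterable, Iterator, List, Optional


class StreamTimer:
    """Records TTFT and inter-chunk gaps while passing chunks through unchanged."""

    def __init__(self, clock: Callable[[], float] = time.perf_counter):
        self._clock = clock
        self._start = clock()  # create the timer right before sending the request
        self._last: Optional[float] = None
        self.ttft: Optional[float] = None
        self.gaps: List[float] = []

    def wrap(self, chunks: Iterable[str]) -> Iterator[str]:
        for chunk in chunks:
            now = self._clock()
            if self.ttft is None:
                self.ttft = now - self._start
            else:
                self.gaps.append(now - self._last)
            self._last = now
            yield chunk


# Demo with a fake clock and a fake stream, so no API call is needed.
ticks = iter([0.0, 0.40, 0.41, 0.43])
timer = StreamTimer(clock=lambda: next(ticks))
received = list(timer.wrap(["Hel", "lo", "!"]))
print(received, timer.ttft, timer.gaps)
```

In production the wrapped iterable would be the SDK's stream object, and `ttft` and `gaps` would be attached to the request's trace span once the stream ends.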

How to Optimize TTFT

Aggressive prompt caching: 60-80% TTFT improvement on cached requests. The single highest-leverage optimization. See Anthropic vs OpenAI prompt caching for provider details.
Smaller model for TTFT-bound calls: if your workload is "many short responses where TTFT matters more than depth," route to Sonnet / Haiku instead of Opus. The TTFT difference is often 100-200ms.
Geographic colocation: API calls from the same region as the provider have ~40-80ms lower TTFT than cross-region. For India see India cloud latency.
Connection reuse and HTTP/2: avoid TLS handshake on every request. Use HTTP/2 or HTTP/3 for multiplexing. The SDK handles this; verify if you've written your own client.
Pre-warmed sessions: for self-hosted vLLM, keep the GPU warm with periodic dummy requests during off-hours. Cold-start penalty is real.

How to Optimize ITL

Smaller model for ITL-bound use cases: a Sonnet 4.6 response at 10ms/token completes a 500-token output in 5 seconds. An Opus 4.7 response at 14ms/token takes 7 seconds. For real-time UX where ITL matters more than depth, prefer smaller.
Speculative decoding: a small draft model proposes 4-8 tokens, the big model verifies. If the big model accepts them all, you get 4-8 tokens for roughly the cost of one big-model forward pass; rejected tokens are discarded and decoding resumes from the first mismatch. 2-3x ITL improvement. Most hosted providers use it transparently; self-hosted setups need explicit configuration in vLLM 0.6+.
Quantization: Q8 → Q4 → Q3 reduces model size, which reduces memory-bandwidth bottleneck, which reduces ITL. For self-hosted: see Qwen 3.5 GGUF quantization and KV cache quantization.
Custom inference hardware: Cerebras / Groq for ~3x ITL improvement on supported models. Not for every workload but transformative for real-time UX.
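Batch decoding: serving multiple requests in parallel through a shared GPU raises throughput and lowers cost per token through better hardware utilization, but it does not lower per-request ITL; larger batches usually raise it somewhat. Self-hosted vLLM does this automatically with continuous batching, and hosted APIs do too. If ITL is the priority on a self-hosted deployment, cap the batch size rather than maximizing it.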
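The quantization point above follows from a simple lower bound: at small batch sizes each decoded token must stream all model weights from GPU memory once, so ITL cannot be lower than weight bytes divided by memory bandwidth. The sketch below assumes a dense 32B-parameter model and roughly 3.35 TB/s of memory bandwidth (an H100 SXM-class figure); it ignores KV cache reads, kernel overheads, and multi-GPU sharding:

```python
def min_itl_ms(params_billion: float, bits_per_weight: float, bandwidth_tb_s: float) -> float:
    """Bandwidth-bound lower limit on ITL: time to read every weight once."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes / (bandwidth_tb_s * 1e12) * 1000


# Assumed figures: 32B dense parameters, ~3.35 TB/s memory bandwidth.
for label, bits in (("Q8", 8), ("Q4", 4), ("Q3", 3)):
    print(f"{label}: >= {min_itl_ms(32, bits, 3.35):.1f} ms per token")
```

Halving the bits per weight halves this bound, which is why quantization helps ITL even when compute is not the limiting factor.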

The UX Pattern: Streaming + Early Filler

Sub-200ms TTFT is hard. A widely-used UX pattern: while waiting for the LLM's first token, show a loading state or "thinking..." text immediately, then swap to streamed output as soon as it arrives. This makes an 800ms TTFT feel far shorter because the user gets immediate visual feedback.

Some apps go further: pre-generate the first 1-2 sentences with a smaller model (50-100ms TTFT) and stream those while the larger model produces the full response in the background. The first chunk feels instant; the full response is high-quality. The trick is preserving coherence across the swap.
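The basic placeholder-then-stream pattern can be sketched as an async generator that emits a placeholder event immediately and then forwards model tokens. The event shape and the simulated stream are hypothetical; a real UI would replace the placeholder when the first "token" event arrives:

```python
import asyncio
from typing import AsyncIterator, Tuple


async def fake_llm_stream() -> AsyncIterator[str]:
    """Stand-in for a model stream; the sleep simulates TTFT."""
    await asyncio.sleep(0.05)
    for token in ("Hello", ", ", "world"):
        yield token


async def with_placeholder(
    stream: AsyncIterator[str], placeholder: str = "thinking..."
) -> AsyncIterator[Tuple[str, str]]:
    """Yield a placeholder event at once, then forward tokens as they arrive."""
    yield ("placeholder", placeholder)
    async for token in stream:
        yield ("token", token)


async def collect():
    return [event async for event in with_placeholder(fake_llm_stream())]


events = asyncio.run(collect())
print(events)
```

Tagging events by kind keeps the swap logic in the UI layer, so the same wrapper works whether the placeholder is text, a spinner, or a typing indicator.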

Real-World Latency Budgets

Use case	TTFT budget	Total time budget	Implication
Real-time chat (typing indicator)	under 300ms	under 8s for 500 tokens	Stream + Sonnet/Haiku
IDE inline suggestions	under 200ms	under 1s for 100 tokens	Tiny model + speculative decoding
Code review on save	under 1s	under 15s for 1500 tokens	Sonnet / Opus, streaming optional
Long-form drafting	under 2s	30-60s acceptable	Opus + streaming hides total time
Background batch (CI fixers)	any	any (cost matters)	Cheapest model that works
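A budget from this table can be checked against a candidate's typical TTFT and ITL before committing to it. The sketch below encodes two rows of the table and uses the approximate figures from the provider comparison earlier in this article as inputs; the function and dictionary names are illustrative:

```python
# use case -> (TTFT budget ms, total budget ms, output tokens), from the table above
BUDGETS = {
    "real-time chat": (300, 8000, 500),
    "IDE inline suggestions": (200, 1000, 100),
}


def meets_budget(use_case: str, ttft_ms: float, itl_ms: float) -> bool:
    """True if both the TTFT budget and the total-time budget are met."""
    max_ttft, max_total, tokens = BUDGETS[use_case]
    total_ms = ttft_ms + itl_ms * tokens
    return ttft_ms < max_ttft and total_ms < max_total


# Approximate figures from the provider table: (TTFT ms, ITL ms)
candidates = {"Sonnet 4.6": (280, 10), "Groq (Llama 4)": (140, 3.5)}

for name, (ttft, itl) in candidates.items():
    for use_case in BUDGETS:
        verdict = "ok" if meets_budget(use_case, ttft, itl) else "over budget"
        print(f"{name:15} {use_case:24} {verdict}")
```

Typical values only show whether a budget is plausible; the same check should be repeated with measured p95 or p99 figures before relying on it.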

Decision Framework

If TTFT matters most: cached system prompt, smaller model, geographic colocation, streaming UI.
If ITL matters most: smaller model, speculative decoding, quantization, custom inference hardware (Cerebras / Groq) for supported models.
If total time matters and tail is the issue: route to multiple providers, return first-arriving response. Adds cost but eliminates p99 spikes.
If cost matters more than latency: cheapest model that meets the quality bar; latency optimizations second.

Frequently Asked Questions

What is TTFT in LLM apps?

Time To First Token — the duration from "user clicks send" to "first character of response appears." Typical: 300-1500ms depending on model size, prompt length, and queue depth. Dominated by prompt-processing time (the model building its KV cache before decoding) and any provider-side queueing. Prompt caching cuts TTFT 60-80% on cached requests.

What is ITL in LLM apps?

Inter-Token Latency: the time between successive tokens during decoding. Typical: 10-30ms per token. Determined by model size (bigger = slower), KV cache locality (long context = slower decode), and decoding strategy (greedy is the fastest plain strategy; speculative decoding layered on top is faster still). For a 500-token response, total time is TTFT plus ITL × 500.

Why does streaming make LLM apps feel faster?

Humans react to the first token, not total time. A streaming response that starts at 300ms and completes at 12 seconds feels dramatically faster than a non-streaming response that shows nothing for 12 seconds and then delivers everything at once. The brain interprets streaming as "the AI is responding": once tokens start, the wait feels productive. Streaming is the single biggest perceived-latency optimization for LLM UX.

Why is Cerebras / Groq so much faster than Claude or GPT?

Custom inference hardware purpose-built for LLM serving. Cerebras has a wafer-scale chip; Groq has Language Processing Units (LPUs). Both achieve ITL of 2-4ms vs 10-15ms on standard GPUs. Trade-off: model selection is limited (mostly Llama family open-weights), and per-token quality is below frontier Claude / GPT / Gemini. For latency-sensitive UX with "good enough" quality, they're the right pick. For frontier-quality, stick with Claude / OpenAI / Gemini.

How can I reduce LLM latency in production?

Five biggest levers: (1) Aggressive prompt caching cuts TTFT 60-80% on cached requests. (2) Smaller model on ITL-bound paths (Sonnet/Haiku instead of Opus). (3) Streaming UI hides TTFT and total time. (4) Speculative decoding (transparent on hosted APIs, configurable in vLLM). (5) Custom inference hardware (Cerebras/Groq) for real-time UX. Combined, these typically take p50 latency from 8-15s on a long response to under 4s.

What's the difference between TTFT and total latency?

TTFT is the time before the first character appears. Total latency is TTFT + (ITL × output length). For user perception, TTFT dominates — a fast TTFT with longer total time often feels better than a slow TTFT with the same total time, because streaming provides immediate feedback. For backend cost / throughput, total latency matters more (it's what the GPU was busy for). Optimize TTFT for UX, total latency for backend efficiency.

Bottom Line

"LLM latency" is three different metrics: TTFT, ITL, and total time. Each has different causes (prompt processing, model size, decoding strategy) and different fixes (caching, smaller models, speculative decoding, custom hardware). For real-time UX, optimize TTFT and use streaming aggressively, because first-token timing dominates user perception. For background workloads, optimize total time per request and accept higher TTFT. Measure TTFT and ITL separately before optimizing: most "the LLM is slow" complaints turn out to be specific TTFT issues, ITL issues, or queueing issues, each with a different fix.
