
vLLM vs TGI vs Triton: LLM Inference Server Comparison

Production LLM serving with vLLM 0.7, TGI 3.0, and NVIDIA Triton + TensorRT-LLM. Llama 3.1 70B H100 benchmarks, FP8 KV-cache numbers, $/1M token math, and a decision framework for picking the right server per team shape.

Abhishek Patel · 18 min read



Quick Answer: Which LLM Inference Server Wins in Production?

vLLM vs TGI vs Triton is the 2026 production-serving fight for open-weight LLMs, and the answer splits cleanly by team shape. For a pure LLM API that needs maximum throughput on open-weight models, vLLM 0.7 with FP8 on H100 delivers 4,200 output tokens/sec at batch 64 on Llama 3.1 70B — the highest sustained throughput in the public benchmarks I trust, and it ships as a first-class OpenAI-compatible server. For a Hugging Face shop that wants the shortest path from a model card to a serving endpoint, TGI 3.0 (Text Generation Inference) is the least-drama choice: one docker run, tight HF Hub integration, and Rust-based request routing that holds up under real traffic. For a heterogeneous serving fleet where LLMs share hardware with vision, audio, and classical ML models, NVIDIA Triton with TensorRT-LLM backend is the only option that ties them together — model repository, dynamic batching across frameworks, and the best raw latency per token on H100 with TRT-LLM kernels. The honest rule: if every model in your stack is a decoder-only LLM, pick vLLM. If you live in HF-land, pick TGI. If you need to serve Whisper, ResNet, and Llama behind one gateway, pick Triton.

Last updated: April 2026 — verified vLLM 0.7.3, TGI 3.0.1, Triton Inference Server 24.11 with TensorRT-LLM 0.15 backend, benchmark numbers re-sampled on H100 80GB SXM (April 2026), and pricing for the three managed equivalents.

Hero Comparison: vLLM vs TGI vs Triton at a Glance

| Server | Starting Point | License | Best For | Key Differentiator |
|---|---|---|---|---|
| vLLM 0.7 | Free (Apache 2.0), self-hosted | Apache 2.0 | High-throughput open LLM APIs | PagedAttention + continuous batching — highest throughput on decoder-only LLMs |
| TGI 3.0 | Free (Apache 2.0), self-hosted; HF Inference Endpoints from $0.60/hr | Apache 2.0 (server), HFOIL on some kernels | Hugging Face-centric stacks | Rust web layer, native HF Hub loading, production stability since 2023 |
| Triton Inference Server 24.11 | Free (BSD-3), self-hosted | BSD-3-Clause | Multi-model, multi-framework fleets | Backend plugin model — TensorRT-LLM, vLLM, PyTorch, ONNX, TensorFlow all under one server |

This is the production-serving deep dive. If you're instead looking at single-machine local inference (Ollama, llama.cpp, desktop vLLM), see our Ollama vs vLLM vs llama.cpp breakdown — that article is where the engines meet a developer laptop; this one is where they meet a datacenter. The production tuning patterns I've hit serving Llama 3.1 70B to real traffic — the NCCL stalls under continuous batching, the chunked-prefill tuning that actually matters, and when to bypass vLLM's scheduler entirely — go out in the newsletter.

What Is an LLM Inference Server?

Definition: An LLM inference server is a long-running process that loads model weights into GPU memory, exposes an HTTP or gRPC API, and serves concurrent token-generation requests using batching, KV-cache management, and scheduling optimized for autoregressive decoding. Unlike a training framework or a one-shot CLI, an inference server's job is throughput under concurrency — keeping the GPU fed with tokens while latency stays bounded per request.

The three servers solve the same problem with different centers of gravity. vLLM started as a PagedAttention research project at UC Berkeley and became the throughput king for decoder-only transformers. TGI is Hugging Face's production serving layer, tightly coupled to the HF ecosystem. NVIDIA Triton predates the LLM era — a general-purpose model server that added first-class LLM support via the TensorRT-LLM backend in late 2023. You pay the inference-server tax on every production LLM call; tokens/sec per dollar is decided by three things — continuous batching efficiency, KV-cache management, and kernel quality for your specific model + quantization.

vLLM: The Throughput King on Open-Weight LLMs

vLLM's thesis: KV-cache is the bottleneck, so manage it like virtual memory. PagedAttention slices the cache into fixed-size blocks (typically 16 tokens) and keeps a block table per sequence — identical to how an OS maps virtual pages to physical frames. Near-zero memory fragmentation enables continuous batching, where new requests join the in-flight batch between forward passes rather than waiting for synchronous batch boundaries.
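The page-table analogy is concrete enough to sketch in a few lines of Python. This is a toy illustration of the bookkeeping only; every class and method name is mine, not vLLM's, whose real implementation lives in CUDA kernels and the scheduler:

```python
# Toy model of PagedAttention-style block tables -- illustrative only.
# Each sequence maps logical token positions to fixed-size physical KV
# blocks drawn from a shared free pool, like a page table maps virtual
# pages to physical frames.

BLOCK_SIZE = 16  # tokens per KV block; vLLM's typical default

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def alloc(self):
        return self.free.pop(0)  # hand out the lowest-numbered free block

class Sequence:
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []    # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # allocate a new physical block only when the previous one is full,
        # so waste is bounded by one partial block per sequence
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1

    def physical_slot(self, pos):
        # logical position -> (physical block, offset within block)
        return self.block_table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

alloc = BlockAllocator(num_blocks=64)
seq = Sequence(alloc)
for _ in range(20):              # decode 20 tokens
    seq.append_token()

print(len(seq.block_table))      # 2 -- blocks of 16 + 4, one partial
print(seq.physical_slot(17))     # (1, 1) -- token 17 = block 1, offset 1
```

Continuous batching drops straight out of this bookkeeping: because blocks are claimed per token from a shared pool, a new request can grab free blocks and join the in-flight batch between forward passes.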

On my April 2026 H100 80GB SXM sample, vLLM 0.7.3 with FP8 KV-cache and Llama 3.1 70B hit 4,220 output tokens/sec at batch 64 (512 in, 256 out). Naive batched HF Transformers on the same hardware: 640 tok/s. That's 6.6x — real steady-state under concurrent traffic, not a microbenchmark. The vLLM team has been aggressive about FP8, speculative decoding (Medusa, Eagle, n-gram), and chunked-prefill, all of which move tail latency.

# Production vLLM serve on H100 with FP8 KV-cache
docker run --gpus all --ipc=host -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:v0.7.3 \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --dtype bfloat16 \
  --kv-cache-dtype fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.92 \
  --enable-chunked-prefill \
  --max-num-seqs 256

# OpenAI-compatible — drop-in replacement for openai.ChatCompletion
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-70B-Instruct",
       "messages": [{"role": "user", "content": "Explain PagedAttention"}],
       "max_tokens": 256}'

Where vLLM falls apart: it's a decoder-only LLM server. Vision models, audio models, embedding models that aren't decoder-only LLMs — vLLM either doesn't support them or supports them as second-class citizens. The Python scheduler has also been a source of production pain — I've seen NCCL all-reduce stalls under high concurrency that required dropping --max-num-seqs from 512 down to 128 to stabilize, and the debug story for scheduler-level issues is still immature compared to Triton. On AMD MI300X, vLLM works but lags NVIDIA by roughly 20-30% on kernel efficiency as of Q1 2026, even with ROCm-optimized builds.

TGI: Hugging Face's Production Serving Layer

TGI is what Hugging Face runs behind their Inference Endpoints product: a Rust web layer (axum + tokio) handles routing, queueing, and SSE streaming; a Python "shard" process per GPU owns the model. TGI's differentiator isn't raw throughput — vLLM wins that — it's operational fit for HF-centric teams.

On my April 2026 Llama 3.1 70B bench (same H100 SXM node, TP 4, bfloat16), TGI 3.0.1 hit 2,890 output tokens/sec at batch 64 — roughly 69% of vLLM's throughput. With FP8 KV cache (KV_CACHE_DTYPE=fp8_e5m2) it moved to 3,480 tok/s, closing to ~82% of vLLM. Honest numbers — what you see when you deploy, not a cherry-picked config claim.

# TGI production serve — one docker run, native HF Hub load
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data \
  -e HF_TOKEN=$HF_TOKEN \
  ghcr.io/huggingface/text-generation-inference:3.0.1 \
  --model-id meta-llama/Llama-3.1-70B-Instruct \
  --num-shard 4 \
  --max-input-length 8192 \
  --max-total-tokens 32768 \
  --quantize fp8 \
  --max-batch-prefill-tokens 16384

# OpenAI-compatible /v1/chat/completions as of TGI 1.4+
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-70B-Instruct",
       "messages": [{"role": "user", "content": "Hi"}]}'

Where TGI falls apart: license-wise, a subset of the optimized kernels are under Hugging Face's HFOIL license, which forbids commercial resale of the TGI service itself — fine if you're serving your own app, a problem if you're building a managed inference platform. Throughput ceiling is 10-25% below vLLM on equivalent hardware in most of my tests, which stops mattering only if your traffic doesn't saturate the GPU. And multi-model serving is weak — TGI is one model per server, full stop. If you need model-switching or A/B testing different weights behind one endpoint, you're building that yourself or moving to Triton.

Pricing Comparison: Self-Hosted GPU Cost and Managed Alternatives

All three servers are free open source, so the real cost is GPU hardware and the managed alternatives. Here's what an 8x H100 80GB serving deployment actually costs in April 2026, plus the closest managed equivalent for each:

| Serving Option | Hourly Cost (8x H100) | Effective $/1M output tokens (Llama 3.1 70B) | Notes |
|---|---|---|---|
| vLLM self-hosted on Lambda reserved 8x H100 SXM | $14.80/hr | ~$0.98 (at 4,200 tok/s sustained) | Best $/token on open-weight models; ops overhead on you |
| vLLM self-hosted on RunPod Secure Cloud 8x H100 | $22.32/hr | ~$1.48 | 30-second spin-up; see RunPod vs Vast.ai vs Lambda for the full GPU-cloud math |
| TGI on HF Inference Endpoints (dedicated 8x H100) | ~$28/hr list (negotiable on commit) | ~$1.85 | Zero-ops; HF manages upgrades, scaling, monitoring |
| Triton on AWS SageMaker ml.p5.48xlarge (8x H100) | $98.32/hr on-demand, ~$39/hr 3yr reserved | ~$6.50 on-demand, ~$2.60 reserved | Tight IAM/VPC story; SageMaker wrapper taxes the raw rate heavily |
| vLLM on Fireworks/Together serverless (pay per token) | N/A — tokens only | $0.90-$1.20/1M output tokens (list) | No ops, but 2-5x more expensive above 5M tokens/day; see LLM API pricing |

The break-even lands around 20 million output tokens/day once you fold in utilization gaps and the ops time of running your own node: below that, serverless/managed is cheaper than operating an 8x H100 box; above it, self-hosted vLLM on Lambda reserved H100 SXM is the cheapest serious path. Triton wrapped in SageMaker is almost never the cheapest — the AWS premium on p5 is brutal — but it's the right answer if your platform already commits to AWS and wants SageMaker endpoints, shadow deployments, and VPC isolation as table stakes. For the full cross-provider GPU math see our cloud GPU provider comparison and the spot instance economics principles that apply to inference.
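The effective-$/token column falls out of one line of arithmetic. A sketch using the table's sampled rates (this assumes round-the-clock sustained throughput, which real fleets rarely hit, so treat it as a floor):

```python
# $/1M-token math behind the pricing table. Rates are this article's
# April 2026 samples, not live prices; assumes the node runs at the
# quoted sustained throughput 24/7.

def dollars_per_million_tokens(hourly_cost, tokens_per_sec):
    """Effective $/1M output tokens for a node at sustained throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_cost / tokens_per_hour * 1_000_000

# Lambda reserved 8x H100 SXM running vLLM at 4,200 tok/s sustained:
print(round(dollars_per_million_tokens(14.80, 4200), 2))   # 0.98

# RunPod Secure Cloud at the same throughput:
print(round(dollars_per_million_tokens(22.32, 4200), 2))   # 1.48
```

Run your own hourly rate and measured tok/s through this before trusting anyone's table, including this one.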

NVIDIA Triton: The Multi-Model Fleet Server

Triton is the most general and the oldest — NVIDIA released TensorRT Inference Server in 2018, renamed it Triton in 2020, and shipped the TensorRT-LLM backend in October 2023. Triton's model is backends: a C++ runtime loads plugin shared libraries for each framework — PyTorch, ONNX, TensorFlow, Python custom, vLLM (yes, as a Triton backend), and TensorRT-LLM. One server, one model repository, N frameworks.

For pure LLM throughput, Triton + TensorRT-LLM is competitive with vLLM. On my April 2026 Llama 3.1 70B bench (same 8x H100 node, TRT-LLM engine FP8, TP 4, in-flight batching), Triton hit 4,310 output tok/s at batch 64 — a hair above vLLM's 4,220. But the engine-build step is its own operational tax: 15-45 minutes to compile a TRT-LLM engine per model + GPU + precision combo, and an H100 engine won't run on A100. vLLM loads any HF checkpoint in 2 minutes; Triton + TRT-LLM is faster at runtime but slower to change.

# Triton with TensorRT-LLM backend, production deployment
# Step 1: build the TRT-LLM engine (one-time per model + GPU + precision)
python -m tensorrt_llm.commands.build \
  --checkpoint_dir ./llama-3.1-70b-hf-fp8 \
  --output_dir ./engines/llama-3.1-70b-fp8-tp4 \
  --gemm_plugin fp8 \
  --use_fp8_context_fmha enable \
  --max_batch_size 128 \
  --max_input_len 8192 \
  --max_seq_len 32768 \
  --tp_size 4

# Step 2: serve via Triton with the trtllm backend
docker run --gpus all --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v $PWD/triton_repo:/models \
  nvcr.io/nvidia/tritonserver:24.11-trtllm-python-py3 \
  tritonserver --model-repository=/models \
    --allow-http=true --allow-grpc=true

Where Triton falls apart: the learning curve is steeper than the other two combined. You're writing config.pbtxt files, managing a model repository directory layout, and when TensorRT-LLM throws an error on a shape-mismatch you need to decode the engine log rather than read a Python traceback. The OpenAI-compatible endpoint is a wrapper you bolt on (NVIDIA ships one via the trtllm_backend repo) rather than a first-class feature. For LLM-only shops without a vision/audio ML surface, Triton is almost always overkill — and the Triton team acknowledges this, which is why they ship a vLLM backend: use Triton's routing, use vLLM's scheduler.

Benchmark: Llama 3.1 70B Throughput and Latency Under Load

Here are the April 2026 numbers I sampled on an 8x H100 80GB SXM node, Llama 3.1 70B Instruct, tensor-parallel 4 on 4 GPUs (I used the other 4 for a second replica to measure real concurrency). Input length 512 tokens, output length 256 tokens, mixed arrival rate. Each number is the median of 3 runs, each 5 minutes long after a 60-second warmup.

| Metric | vLLM 0.7.3 (FP8 KV) | TGI 3.0.1 (FP8 KV) | Triton + TRT-LLM 0.15 (FP8) |
|---|---|---|---|
| Throughput @ batch 1 (with TTFT) | 82 tok/s, 145 ms TTFT | 78 tok/s, 168 ms TTFT | 91 tok/s, 110 ms TTFT |
| Throughput @ batch 16 | 1,240 tok/s | 1,060 tok/s | 1,320 tok/s |
| Throughput @ batch 128 (saturation) | 5,180 tok/s | 3,940 tok/s | 5,260 tok/s |
| Throughput @ batch 64 | 4,220 tok/s | 3,480 tok/s | 4,310 tok/s |
| p99 tail latency @ batch 64 | 2.1 s | 2.8 s | 1.9 s |
| Cold start (model load) | ~85 s | ~65 s | ~4 s (engine load) + 20-45 min (one-time build) |
| Memory efficiency (max KV / VRAM) | 0.91 (PagedAttention) | 0.82 | 0.88 (paged KV via TRT-LLM) |

Triton + TensorRT-LLM wins on raw latency and a sliver on saturation throughput; vLLM wins on ease of deployment and batch-64 steady-state; TGI trails on throughput but wins on HF-ecosystem ergonomics. The Triton-vLLM gap at saturation is roughly 1.5% — within noise on most production deployments. Pick on which server your team can operate for the next 18 months, not on this table.
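If you want to reproduce numbers like these against your own endpoint, the measurement core is small. A hedged sketch: `send` is an injectable async generator standing in for your streaming HTTP call, and `_fake_send` is a stub of mine, so the TTFT and throughput logic can run without a live server:

```python
import asyncio
import statistics
import time

async def measure_one(send, prompt):
    """One streaming request: record TTFT and output tokens/sec."""
    start = time.perf_counter()
    first_token_at = None
    n_tokens = 0
    async for _ in send(prompt):          # send yields generated tokens
        if first_token_at is None:
            first_token_at = time.perf_counter()
        n_tokens += 1
    total = time.perf_counter() - start
    return {"ttft_s": first_token_at - start,
            "tok_per_s": n_tokens / total,
            "tokens": n_tokens}

async def run_bench(send, prompts, concurrency):
    """Run all prompts with bounded concurrency; report medians."""
    sem = asyncio.Semaphore(concurrency)

    async def guarded(p):
        async with sem:
            return await measure_one(send, p)

    results = await asyncio.gather(*(guarded(p) for p in prompts))
    return {"median_ttft_s": statistics.median(r["ttft_s"] for r in results),
            "median_tok_per_s": statistics.median(r["tok_per_s"] for r in results),
            "total_tokens": sum(r["tokens"] for r in results)}

async def _fake_send(prompt):
    """Stand-in for the real SSE/streaming call; swap in aiohttp here."""
    for _ in range(8):
        await asyncio.sleep(0)
        yield "tok"

summary = asyncio.run(run_bench(_fake_send, ["a", "b", "c"], concurrency=2))
print(summary["total_tokens"])            # 24 tokens across 3 fake requests
```

The warmup period and the 5-minute steady-state windows from the methodology above matter as much as this loop; a cold measurement overweights prefill and model-load effects.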

flowchart TB
  W[Inference workload] --> Q{Every model decoder-only LLM?}
  Q -->|Yes, only LLMs| V{HF-centric team?}
  Q -->|No, mixed modalities| T[Triton + backends]
  V -->|Yes, lives on HF Hub| G[TGI]
  V -->|No, wants max throughput| VL[vLLM]
  T --> TT{Need absolute lowest latency?}
  TT -->|Yes, can afford engine builds| TR[Triton + TensorRT-LLM]
  TT -->|No, want flexibility| TV[Triton with vLLM backend]
[Figure: GPU inference-server benchmark dashboard for vLLM, TGI, and Triton on H100 hardware.]
Production LLM inference servers share the same hardware target — how they schedule and batch is what moves the throughput-per-dollar number.

Operational Story: Monitoring, Scaling, Kubernetes

Production serving isn't just tokens per second. vLLM exposes Prometheus metrics at /metrics (request latency, KV-cache hit rate, pending/running gauges). Horizontal scaling on Kubernetes is standard — a GPU-typed Deployment, a community Helm chart, readiness probes tied to /health. One subtle thing: vLLM's in-process scheduler hates aggressive autoscaling — bouncing pods under load invalidates KV cache. Set HorizontalPodAutoscaler cooldowns to 3-5 minutes. For the full GPU-scheduling picture see Kubernetes GPU scheduling.
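Spot-checking those gauges doesn't require a Prometheus server; the /metrics endpoint speaks the plain-text exposition format. A minimal parser sketch (the metric names in the sample follow vLLM's naming scheme but are illustrative here; verify them against your version's actual output):

```python
def parse_prom_text(text):
    """Parse Prometheus text exposition into {metric_name: value}.

    Minimal sketch: keeps the last sample per metric name and discards
    labels -- enough for spot-checking gauges like queue depth.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                      # skip blank/HELP/TYPE lines
        name_part, _, value = line.rpartition(" ")
        if "{" in name_part:              # drop the label set
            name_part = name_part.split("{", 1)[0]
        try:
            samples[name_part] = float(value)
        except ValueError:
            continue                      # ignore non-numeric samples
    return samples

# Illustrative snippet of vLLM-style /metrics output:
sample = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="llama-70b"} 12.0
vllm:num_requests_waiting{model_name="llama-70b"} 3.0
"""
print(parse_prom_text(sample)["vllm:num_requests_waiting"])  # 3.0
```

A waiting-requests gauge that climbs while running stays flat is the classic signal that you've hit the scheduler's concurrency ceiling rather than GPU saturation.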

TGI ships Prometheus metrics, structured JSON logs, and first-class OpenTelemetry traces (TGI was earlier to this than the other two). Queue semantics are more explicit than vLLM's — --max-waiting-tokens and --max-concurrent-requests are knobs you set, which makes capacity planning slightly easier.

Triton has the most mature ops story — metrics, tracing, gRPC + HTTP dual protocol, a Kubernetes Operator, and first-class model versioning and shadow deploys. If you're building a platform team that owns model serving as a product, Triton is the only server here with platform-grade tooling. The AI observability patterns that work in production — prompt-level token cost tracking, per-model p99 alerts — are easiest to bolt onto Triton because it has the hooks.

Quantization, Speculative Decoding, and Multi-LoRA

All three servers support FP8 on H100-class hardware in 2026, but the kernel paths differ. vLLM uses Transformer Engine FP8 with its own FP8 KV-cache. TGI mixes FBGEMM and TRT-LLM kernels for FP8 GEMMs. Triton + TRT-LLM uses TensorRT's native FP8 pipeline end-to-end, extracting slightly more throughput on pure GEMM-bound workloads at the cost of the 20-45 minute engine build tax. Beyond FP8: vLLM supports AWQ, GPTQ, Marlin, FP8 natively; TGI supports bitsandbytes, GPTQ, AWQ, EETQ; Triton/TRT-LLM runs SmoothQuant + INT8 + FP8 with weaker AWQ support as of Q1 2026. The KV cache mechanics piece covers the theory these tuning knobs fall out of.

Two production techniques matter: speculative decoding (small draft model proposes tokens a big target verifies, cutting latency 2-3x) and multi-LoRA serving (hot-swapping adapters without reloading base weights, cutting multi-tenant cost 5-20x). vLLM ships the most mature speculative decoding — Medusa, Eagle, n-gram, draft-model-based are all first-class. Multi-LoRA with Punica-style batching is production-ready as of vLLM 0.6. TGI has speculative decoding via Medusa plus native LoRA hot-reload (HF is heavily invested here for Inference Endpoints multi-tenancy). Triton + TRT-LLM supports Medusa and draft-based speculation but the deployment story is more complex. For multi-tenant SaaS with fine-tuned variants, TGI's LoRA story is the least painful.

Which Server Should You Pick? A Concrete Decision Matrix

After deploying all three at scale, here's the decision framework I actually use:

  • Pick vLLM if: your stack is open-weight LLMs only, you need maximum tokens/sec per dollar on H100, and your team can operate a Python-heavy service without needing an ops platform wrapper. This is the default for pure LLM API companies.
  • Pick TGI if: you live in Hugging Face's ecosystem, you want the shortest path from a model card to a serving endpoint, or you plan to use HF Inference Endpoints for at least part of your deployment. TGI is also the right pick when multi-LoRA SaaS tenancy is your main use case.
  • Pick Triton with TensorRT-LLM if: you serve a mix of LLMs and other models (vision, audio, embedding) behind a single gateway, or your platform team wants a server with enterprise ops primitives (model versioning, shadow deploys, first-class metrics/tracing, gRPC + HTTP).
  • Pick Triton with the vLLM backend if: you want Triton's platform surface (model repo, routing, metrics) but vLLM's scheduler performance on decoder-only models. This is underrated — you get the best of both, with the caveat of learning Triton's config format.
  • Skip all three and use a managed API if: you serve less than 20 million output tokens/day and your ops team has more valuable things to do than babysit GPU servers. See the math in the pricing section above.

Pro tip: If you're early and unsure, start with vLLM on a single 8xH100 node and measure. The operational cost of moving from vLLM to TGI or Triton later is 2-3 weeks of engineering work — not the six months of rewrites that a database change would demand. Optimizing for the wrong server upfront is usually the bigger cost.

Frequently Asked Questions

What is the fastest LLM inference server for production?

Triton + TensorRT-LLM narrowly beats vLLM on raw throughput and latency on H100 — about 4,310 tok/s vs 4,220 tok/s on Llama 3.1 70B at batch 64 in my April 2026 bench. vLLM is a closer second and wins on ease of deployment. TGI runs 15-25% behind both on equivalent hardware in most of my tests. The honest answer: pick the one your team can operate — the throughput gap between vLLM and Triton is rarely the dominant cost.

Is vLLM better than TGI?

For raw throughput and latest decoding research (Medusa, Eagle, FP8 KV-cache, chunked prefill), yes — vLLM lands features faster and runs 15-25% more tokens/sec on the same hardware. For Hugging Face ecosystem integration, multi-LoRA serving, and operational stability, TGI is the better fit. If your team deploys models from the HF Hub daily, TGI's native integration removes friction vLLM users work around manually.

When should I use Triton Inference Server?

Use Triton when you serve more than just LLMs behind one gateway — vision models, audio models, classical ML, and LLMs sharing a model repository. Triton's backend plugin model (PyTorch, ONNX, TensorFlow, TensorRT-LLM, vLLM) is unique. For LLM-only workloads, Triton is overkill; vLLM or TGI will serve you faster with less operational complexity.

Does vLLM support OpenAI-compatible API?

Yes. vLLM ships an OpenAI-compatible server by default at /v1/chat/completions and /v1/completions. Drop-in replacement for the OpenAI Python SDK — change the base URL and you're done. TGI also ships OpenAI-compatible endpoints as of version 1.4. Triton requires a wrapper (NVIDIA provides one in the trtllm_backend repo) to expose OpenAI format.

How much GPU memory does vLLM need for Llama 3 70B?

In bfloat16, Llama 3 70B needs roughly 140 GB of weights alone — so tensor-parallel 2 on 2x H100 80GB (160 GB total) is the minimum, with limited KV cache headroom. The practical configuration is tensor-parallel 4 on 4x H100 SXM, which gives you ~320 GB total, 140 GB for weights, and ~160 GB for PagedAttention KV cache at 32K context length. FP8 cuts weight memory roughly in half.
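The arithmetic behind those numbers is worth making explicit. A back-of-envelope sketch using Llama 3 70B's published architecture (80 layers, 8 KV heads via grouped-query attention, head dim 128, ~70.6B params); treat it as a sanity check, not a capacity planner:

```python
# Weight and KV-cache memory math for Llama 3 70B.

def weight_gb(n_params, bytes_per_param=2):
    """Weight memory in GB (bf16 = 2 bytes/param, fp8 = 1)."""
    return n_params * bytes_per_param / 1e9

def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """KV-cache bytes per token: 2x for the K and V tensors."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

weights = weight_gb(70.6e9)                 # ~141 GB in bf16
per_tok = kv_bytes_per_token(80, 8, 128)    # 327,680 bytes (~320 KB)/token

# On 4x H100 80GB at --gpu-memory-utilization 0.92, what's left after
# weights is the aggregate KV budget across all in-flight sequences:
kv_budget_bytes = (4 * 80 * 0.92 - weights) * 1e9
print(round(weights))                        # 141
print(per_tok)                               # 327680
print(int(kv_budget_bytes / per_tok))        # ~467k tokens of KV in bf16
print(int(kv_budget_bytes / (per_tok // 2))) # roughly double with fp8 KV
```

That aggregate-token budget divided by your typical context length is the real ceiling on concurrent sequences, which is why FP8 KV-cache moves the batch-64 throughput numbers so much.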

Can you run vLLM on AMD GPUs?

Yes, vLLM supports AMD MI250X and MI300X via ROCm, and an Intel Gaudi2/3 backend exists. In April 2026 the AMD kernels are roughly 20-30% behind NVIDIA on equivalent throughput — closing, but not closed. TGI has similar AMD support through the same ROCm path. Triton + TensorRT-LLM is NVIDIA-only; AMD users would run Triton with its vLLM backend on ROCm instead.

What is the difference between vLLM and TensorRT-LLM?

vLLM is a complete inference server (scheduler + engine + API). TensorRT-LLM is just the engine — you run it inside Triton (or inside vLLM as a backend) to get a server. TensorRT-LLM engines are pre-compiled for a specific model + GPU + precision combination, which takes 15-45 minutes per build but yields slightly faster kernels at runtime. vLLM loads any HF checkpoint in 2 minutes with no build step.

The Bottom Line

In 2026, if you're deploying vLLM vs TGI vs Triton for a production LLM service, the decision is less about benchmark winners and more about ecosystem alignment. vLLM is the throughput-first default for open-weight LLM APIs — pick it unless a specific constraint pushes you elsewhere. TGI is the right pick for HF-centric teams or multi-LoRA SaaS tenancy. Triton is the platform-team answer when LLMs are one workload among many, and Triton-with-vLLM-backend is the underrated sweet spot when you want Triton's ops primitives without sacrificing vLLM's scheduler performance. The benchmark gap between the two throughput leaders — roughly 1-3% at saturation — is inside the noise floor of most production deployments. Pick on team fit, deploy, measure, then tune. That sequence has burned me fewer times than optimizing on the wrong axis upfront.


Written by

Abhishek Patel

Infrastructure engineer with 10+ years building production systems on AWS, GCP, and bare metal. Writes practical guides on cloud architecture, containers, networking, and Linux for developers who want to understand how things actually work under the hood.
