#Python

41 articles

LLM Latency: TTFT, ITL, and Why End-User Latency Isn't What You Think
AI/ML Engineering

LLM latency decomposes into TTFT (time to first token, 300-1500 ms), ITL (inter-token latency, 10-30 ms), and total time. Each has different causes and fixes. Covers why streaming dominates UX, when Cerebras/Groq beat Claude on speed, and the optimization playbook.

11 min read
Python uv vs pip vs Poetry vs PDM: Speed Benchmarks 2026
DevOps

Real benchmarks: uv installs Django + ML stack in 8s vs pip's 90s, Poetry's 50s, PDM's 38s. Why uv is fast (Rust + parallelism + PubGrub), what pip still does that uv doesn't, migration paths, and where Poetry's ergonomics still win.

12 min read
Self-Hosting LLMs from India: Providers, Latency & INR Pricing (2026)
AI/ML Engineering

A practical comparison of self-hosting LLMs on Indian GPU clouds including E2E Networks, Tata TIR, and Yotta Shakti Cloud, with INR pricing inclusive of 18% GST, latency tests from Mumbai, Bangalore, Chennai, and Delhi, and DPDP Act 2023 compliance notes.

15 min read
Qwen 3 vs Qwen 3.5: What Changed & Should You Upgrade
AI/ML Engineering

Qwen 3.5 wins on long context, code, and agentic math (AIME +25.8 at 72B), but the 72B license shifted from Apache 2.0 to a community license and LoRA adapters do not port. Full architecture, benchmark, and migration breakdown.

15 min read
Qwen 3.5 VRAM Requirements: Every Model Size & Quantization
AI/ML Engineering

Full VRAM matrix for every Qwen 3.5 model from 0.5B to 397B across 8 quantization levels. GPU tier picks, CPU/RAM fallback, llama.cpp and vLLM launch flags.

16 min read
Claude Agent SDK: Build Custom AI Agents
AI/ML Engineering

Build production Claude agents in TypeScript or Python with the official Agent SDK. Tool-use loop, MCP integration, extended thinking, guardrails, and observability: an end-to-end tutorial in under 45 minutes.

16 min read
Qwen 3.5 on Apple Silicon: M3/M4 Tokens-per-Second
AI/ML Engineering

Qwen 3.5 hits 70-92 tok/s on M4 Max with MLX and 22 tok/s on 16 GB M4 base. Per-chip tables (M3 through M4 Ultra), MLX vs llama.cpp, thermal throttling, and when unified memory beats an RTX 4090.

15 min read
RAG vs Fine-Tuning vs Long Context in 2026: A Decision Guide
AI/ML Engineering

The 2026 refresh: 1M-token contexts, LoRA fine-tuning, RAG still the bread-and-butter. What each is best at, the cost math at realistic scale, hybrid patterns production uses, and why 'long context replaces RAG' got it wrong.

11 min read
vLLM vs TGI vs Triton: LLM Inference Server Comparison
AI/ML Engineering

Production LLM serving with vLLM 0.7, TGI 3.0, and NVIDIA Triton + TensorRT-LLM. Llama 3.1 70B H100 benchmarks, FP8 KV-cache numbers, $/1M token math, and a decision framework for picking the right server per team shape.

18 min read