AI/ML Engineering

Practical AI and machine learning engineering. LLM inference, tokenization, RAG pipelines, model deployment, vector databases, and the infrastructure behind modern AI applications.

54 articles

LLM Latency: TTFT, ITL, and Why End-User Latency Isn't What You Think

LLM latency decomposes into TTFT (time to first token, 300-1500ms), ITL (inter-token latency, 10-30ms), and total time. Each has different causes and fixes. Covers why streaming dominates UX, when Cerebras/Groq beat Claude on speed, and the optimization playbook.

11 min read

Self-Hosting LLMs from India: Providers, Latency & INR Pricing (2026)

A practical comparison of self-hosting LLMs on Indian GPU clouds including E2E Networks, Tata TIR, and Yotta Shakti Cloud, with INR pricing inclusive of 18% GST, latency tests from Mumbai, Bangalore, Chennai, and Delhi, and DPDP Act 2023 compliance notes.

15 min read

Qwen 3 vs Qwen 3.5: What Changed & Should You Upgrade

Qwen 3.5 wins on long context, code, and agentic math (AIME +25.8 at 72B) — but the 72B license shifted from Apache 2.0 to a community license and LoRA adapters do not port. Full architecture, benchmark, and migration breakdown.

15 min read

Qwen 3.5 VRAM Requirements: Every Model Size & Quantization

Full VRAM matrix for every Qwen 3.5 model from 0.5B to 397B across 8 quantization levels. GPU tier picks, CPU/RAM fallback, llama.cpp and vLLM launch flags.

16 min read

Claude Agent SDK: Build Custom AI Agents

Build production Claude agents in TypeScript or Python with the official Agent SDK. Tool-use loop, MCP integration, extended thinking, guardrails, and observability — end-to-end tutorial in under 45 minutes.

16 min read

Qwen 3.5 on Apple Silicon: M3/M4 Tokens-per-Second

Qwen 3.5 hits 70-92 tok/s on M4 Max with MLX and 22 tok/s on 16 GB M4 base. Per-chip tables (M3 through M4 Ultra), MLX vs llama.cpp, thermal throttling, and when unified memory beats an RTX 4090.

15 min read

Claude Opus 4.7 vs GPT-5.4 vs Gemini 3.1 Pro: Benchmarks

Head-to-head benchmarks across SWE-bench Verified, GPQA Diamond, AIME, and LiveBench. Real pricing per coding task, caching economics, and context-window behavior with a clear decision matrix.

18 min read

RAG vs Fine-Tuning vs Long Context in 2026: A Decision Guide

The 2026 refresh: 1M-token contexts, LoRA fine-tuning, RAG still the bread-and-butter. What each is best at, the cost math at realistic scale, hybrid patterns production uses, and why 'long context replaces RAG' got it wrong.

11 min read

vLLM vs TGI vs Triton: LLM Inference Server Comparison

Production LLM serving with vLLM 0.7, TGI 3.0, and NVIDIA Triton + TensorRT-LLM. Llama 3.1 70B H100 benchmarks, FP8 KV-cache numbers, $/1M token math, and a decision framework for picking the right server per team shape.

18 min read