AI/ML Engineering

Practical AI and machine learning engineering. LLM inference, tokenization, RAG pipelines, model deployment, vector databases, and the infrastructure behind modern AI applications.

54 articles

LLM Latency: TTFT, ITL, and Why End-User Latency Isn't What You Think

LLM latency decomposes into TTFT (time to first token, 300-1500 ms), ITL (inter-token latency, 10-30 ms), and total response time. Each has different causes and fixes. Covers why streaming dominates UX, when Cerebras/Groq beat Claude on speed, and the optimization playbook.

11 min read·Apr 20, 2026

Self-Hosting LLMs from India: Providers, Latency & INR Pricing (2026)

A practical comparison of self-hosting LLMs on Indian GPU clouds including E2E Networks, Tata TIR, and Yotta Shakti Cloud, with INR pricing inclusive of 18% GST, latency tests from Mumbai, Bangalore, Chennai, and Delhi, and DPDP Act 2023 compliance notes.

15 min read·Apr 14, 2026

Qwen 3 vs Qwen 3.5: What Changed & Should You Upgrade

Qwen 3.5 wins on long context, code, and agentic math (AIME +25.8 at 72B) — but the 72B license shifted from Apache 2.0 to a community license and LoRA adapters do not port. Full architecture, benchmark, and migration breakdown.

15 min read·Apr 14, 2026

Qwen 3.5 VRAM Requirements: Every Model Size & Quantization

Full VRAM matrix for every Qwen 3.5 model from 0.5B to 397B across 8 quantization levels. GPU tier picks, CPU/RAM fallback, llama.cpp and vLLM launch flags.

16 min read·Apr 14, 2026

Claude Agent SDK: Build Custom AI Agents

Build production Claude agents in TypeScript or Python with the official Agent SDK. Tool-use loop, MCP integration, extended thinking, guardrails, and observability — end-to-end tutorial in under 45 minutes.

16 min read·Apr 11, 2026

Qwen 3.5 on Apple Silicon: M3/M4 Tokens-per-Second

Qwen 3.5 hits 70-92 tok/s on M4 Max with MLX and 22 tok/s on 16 GB M4 base. Per-chip tables (M3 through M4 Ultra), MLX vs llama.cpp, thermal throttling, and when unified memory beats an RTX 4090.

15 min read·Apr 11, 2026

Claude Opus 4.7 vs GPT-5.4 vs Gemini 3.1 Pro: Benchmarks

Head-to-head benchmarks across SWE-bench Verified, GPQA Diamond, AIME, and LiveBench. Real pricing per coding task, caching economics, and context-window behavior with a clear decision matrix.

18 min read·Apr 11, 2026

RAG vs Fine-Tuning vs Long Context in 2026: A Decision Guide

The 2026 refresh: 1M-token contexts, LoRA fine-tuning, RAG still the bread-and-butter. What each is best at, the cost math at realistic scale, hybrid patterns production uses, and why 'long context replaces RAG' got it wrong.

11 min read·Apr 11, 2026

vLLM vs TGI vs Triton: LLM Inference Server Comparison

Production LLM serving with vLLM 0.7, TGI 3.0, and NVIDIA Triton + TensorRT-LLM. Llama 3.1 70B H100 benchmarks, FP8 KV-cache numbers, $/1M token math, and a decision framework for picking the right server per team shape.

18 min read·Apr 8, 2026
