AI/ML Engineering

Practical AI and machine learning engineering. LLM inference, tokenization, RAG pipelines, model deployment, vector databases, and the infrastructure behind modern AI applications.

54 articles

Ollama vs vLLM vs llama.cpp: LLM Inference Engines Compared

Benchmarks and architecture comparison of Ollama, vLLM, and llama.cpp. Tokens/sec at 7B through 70B, quantization trade-offs, concurrent throughput, VRAM requirements, and a clear decision framework for local dev, production, and edge.

13 min read

KV Cache Quantization: When Q8 Beats FP16 (and When It Doesn't)

Q8 KV cache halves VRAM with under 0.1% perplexity cost. Q4 K-cache is OK, Q4 V-cache hurts. Asymmetric Q4-K + Q8-V is the magic combo.

10 min read

RTX 5090 for Local LLMs: 32B Models with Headroom (2026)

The RTX 5090 unlocks Qwen 3.5 32B at Q5_K_M with 16K context. Native NVFP4 gives a 60-80% inference speedup over the RTX 4090. Real benchmarks and build guide.

12 min read

AI Observability: How to Monitor and Debug LLM Applications

A practical guide to monitoring LLM applications in production: input/output logging, cost tracking, quality metrics, and a comparison of LangSmith, Langfuse, and Arize.

10 min read

Self-Hosted LLM Cost: Hardware vs Cloud GPU vs API (2026)

Below 3M tokens/day, the API wins. From 3M to 30M tokens/day, cloud GPU wins. Above 30M tokens/day sustained, hardware pays back in 18-24 months. Real 2026 numbers.

12 min read

Deploying ML Models in Production: From Notebook to Kubernetes

End-to-end guide to deploying ML models, from ONNX export and FastAPI serving to Kubernetes GPU workloads, canary deployments, and Prometheus monitoring.

9 min read

Qwen 3.5 vs DeepSeek V4 vs GLM-5.1: Local Coding Showdown

Three frontier open-weight models compared for coding in April 2026. Qwen wins on consumer GPUs, GLM-5.1 leads SWE-Bench Pro, DeepSeek V4 has 1M context.

13 min read

Fine-Tuning vs Prompt Engineering: Choosing the Right Approach

A practical guide to choosing between prompt engineering and fine-tuning for LLMs: techniques, costs, LoRA/QLoRA, and a decision framework for production systems.

10 min read

Qwen 3.5 GGUF Quantization: Q4_K_M vs Q5_K_M vs Q8 Guide

Q5_K_M is the sweet spot for Qwen 3.5 GGUF. Full perplexity table, K-quants vs IQ-quants, NVFP4 on Blackwell, and picks by VRAM tier with framework flags.

16 min read