AI Observability: How to Monitor and Debug LLM Applications

A practical guide to monitoring LLM applications in production -- input/output logging, cost tracking, quality metrics, and a comparison of LangSmith, Langfuse, and Arize.

Abhishek Patel · 10 min read


Your LLM App Is a Black Box -- Here's How to Fix That

AI observability is the practice of monitoring, debugging, and understanding the behavior of LLM-powered applications in production. Traditional observability -- logs, metrics, traces -- gets you maybe 30% of the way there. LLM applications introduce entirely new failure modes: hallucinations that return 200 OK, cost spikes from runaway token usage, quality regressions that no error log will catch, and latency patterns that depend on output length rather than system load.

I've operated LLM applications serving millions of requests, and the hardest bugs I've encountered never triggered an alert. The model just started giving worse answers. Without LLM-specific observability, you're flying blind -- shipping a system where the most important failure mode is invisible to your monitoring stack.

What Is AI Observability?

Definition: AI observability is a set of practices and tools for monitoring, debugging, and evaluating AI and LLM applications in production. It extends traditional observability (logs, metrics, traces) with LLM-specific signals: input/output content logging, token usage tracking, quality metrics like faithfulness and groundedness, cost attribution, and latency breakdowns.

Standard APM tools tell you that your endpoint returned in 800ms. AI observability tells you why it returned that slowly (the model generated 2000 tokens), whether the response was actually good (faithfulness score: 0.3 -- terrible), and how much that request cost ($0.04 -- 10x your budget).

The Five Pillars of LLM Observability

Pillar 1: Input/Output Logging

Log every prompt and response. This is non-negotiable. When a user reports a bad answer, you need to see exactly what went in and what came out. But there's a catch -- prompts often contain user data, and responses might contain PII the model generated.

import hashlib
import logging
import re

logger = logging.getLogger(__name__)

def scrub_pii(text: str) -> str:
    """Remove common PII patterns before logging."""
    # Email addresses
    text = re.sub(r'[\w.-]+@[\w.-]+\.\w+', '[EMAIL]', text)
    # Phone numbers (US format)
    text = re.sub(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b', '[PHONE]', text)
    # SSN
    text = re.sub(r'\b\d{3}-\d{2}-\d{4}\b', '[SSN]', text)
    # Credit card numbers
    text = re.sub(r'\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b', '[CC]', text)
    return text

def log_llm_interaction(prompt: str, response: str, metadata: dict):
    """Log with PII scrubbing and content hashing."""
    scrubbed_prompt = scrub_pii(prompt)
    scrubbed_response = scrub_pii(response)
    content_hash = hashlib.sha256(prompt.encode()).hexdigest()[:12]

    logger.info("llm_interaction", extra={
        "prompt_hash": content_hash,
        "prompt": scrubbed_prompt,
        "response": scrubbed_response,
        "model": metadata.get("model"),
        "tokens_in": metadata.get("prompt_tokens"),
        "tokens_out": metadata.get("completion_tokens"),
    })

Watch out: Logging raw prompts and responses can create compliance issues under GDPR, HIPAA, and CCPA. Implement PII scrubbing before logging, not after. Establish a retention policy for LLM logs -- 30-90 days is typical. Some teams log content hashes instead of raw text and only retrieve the original content when debugging a specific issue.

Pillar 2: Token Usage and Cost Tracking

LLM costs are directly proportional to token consumption, and token consumption is highly variable. A single prompt with a large context window can cost 100x more than a simple question.

from prometheus_client import Counter, Histogram

TOKEN_USAGE = Counter(
    "llm_tokens_total",
    "Total tokens consumed",
    ["model", "direction", "feature"]  # direction: input/output
)
REQUEST_COST = Histogram(
    "llm_request_cost_dollars",
    "Cost per LLM request in dollars",
    ["model", "feature"],
    buckets=[0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0]
)

# Per-model pricing (dollars per 1M tokens)
PRICING = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "claude-sonnet-4": {"input": 3.00, "output": 15.00},
}

def track_usage(model: str, input_tokens: int, output_tokens: int, feature: str):
    TOKEN_USAGE.labels(model=model, direction="input", feature=feature).inc(input_tokens)
    TOKEN_USAGE.labels(model=model, direction="output", feature=feature).inc(output_tokens)

    pricing = PRICING.get(model, {"input": 0, "output": 0})
    cost = (input_tokens * pricing["input"] + output_tokens * pricing["output"]) / 1_000_000
    REQUEST_COST.labels(model=model, feature=feature).observe(cost)

Pro tip: Track costs per feature, not just per model. Knowing that your "document summarization" feature costs $2,400/month while "chat" costs $300/month lets you optimize where it matters. Set per-feature cost budgets with alerts. I've seen a single runaway prompt loop generate a $500 bill in an hour.

Pillar 3: Latency Decomposition

LLM latency isn't a single number. Break it down into components that each tell a different story:

| Metric | What It Measures | Why It Matters |
|---|---|---|
| TTFT (Time to First Token) | Time from request to first token streamed | Perceived responsiveness in streaming UIs |
| TPS (Tokens Per Second) | Generation speed after first token | Reading speed for streaming, throughput for batch |
| Total Latency | Full request-response time | SLA compliance, user experience for non-streaming |
| Queue Time | Time waiting before inference starts | Capacity planning signal |

TTFT is especially important for streaming applications. A 200ms TTFT with 50 TPS feels snappy. A 3-second TTFT with 100 TPS feels sluggish even though total latency might be lower. These metrics require different optimizations -- TTFT improves with smaller models or speculative decoding, while TPS improves with batching and KV-cache optimization.
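
These timings are easy to capture yourself from any streaming client: just time the token iterator. The generic iterator interface below is an assumption; adapt it to whatever your SDK actually yields:

```python
import time

def measure_streaming(token_iter) -> dict:
    """Compute TTFT and TPS from any iterator that yields tokens."""
    start = time.monotonic()
    first_token_at = None
    count = 0
    for _ in token_iter:
        if first_token_at is None:
            first_token_at = time.monotonic()  # first token arrived
        count += 1
    end = time.monotonic()
    if first_token_at is None:
        return {"ttft_s": None, "tokens": 0, "tps": 0.0}
    gen_time = end - first_token_at  # generation time after first token
    tps = (count - 1) / gen_time if gen_time > 0 and count > 1 else 0.0
    return {"ttft_s": first_token_at - start, "tokens": count, "tps": tps}
```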

Pillar 4: Quality Metrics

Direct answer: The three most important quality metrics for LLM applications are faithfulness (does the output match provided context?), groundedness (are claims supported by source data?), and toxicity (does the output contain harmful content?). These are typically evaluated using a separate LLM-as-judge or specialized classifiers.

| Metric | What It Catches | How to Measure | Typical Threshold |
|---|---|---|---|
| Faithfulness | Hallucinated facts not in context | LLM-as-judge, NLI models | > 0.85 |
| Groundedness | Claims without source attribution | Citation extraction + verification | > 0.80 |
| Toxicity | Harmful, biased, or inappropriate content | Perspective API, custom classifiers | < 0.05 |
| Relevance | Off-topic or unhelpful responses | Embedding similarity to query | > 0.70 |
| Coherence | Contradictions, logical errors | LLM-as-judge | > 0.80 |

The challenge is that quality metrics are expensive to compute -- running an LLM-as-judge on every response doubles your LLM costs. Most teams sample: evaluate 5-10% of production traffic, with higher sampling rates on new features or after model changes.

Pillar 5: Trace-Level Visibility

LLM applications are rarely a single model call. A typical RAG pipeline involves embedding the query, searching a vector database, reranking results, building a prompt, calling the LLM, and possibly calling a second LLM for quality checking. Each step can fail or degrade independently. You need distributed tracing that captures the full chain.

from opentelemetry import trace

tracer = trace.get_tracer("llm-app")

async def rag_pipeline(query: str):
    with tracer.start_as_current_span("rag_pipeline") as span:
        # Scrub or hash the query first if it may contain PII
        span.set_attribute("query", query)

        with tracer.start_as_current_span("embed_query"):
            query_embedding = await embed(query)

        with tracer.start_as_current_span("vector_search") as search_span:
            results = await vector_db.search(query_embedding, top_k=5)
            search_span.set_attribute("results_count", len(results))
            if results:
                search_span.set_attribute("top_score", results[0].score)

        with tracer.start_as_current_span("rerank"):
            reranked = await reranker.rerank(query, results)

        with tracer.start_as_current_span("llm_generate") as llm_span:
            response = await llm.generate(build_prompt(query, reranked))
            llm_span.set_attribute("tokens_in", response.usage.prompt_tokens)
            llm_span.set_attribute("tokens_out", response.usage.completion_tokens)
            llm_span.set_attribute("model", response.model)

        return response.content

Tool Comparison: LangSmith vs Langfuse vs Arize

| Feature | LangSmith | Langfuse | Arize Phoenix |
|---|---|---|---|
| Open Source | No | Yes (self-host) | Yes (Phoenix OSS) |
| Best For | LangChain users | Framework-agnostic teams | Data science teams, drift detection |
| Tracing | Excellent | Excellent | Good |
| Evaluation | Built-in, LLM-as-judge | Custom evaluators, scoring | Strong drift/embedding analysis |
| Cost Tracking | Yes | Yes | Limited |
| Self-hosting | No | Yes (Docker) | Yes (Phoenix) |
| Pricing | Free tier, then $39+/mo | Free self-hosted, cloud from $0 | Free OSS, cloud from $0 |
| Framework Lock-in | Works best with LangChain | None | None |

Pro tip: If you're already using LangChain, LangSmith is the path of least resistance -- tracing is built in. If you want to avoid vendor lock-in or need self-hosting for compliance, Langfuse is the strongest option. For teams that care more about embedding drift and model performance analysis than tracing, Arize Phoenix fills a unique niche.

DIY Observability with OpenTelemetry

If you want to own your observability stack, OpenTelemetry provides the foundation. The opentelemetry-instrumentation-openai and opentelemetry-instrumentation-anthropic packages auto-instrument LLM API calls with just a couple of lines of setup.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Set up OpenTelemetry with OTLP export (works with Jaeger, Grafana Tempo, etc.)
provider = TracerProvider()
processor = BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317"))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

# Auto-instrument OpenAI calls
from opentelemetry.instrumentation.openai import OpenAIInstrumentor
OpenAIInstrumentor().instrument()

# Now every OpenAI call automatically generates spans with:
# - Model name, temperature, max_tokens
# - Input/output token counts
# - Latency breakdown
# - Request/response content (configurable)

Evaluation Datasets and Regression Testing

Direct answer: Build an evaluation dataset of 50-200 question-answer pairs that represent your application's critical use cases. Run this dataset against your LLM pipeline after every change -- model update, prompt change, RAG configuration tweak -- and compare scores to your baseline. This is your regression test suite for AI quality.

  1. Collect golden examples -- curate input-output pairs that represent ideal behavior across your key use cases
  2. Define metrics per example -- what does "correct" mean for each case? Exact match? Semantic similarity above 0.9? Contains specific keywords?
  3. Automate evaluation -- run the dataset through your pipeline on every PR or deployment, just like unit tests
  4. Track scores over time -- a 5% drop in faithfulness after a prompt change is a regression, even if no individual test "fails"
  5. Expand continuously -- add production failures to the dataset as you discover them. This is your growing regression suite
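
Steps 3 and 4 can be as simple as a function your CI calls on every PR. Here, pipeline and score are stand-ins for your actual chain and metric, and the 5% margin mirrors the regression example above:

```python
def run_eval(dataset: list, pipeline, score, baseline: float,
             max_drop: float = 0.05) -> tuple[float, bool]:
    """Run golden examples through the pipeline; flag a regression if the
    mean score falls more than max_drop below the stored baseline."""
    scores = [score(pipeline(item["input"]), item["expected"]) for item in dataset]
    mean = sum(scores) / len(scores)
    return mean, mean < baseline - max_drop
```

Fail the build when the second return value is True, and update the stored baseline only on deliberate, reviewed changes.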

Cost of AI Observability Tools

| Approach | Monthly Cost (100K requests) | Setup Effort | Completeness |
|---|---|---|---|
| LangSmith (Pro) | $39-400 | Low (especially with LangChain) | High |
| Langfuse Cloud | $0-59 | Low-Medium | High |
| Langfuse Self-hosted | $0 + infra costs | Medium | High |
| Arize Phoenix (OSS) | $0 + infra costs | Medium | Medium-High |
| DIY (OTel + Grafana) | $0-50 + infra costs | High | Customizable |

Frequently Asked Questions

How is AI observability different from traditional application monitoring?

Traditional monitoring tracks system health -- uptime, latency, error rates. AI observability adds content-level monitoring -- what the model said, whether it was accurate, how much it cost, and whether quality is drifting over time. A model can return 200 OK on every request while giving increasingly wrong answers. Traditional monitoring won't catch that.

Do I need to log every LLM request and response?

For production debugging, yes -- you need the ability to inspect any specific interaction. But you can tier your logging: store full content for 30 days, then keep only metadata (tokens, latency, cost, quality scores) for longer-term analysis. Sampling quality evaluations at 5-10% keeps costs manageable while still catching systematic issues.

What's the most important metric to track for LLM applications?

Cost per request, broken down by feature. It's the metric that surprises teams the most and has the most direct business impact. After that, faithfulness (for RAG applications) or user satisfaction scores (for chat applications). Latency is important but rarely the first thing that breaks.

How do I detect hallucinations in production?

Use a lightweight NLI (Natural Language Inference) model or LLM-as-judge to check whether the response is entailed by the provided context. Run this on a sample of production traffic. Flag responses with faithfulness scores below your threshold for human review. Over time, build a dataset of confirmed hallucinations to improve your detection.
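
As a toy illustration of that flag-below-threshold flow (emphatically not a substitute for a real NLI model or LLM judge), even a crude lexical-overlap score shows the shape of the check:

```python
def faithfulness_proxy(response: str, context: str,
                       threshold: float = 0.5) -> tuple[float, bool]:
    """Crude proxy: fraction of response words that appear in the context.
    Returns (score, flagged_for_review)."""
    resp_tokens = set(response.lower().split())
    ctx_tokens = set(context.lower().split())
    if not resp_tokens:
        return 1.0, False
    overlap = len(resp_tokens & ctx_tokens) / len(resp_tokens)
    return overlap, overlap < threshold
```

A real deployment swaps the overlap computation for an NLI entailment score or a judge call, keeping the same sample-score-flag pipeline.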

Can I use existing APM tools like Datadog or New Relic for LLM observability?

Partially. Datadog and New Relic both offer LLM-specific integrations now that capture token usage and basic tracing. They're decent for operational metrics but weaker on quality evaluation and content-level analysis. Many teams use a traditional APM tool for system metrics alongside a specialized tool like Langfuse or LangSmith for LLM-specific observability.

What is an evaluation dataset and how large does it need to be?

An evaluation dataset is a curated set of inputs paired with expected outputs that you use to regression-test your LLM pipeline. Start with 50 examples covering your critical use cases. Expand to 200+ as you discover edge cases in production. Quality matters more than size -- 50 carefully curated examples beat 500 auto-generated ones.

How do I handle PII in LLM logs?

Implement PII scrubbing as a pre-processing step before any logging or storage. Use regex patterns for structured PII (emails, phone numbers, SSNs) and NER models for unstructured PII (names, addresses). Hash original content for lookup capability. Establish retention policies and ensure your logging pipeline is compliant with GDPR, HIPAA, or CCPA as applicable.

Observe First, Optimize Second

The single most impactful thing you can do for your LLM application is instrument it before you optimize it. Add input/output logging, token tracking, and cost attribution on day one. Add quality evaluation sampling within the first week. Build your evaluation dataset continuously from production data. Every optimization you make afterward will be informed by real data instead of guesswork. The teams I've seen succeed with LLM applications in production aren't the ones with the fanciest models -- they're the ones that can see what their system is actually doing.

Written by

Abhishek Patel

Infrastructure engineer with 10+ years building production systems on AWS, GCP, and bare metal. Writes practical guides on cloud architecture, containers, networking, and Linux for developers who want to understand how things actually work under the hood.
