AI Observability: How to Monitor and Debug LLM Applications

A practical guide to monitoring LLM applications in production -- input/output logging, cost tracking, quality metrics, and a comparison of LangSmith, Langfuse, and Arize.

Abhishek Patel · 10 min read


Your LLM App Is a Black Box -- Here's How to Fix That

AI observability is the practice of monitoring, debugging, and understanding the behavior of LLM-powered applications in production. Traditional observability -- logs, metrics, traces -- gets you maybe 30% of the way there. LLM applications introduce entirely new failure modes: hallucinations that return 200 OK, cost spikes from runaway token usage, quality regressions that no error log will catch, and latency patterns that depend on output length rather than system load.

I've operated LLM applications serving millions of requests, and the hardest bugs I've encountered never triggered an alert. The model just started giving worse answers. Without LLM-specific observability, you're flying blind -- shipping a system where the most important failure mode is invisible to your monitoring stack.

What Is AI Observability?

Definition: AI observability is a set of practices and tools for monitoring, debugging, and evaluating AI and LLM applications in production. It extends traditional observability (logs, metrics, traces) with LLM-specific signals: input/output content logging, token usage tracking, quality metrics like faithfulness and groundedness, cost attribution, and latency breakdowns.

Standard APM tools tell you that your endpoint returned in 800ms. AI observability tells you why it returned that slowly (the model generated 2000 tokens), whether the response was actually good (faithfulness score: 0.3 -- terrible), and how much that request cost ($0.04 -- 10x your budget).

The Five Pillars of LLM Observability

Pillar 1: Input/Output Logging

Log every prompt and response. This is non-negotiable. When a user reports a bad answer, you need to see exactly what went in and what came out. But there's a catch -- prompts often contain user data, and responses might contain PII the model generated.

import hashlib
import logging
import re

logger = logging.getLogger(__name__)

def scrub_pii(text: str) -> str:
    """Remove common PII patterns before logging."""
    # Email addresses
    text = re.sub(r'[\w.-]+@[\w.-]+\.\w+', '[EMAIL]', text)
    # Phone numbers (US format)
    text = re.sub(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b', '[PHONE]', text)
    # SSN
    text = re.sub(r'\b\d{3}-\d{2}-\d{4}\b', '[SSN]', text)
    # Credit card numbers
    text = re.sub(r'\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b', '[CC]', text)
    return text

def log_llm_interaction(prompt: str, response: str, metadata: dict):
    """Log with PII scrubbing and content hashing."""
    scrubbed_prompt = scrub_pii(prompt)
    scrubbed_response = scrub_pii(response)
    content_hash = hashlib.sha256(prompt.encode()).hexdigest()[:12]

    logger.info("llm_interaction", extra={
        "prompt_hash": content_hash,
        "prompt": scrubbed_prompt,
        "response": scrubbed_response,
        "model": metadata.get("model"),
        "tokens_in": metadata.get("prompt_tokens"),
        "tokens_out": metadata.get("completion_tokens"),
    })

Watch out: Logging raw prompts and responses can create compliance issues under GDPR, HIPAA, and CCPA. Implement PII scrubbing before logging, not after. Establish a retention policy for LLM logs -- 30-90 days is typical. Some teams log content hashes instead of raw text and only retrieve the original content when debugging a specific issue.

Pillar 2: Token Usage and Cost Tracking

LLM costs are directly proportional to token consumption, and token consumption is highly variable. A single prompt with a large context window can cost 100x more than a simple question.

from prometheus_client import Counter, Histogram

TOKEN_USAGE = Counter(
    "llm_tokens_total",
    "Total tokens consumed",
    ["model", "direction", "feature"]  # direction: input/output
)
REQUEST_COST = Histogram(
    "llm_request_cost_dollars",
    "Cost per LLM request in dollars",
    ["model", "feature"],
    buckets=[0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0]
)

# Per-model pricing (dollars per 1M tokens)
PRICING = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "claude-sonnet-4": {"input": 3.00, "output": 15.00},
}

def track_usage(model: str, input_tokens: int, output_tokens: int, feature: str):
    TOKEN_USAGE.labels(model=model, direction="input", feature=feature).inc(input_tokens)
    TOKEN_USAGE.labels(model=model, direction="output", feature=feature).inc(output_tokens)

    pricing = PRICING.get(model, {"input": 0, "output": 0})
    cost = (input_tokens * pricing["input"] + output_tokens * pricing["output"]) / 1_000_000
    REQUEST_COST.labels(model=model, feature=feature).observe(cost)

Pro tip: Track costs per feature, not just per model. Knowing that your "document summarization" feature costs $2,400/month while "chat" costs $300/month lets you optimize where it matters. Set per-feature cost budgets with alerts. I've seen a single runaway prompt loop generate a $500 bill in an hour.

Pillar 3: Latency Decomposition

LLM latency isn't a single number. Break it down into components that each tell a different story:

| Metric | What It Measures | Why It Matters |
|---|---|---|
| TTFT (Time to First Token) | Time from request to first token streamed | Perceived responsiveness in streaming UIs |
| TPS (Tokens Per Second) | Generation speed after first token | Reading speed for streaming, throughput for batch |
| Total Latency | Full request-response time | SLA compliance, user experience for non-streaming |
| Queue Time | Time waiting before inference starts | Capacity planning signal |

TTFT is especially important for streaming applications. A 200ms TTFT with 50 TPS feels snappy. A 3-second TTFT with 100 TPS feels sluggish even though total latency might be lower. These metrics require different optimizations -- TTFT improves with smaller models or speculative decoding, while TPS improves with batching and KV-cache optimization.
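
These timings are easy to capture yourself from any streaming client: just time the token iterator. The generic iterator interface below is an assumption; adapt it to whatever your SDK actually yields:

```python
import time

def measure_streaming(token_iter) -> dict:
    """Compute TTFT and TPS from any iterator that yields tokens."""
    start = time.monotonic()
    first_token_at = None
    count = 0
    for _ in token_iter:
        if first_token_at is None:
            first_token_at = time.monotonic()  # first token arrived
        count += 1
    end = time.monotonic()
    if first_token_at is None:
        return {"ttft_s": None, "tokens": 0, "tps": 0.0}
    gen_time = end - first_token_at  # generation time after first token
    tps = (count - 1) / gen_time if gen_time > 0 and count > 1 else 0.0
    return {"ttft_s": first_token_at - start, "tokens": count, "tps": tps}
```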

Pillar 4: Quality Metrics

Direct answer: The three most important quality metrics for LLM applications are faithfulness (does the output match provided context?), groundedness (are claims supported by source data?), and toxicity (does the output contain harmful content?). These are typically evaluated using a separate LLM-as-judge or specialized classifiers.

| Metric | What It Catches | How to Measure | Typical Threshold |
|---|---|---|---|
| Faithfulness | Hallucinated facts not in context | LLM-as-judge, NLI models | > 0.85 |
| Groundedness | Claims without source attribution | Citation extraction + verification | > 0.80 |
| Toxicity | Harmful, biased, or inappropriate content | Perspective API, custom classifiers | < 0.05 |
| Relevance | Off-topic or unhelpful responses | Embedding similarity to query | > 0.70 |
| Coherence | Contradictions, logical errors | LLM-as-judge | > 0.80 |

The challenge is that quality metrics are expensive to compute -- running an LLM-as-judge on every response doubles your LLM costs. Most teams sample: evaluate 5-10% of production traffic, with higher sampling rates on new features or after model changes.

Pillar 5: Trace-Level Visibility

LLM applications are rarely a single model call. A typical RAG pipeline involves embedding the query, searching a vector database, reranking results, building a prompt, calling the LLM, and possibly calling a second LLM for quality checking. Each step can fail or degrade independently. You need distributed tracing that captures the full chain.

from opentelemetry import trace

tracer = trace.get_tracer("llm-app")

async def rag_pipeline(query: str):
    with tracer.start_as_current_span("rag_pipeline") as span:
        # Scrub or hash the query first if it may contain PII
        span.set_attribute("query", query)

        with tracer.start_as_current_span("embed_query"):
            query_embedding = await embed(query)

        with tracer.start_as_current_span("vector_search") as search_span:
            results = await vector_db.search(query_embedding, top_k=5)
            search_span.set_attribute("results_count", len(results))
            if results:
                search_span.set_attribute("top_score", results[0].score)

        with tracer.start_as_current_span("rerank"):
            reranked = await reranker.rerank(query, results)

        with tracer.start_as_current_span("llm_generate") as llm_span:
            response = await llm.generate(build_prompt(query, reranked))
            llm_span.set_attribute("tokens_in", response.usage.prompt_tokens)
            llm_span.set_attribute("tokens_out", response.usage.completion_tokens)
            llm_span.set_attribute("model", response.model)

        return response.content

Tool Comparison: LangSmith vs Langfuse vs Arize

| Feature | LangSmith | Langfuse | Arize Phoenix |
|---|---|---|---|
| Open Source | No | Yes (self-host) | Yes (Phoenix OSS) |
| Best For | LangChain users | Framework-agnostic teams | Data science teams, drift detection |
| Tracing | Excellent | Excellent | Good |
| Evaluation | Built-in, LLM-as-judge | Custom evaluators, scoring | Strong drift/embedding analysis |
| Cost Tracking | Yes | Yes | Limited |
| Self-hosting | No | Yes (Docker) | Yes (Phoenix) |
| Pricing | Free tier, then $39+/mo | Free self-hosted, cloud from $0 | Free OSS, cloud from $0 |
| Framework Lock-in | Works best with LangChain | None | None |

Pro tip: If you're already using LangChain, LangSmith is the path of least resistance -- tracing is built in. If you want to avoid vendor lock-in or need self-hosting for compliance, Langfuse is the strongest option. For teams that care more about embedding drift and model performance analysis than tracing, Arize Phoenix fills a unique niche.

DIY Observability with OpenTelemetry

If you want to own your observability stack, OpenTelemetry provides the foundation. The opentelemetry-instrumentation-openai and opentelemetry-instrumentation-anthropic packages auto-instrument LLM API calls with just a couple of lines of setup.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Set up OpenTelemetry with OTLP export (works with Jaeger, Grafana Tempo, etc.)
provider = TracerProvider()
processor = BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317"))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

# Auto-instrument OpenAI calls
from opentelemetry.instrumentation.openai import OpenAIInstrumentor
OpenAIInstrumentor().instrument()

# Now every OpenAI call automatically generates spans with:
# - Model name, temperature, max_tokens
# - Input/output token counts
# - Latency breakdown
# - Request/response content (configurable)

Evaluation Datasets and Regression Testing

Direct answer: Build an evaluation dataset of 50-200 question-answer pairs that represent your application's critical use cases. Run this dataset against your LLM pipeline after every change -- model update, prompt change, RAG configuration tweak -- and compare scores to your baseline. This is your regression test suite for AI quality.

  1. Collect golden examples -- curate input-output pairs that represent ideal behavior across your key use cases
  2. Define metrics per example -- what does "correct" mean for each case? Exact match? Semantic similarity above 0.9? Contains specific keywords?
  3. Automate evaluation -- run the dataset through your pipeline on every PR or deployment, just like unit tests
  4. Track scores over time -- a 5% drop in faithfulness after a prompt change is a regression, even if no individual test "fails"
  5. Expand continuously -- add production failures to the dataset as you discover them. This is your growing regression suite
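
Steps 3 and 4 can be as simple as a function your CI calls on every PR. Here, pipeline and score are stand-ins for your actual chain and metric, and the 5% margin mirrors the regression example above:

```python
def run_eval(dataset: list, pipeline, score, baseline: float,
             max_drop: float = 0.05) -> tuple[float, bool]:
    """Run golden examples through the pipeline; flag a regression if the
    mean score falls more than max_drop below the stored baseline."""
    scores = [score(pipeline(item["input"]), item["expected"]) for item in dataset]
    mean = sum(scores) / len(scores)
    return mean, mean < baseline - max_drop
```

Fail the build when the second return value is True, and update the stored baseline only on deliberate, reviewed changes.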

Cost of AI Observability Tools

| Approach | Monthly Cost (100K requests) | Setup Effort | Completeness |
|---|---|---|---|
| LangSmith (Pro) | $39-400 | Low (especially with LangChain) | High |
| Langfuse Cloud | $0-59 | Low-Medium | High |
| Langfuse Self-hosted | $0 + infra costs | Medium | High |
| Arize Phoenix (OSS) | $0 + infra costs | Medium | Medium-High |
| DIY (OTel + Grafana) | $0-50 + infra costs | High | Customizable |

Frequently Asked Questions

How is AI observability different from traditional application monitoring?

Traditional monitoring tracks system health -- uptime, latency, error rates. AI observability adds content-level monitoring -- what the model said, whether it was accurate, how much it cost, and whether quality is drifting over time. A model can return 200 OK on every request while giving increasingly wrong answers. Traditional monitoring won't catch that.

Do I need to log every LLM request and response?

For production debugging, yes -- you need the ability to inspect any specific interaction. But you can tier your logging: store full content for 30 days, then keep only metadata (tokens, latency, cost, quality scores) for longer-term analysis. Sampling quality evaluations at 5-10% keeps costs manageable while still catching systematic issues.

What's the most important metric to track for LLM applications?

Cost per request, broken down by feature. It's the metric that surprises teams the most and has the most direct business impact. After that, faithfulness (for RAG applications) or user satisfaction scores (for chat applications). Latency is important but rarely the first thing that breaks.

How do I detect hallucinations in production?

Use a lightweight NLI (Natural Language Inference) model or LLM-as-judge to check whether the response is entailed by the provided context. Run this on a sample of production traffic. Flag responses with faithfulness scores below your threshold for human review. Over time, build a dataset of confirmed hallucinations to improve your detection.
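
As a toy illustration of that flag-below-threshold flow (emphatically not a substitute for a real NLI model or LLM judge), even a crude lexical-overlap score shows the shape of the check:

```python
def faithfulness_proxy(response: str, context: str,
                       threshold: float = 0.5) -> tuple[float, bool]:
    """Crude proxy: fraction of response words that appear in the context.
    Returns (score, flagged_for_review)."""
    resp_tokens = set(response.lower().split())
    ctx_tokens = set(context.lower().split())
    if not resp_tokens:
        return 1.0, False
    overlap = len(resp_tokens & ctx_tokens) / len(resp_tokens)
    return overlap, overlap < threshold
```

A real deployment swaps the overlap computation for an NLI entailment score or a judge call, keeping the same sample-score-flag pipeline.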

Can I use existing APM tools like Datadog or New Relic for LLM observability?

Partially. Datadog and New Relic both offer LLM-specific integrations now that capture token usage and basic tracing. They're decent for operational metrics but weaker on quality evaluation and content-level analysis. Many teams use a traditional APM tool for system metrics alongside a specialized tool like Langfuse or LangSmith for LLM-specific observability.

What is an evaluation dataset and how large does it need to be?

An evaluation dataset is a curated set of inputs paired with expected outputs that you use to regression-test your LLM pipeline. Start with 50 examples covering your critical use cases. Expand to 200+ as you discover edge cases in production. Quality matters more than size -- 50 carefully curated examples beat 500 auto-generated ones.

How do I handle PII in LLM logs?

Implement PII scrubbing as a pre-processing step before any logging or storage. Use regex patterns for structured PII (emails, phone numbers, SSNs) and NER models for unstructured PII (names, addresses). Hash original content for lookup capability. Establish retention policies and ensure your logging pipeline is compliant with GDPR, HIPAA, or CCPA as applicable.

Observe First, Optimize Second

The single most impactful thing you can do for your LLM application is instrument it before you optimize it. Add input/output logging, token tracking, and cost attribution on day one. Add quality evaluation sampling within the first week. Build your evaluation dataset continuously from production data. Every optimization you make afterward will be informed by real data instead of guesswork. The teams I've seen succeed with LLM applications in production aren't the ones with the fanciest models -- they're the ones that can see what their system is actually doing.

Written by

Abhishek Patel

Infrastructure engineer with 10+ years building production systems on AWS, GCP, and bare metal. Writes practical guides on cloud architecture, containers, networking, and Linux for developers who want to understand how things actually work under the hood.
