OpenTelemetry vs Datadog: Cost & Architecture (2026)

The Renewal That Forced the Question

The Datadog renewal quote landed on a Tuesday: $847K/year, up from $512K. Twelve months of "we will revisit the custom metrics tier" had caught up with us. We were a 72-service, Series-B Node.js + Go platform, and every service had dutifully tagged its spans with customer_id, feature_flag, and a dozen other high-cardinality dimensions. Datadog's invoice was the bill for those tags. One line item -- "Custom Metrics" -- was $311K of the $847K.

We had a choice: negotiate the contract down (realistic ceiling, based on three vendor bake-offs: ~15% discount), or migrate to OpenTelemetry + Grafana and take a defensible position on our telemetry architecture for the next five years. We took option two. Six months later, the same observability surface area cost $71K/year in infrastructure plus one SRE's part-time attention. This guide is the decision framework we should have used two years earlier -- when the Datadog agent was "just a small cost" and every team was free-form tagging their way into a $300K annual Custom Metrics line.

OTel Is Plumbing. Datadog Is the House. You Pay for Both.

Before any comparison is useful, internalize that these two things operate at different layers. Treating them as "competitors" is a category error that drives most bad architecture decisions in this space.

OpenTelemetry is an instrumentation API, a set of SDKs, a wire protocol (OTLP), and a Collector. Full stop. It does not store data. It does not draw dashboards. It does not page you at 3 AM. It is the part of your observability stack that gets data from your services to somewhere. Swap the exporter in one config file and the same instrumentation suddenly writes to a different backend. That portability is the entire point.

Datadog is a vertically integrated SaaS that spans from the agent on your host all the way to the incident timeline your on-call engineer reads at 3 AM. Agents, ingestion, storage, indexing, dashboards, monitors, SLOs, APM, profiler, RUM, synthetics -- all under one UI, one login, one contract, one bill. That integration is why incident response feels fast in Datadog. It is also why the bill compounds.

The practical implication: you do not have to pick one layer and commit forever. You can instrument with OTel and ship to Datadog. You can instrument with dd-trace and later migrate via the OTel Collector's dual-write pipeline. The question is not "which tool" -- it is "where in the stack am I willing to be locked in, and for how long is that a good trade?" That question has very different answers at 10 services versus 200.

Architecture Comparison

The architectural differences between these two approaches affect deployment, operations, and long-term flexibility.

Aspect	OpenTelemetry + OSS Stack	Datadog
Instrumentation	OTel SDKs (vendor-neutral APIs)	dd-trace libraries (proprietary) or OTel SDKs
Collection	OTel Collector (self-managed)	Datadog Agent (self-managed or serverless)
Storage	Prometheus, Tempo, Loki, ClickHouse (self-managed or cloud)	Datadog-managed (fully hosted)
Visualization	Grafana (self-hosted or Grafana Cloud)	Datadog dashboards (built-in)
Alerting	Alertmanager, Grafana Alerting	Datadog Monitors (built-in)
Data format	OTLP (open standard)	Proprietary + OTLP ingestion support
Operational burden	High -- you run the infrastructure	Low -- Datadog manages it

Total Cost of Ownership at Three Scales

Cost is where this decision gets concrete. I've modeled TCO at three scales based on real-world deployments, including infrastructure, licensing, and engineering time to operate the stack. These numbers assume a containerized environment on AWS with average telemetry volume per service.

10-Service Startup

Cost Component	OTel + Grafana Cloud	Datadog Pro
Platform/licensing	$0 (free tier covers it)	~$690/mo (23 hosts x $15 infra + APM)
Infrastructure (Collector, storage)	~$150/mo (small Collector + Grafana free tier)	$0 (Datadog-managed)
Engineering time (setup + maintenance)	~40 hours initial, 4 hrs/mo ongoing	~8 hours initial, 1 hr/mo ongoing
Estimated monthly TCO	~$400-600	~$700-900

At this scale, Datadog is competitive. The engineering time savings nearly offset the licensing cost, and you get a polished experience from day one. For a startup with limited ops capacity, Datadog often wins here.

50-Service Mid-Stage Company

Cost Component	OTel + Grafana Stack	Datadog Pro
Platform/licensing	~$800/mo (Grafana Cloud Pro)	~$5,500/mo (hosts + APM + log ingestion)
Infrastructure	~$600/mo (Collector cluster, storage)	$0
Engineering time	~80 hours initial, 12 hrs/mo ongoing	~20 hours initial, 4 hrs/mo ongoing
Estimated monthly TCO	~$2,500-3,500	~$6,000-8,000

At 50 services, OTel + Grafana pulls clearly ahead. The engineering overhead is real but manageable for a team that has a dedicated platform or SRE function. The cost delta, roughly $3,500-4,500 per month going by the table above, funds a significant portion of an SRE salary.
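To make the 50-service comparison reproducible, here is a minimal sketch of how the table's components could roll up into one monthly figure. The $100/hour engineering rate and the 12-month amortization of setup hours are assumptions of this sketch, not numbers from the model above, and the `StackCosts` and `monthlyTco` names are hypothetical.

```typescript
// Hypothetical cost model: field names, hourly rate, and amortization
// window are illustrative assumptions, not part of the original tables.
interface StackCosts {
  platformPerMonth: number;     // licensing or SaaS fees, USD
  infraPerMonth: number;        // Collector and storage compute, USD
  setupHours: number;           // one-time engineering effort
  ongoingHoursPerMonth: number; // recurring engineering effort
}

function monthlyTco(
  costs: StackCosts,
  hourlyRateUsd: number,
  amortizationMonths: number,
): number {
  // Spread the one-time setup effort evenly across the amortization window.
  const setupPerMonth = (costs.setupHours * hourlyRateUsd) / amortizationMonths;
  const ongoingPerMonth = costs.ongoingHoursPerMonth * hourlyRateUsd;
  return costs.platformPerMonth + costs.infraPerMonth + setupPerMonth + ongoingPerMonth;
}

// Inputs copied from the 50-service table.
const otelStack: StackCosts = {
  platformPerMonth: 800,
  infraPerMonth: 600,
  setupHours: 80,
  ongoingHoursPerMonth: 12,
};
const datadogPro: StackCosts = {
  platformPerMonth: 5500,
  infraPerMonth: 0,
  setupHours: 20,
  ongoingHoursPerMonth: 4,
};

const otelTco = monthlyTco(otelStack, 100, 12);
const datadogTco = monthlyTco(datadogPro, 100, 12);
console.log(Math.round(otelTco), Math.round(datadogTco)); // 3267 6067
```

Under those assumptions both results land inside the ranges in the table. Change the hourly rate to your own loaded engineering cost before drawing conclusions, because that input moves the OTel side far more than the Datadog side.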

200-Service Enterprise

Cost Component	OTel + Grafana Stack	Datadog Enterprise
Platform/licensing	~$4,000/mo (Grafana Cloud Advanced)	~$50,000+/mo (hosts + APM + logs + custom metrics)
Infrastructure	~$3,000/mo (HA Collector, object storage)	$0
Engineering time	1-2 dedicated SREs	0.5 SRE for agent management
Estimated monthly TCO	~$12,000-18,000	~$50,000-80,000

At enterprise scale, the gap is dramatic. Datadog's per-host pricing model compounds relentlessly. Custom metrics pricing alone can add five figures monthly. This is where large organizations either negotiate aggressively with Datadog or migrate to an OTel-based stack.

Instrumentation: OTel SDKs vs. dd-trace

Both approaches offer auto-instrumentation for common frameworks and manual instrumentation APIs for custom business logic. Here is how they compare in a Node.js application.

OpenTelemetry Instrumentation

// tracing.ts -- loaded before application code
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const sdk = new NodeSDK({
  serviceName: 'payment-service',
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces',
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-fs': { enabled: false },
    }),
  ],
});

sdk.start();

Datadog dd-trace Instrumentation

// tracing.ts -- loaded before application code
import tracer from 'dd-trace';

tracer.init({
  service: 'payment-service',
  env: 'production',
  version: '1.4.2',
  logInjection: true,
  runtimeMetrics: true,
  profiling: true,
});

The Datadog setup is undeniably simpler. Fewer packages, less configuration, and features like profiling and runtime metrics are built in. OTel requires more explicit configuration but gives you portability -- that same instrumentation code works with Jaeger, Tempo, Honeycomb, or any OTLP-compatible backend.

The OTel Collector: Your Telemetry Pipeline

The OpenTelemetry Collector is the architectural component that makes OTel powerful. It sits between your services and your backends, acting as a vendor-neutral telemetry router that can process, filter, sample, enrich, and fan out data to multiple destinations simultaneously.

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 2048
  memory_limiter:
    check_interval: 1s
    limit_mib: 1024
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors-always
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-requests
        type: latency
        latency: { threshold_ms: 2000 }
      - name: baseline
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }
  attributes:
    actions:
      - key: deployment.environment
        value: production
        action: upsert

exporters:
  otlphttp/grafana:
    endpoint: https://otlp-gateway-prod-us-east.grafana.net/otlp
    headers:
      Authorization: "Basic ${GRAFANA_OTLP_TOKEN}"
  datadog:
    api:
      key: ${DD_API_KEY}
      site: datadoghq.com

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch, attributes]
      exporters: [otlphttp/grafana, datadog]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp/grafana]

This configuration demonstrates the Collector's killer feature: multi-destination export. You can send traces to both Grafana Cloud and Datadog simultaneously, making migration incremental rather than all-or-nothing. The tail sampling processor keeps 100% of errors and slow traces while sampling 5% of routine traffic, drastically reducing storage costs.
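The three sampling policies combine as a logical OR: a trace is kept if any policy says keep. The sketch below models that decision in application-level TypeScript purely to illustrate the logic. It is not the Collector's implementation (the real processor buffers spans per trace and uses its own probabilistic sampling), and `TraceSummary`, `SamplingPolicy`, and `shouldKeepTrace` are hypothetical names.

```typescript
// Simplified model of the errors-always + slow-requests + baseline policies.
// All names are hypothetical and exist only for illustration.
interface TraceSummary {
  hasError: boolean;  // any span in the trace ended with an ERROR status
  durationMs: number; // end-to-end trace duration
}

interface SamplingPolicy {
  latencyThresholdMs: number; // keep traces slower than this
  baselinePercentage: number; // 0-100, share of remaining traces to keep
}

// `random` is injectable so the decision can be tested deterministically.
function shouldKeepTrace(
  trace: TraceSummary,
  policy: SamplingPolicy,
  random: () => number = Math.random,
): boolean {
  if (trace.hasError) return true;                               // errors-always
  if (trace.durationMs > policy.latencyThresholdMs) return true; // slow-requests
  return random() * 100 < policy.baselinePercentage;             // baseline
}

const policy: SamplingPolicy = { latencyThresholdMs: 2000, baselinePercentage: 5 };

const failed = shouldKeepTrace({ hasError: true, durationMs: 40 }, policy, () => 0.99);
const slow = shouldKeepTrace({ hasError: false, durationMs: 2500 }, policy, () => 0.99);
const routineDropped = shouldKeepTrace({ hasError: false, durationMs: 40 }, policy, () => 0.99);
const routineKept = shouldKeepTrace({ hasError: false, durationMs: 40 }, policy, () => 0.01);
console.log(failed, slow, routineDropped, routineKept); // true true false true
```

The order of the checks does not change the outcome, but putting the deterministic policies first makes it obvious that the random draw only ever applies to healthy, fast traces.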

Grafana Flexibility vs. Datadog Polish

On the visualization side, the tradeoff is customizability versus out-of-the-box experience.

Feature	Grafana	Datadog
Dashboard building	Extremely flexible, any data source	Polished templates, guided setup
Data sources	100+ plugins (Prometheus, Loki, Tempo, Postgres, etc.)	Datadog metrics/traces/logs only
Alerting	Multi-source, Alertmanager or Grafana-native	Integrated monitors with anomaly detection
Trace-to-log correlation	Manual config (Tempo + Loki linking)	Automatic, zero config
APM service map	Requires Tempo + service graph connector	Built-in, auto-generated
Learning curve	Steeper -- PromQL, LogQL, TraceQL	Lower -- unified query interface
Notebooks/collaboration	Basic annotations	Full notebooks, incident timelines

Datadog's strength is correlation. Click a spike on a metric dashboard, pivot to traces for that time window, drill into a specific trace, jump to the associated logs -- all without leaving the platform. Grafana can do this too with Tempo, Loki, and Prometheus, but the linking requires configuration and the experience is less smooth. For teams that value speed-to-insight during incidents, Datadog's polish is real.

Vendor Lock-In: The Hidden Cost

Vendor lock-in is the argument most cited for OTel, and it deserves a nuanced discussion rather than hand-waving.

Datadog lock-in is real and multifaceted:

Instrumentation lock-in: dd-trace libraries use proprietary span formats and tags. Migrating means re-instrumenting every service.
Dashboard lock-in: Datadog dashboards, monitors, and SLOs are defined in Datadog's proprietary format. You can export them as JSON, but Grafana and other tools cannot import that JSON directly, so they have to be rebuilt.
Custom metrics lock-in: DogStatsD metric naming conventions differ from Prometheus/OTel conventions. Migration requires renaming and re-alerting.
Workflow lock-in: Incident management, runbooks, and on-call workflows built in Datadog must be rebuilt elsewhere.

OTel avoids instrumentation lock-in by design:

OTLP is an open standard supported by every major backend.
Switching from Tempo to Honeycomb means changing one exporter config in the Collector.
Your application code never changes when you swap backends.
Grafana dashboards can be version-controlled as JSON and migrated between instances.

That said, OTel does not eliminate all lock-in. If you build heavily on Grafana Cloud's specific features (Adaptive Metrics, for example), you carry some platform dependency. The difference is that the instrumentation layer -- the part that touches every service -- remains portable.

The Hybrid Approach: OTel Instrumentation with Datadog Backend

You don't have to choose one or the other at every layer. The most pragmatic approach for many teams is a hybrid: instrument with OpenTelemetry, send to Datadog.

# Hybrid: OTel Collector sending to Datadog
exporters:
  datadog:
    api:
      key: ${DD_API_KEY}
      site: datadoghq.com
    traces:
      span_name_as_resource_name: true
    metrics:
      resource_attributes_as_tags: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [datadog]

This gives you Datadog's dashboards, APM, and alerting while keeping your instrumentation vendor-neutral. If you later decide to move off Datadog, you change the Collector's exporter -- not your application code. Datadog supports OTLP ingestion natively, so compatibility is solid.

Caveats of the hybrid approach:

Some Datadog-specific features (Continuous Profiler, Error Tracking deep integration) work better with dd-trace.
OTel metric naming conventions may not map perfectly to Datadog's expectations. Test your dashboards.
You still pay Datadog's pricing -- the hybrid approach saves you from instrumentation lock-in, not from licensing costs.

Migration Guide: Datadog to OTel + Grafana

If you're moving off Datadog, here is the phased approach that minimizes risk:

Phase 1 -- Deploy the OTel Collector alongside the Datadog Agent. Configure it to receive OTLP and export to both Datadog and your target backend (e.g., Grafana Cloud). This lets you validate data parity without disrupting existing dashboards.
Phase 2 -- Migrate instrumentation service by service. Replace dd-trace with OTel SDKs in non-critical services first. Verify traces and metrics appear correctly in both backends. Use feature flags to toggle between instrumentation libraries during the transition.
Phase 3 -- Rebuild dashboards and alerts. Recreate your most critical Datadog dashboards in Grafana. Start with SLO dashboards and on-call views. This is the most time-consuming step -- budget 2-4 weeks for a 50-service deployment.
Phase 4 -- Cut over and decommission. Once all services emit OTel telemetry and all critical dashboards exist in Grafana, remove the Datadog exporter from the Collector and cancel the contract. Keep Datadog read-only access for 30 days to handle any gaps.

Migration reality check: Plan for 3-6 months for a 50+ service deployment. The instrumentation swap is the easy part. Rebuilding the institutional knowledge embedded in Datadog dashboards, monitors, and runbooks takes longer than anyone estimates. Do not underestimate Phase 3.

Decision Framework

Use this framework to decide which approach fits your team:

Choose	When
Datadog	Small team (fewer than 5 engineers), fewer than 20 services, no dedicated SRE, need observability fast, budget is not the primary constraint
OTel + Grafana	Platform/SRE team available, 30+ services, cost-sensitive, multi-cloud or hybrid environments, vendor independence is a strategic priority
Hybrid (OTel + Datadog)	Currently on Datadog and want to reduce future lock-in, planning eventual migration, need Datadog features today but want portable instrumentation

Failure Modes: What Actually Breaks in Production

Datadog custom-metrics bill shock. Every unique tag combination on a StatsD metric is a billable custom metric. A team I worked with tagged their request counter with customer_id (140K customers) and route (310 routes) -- that is up to 43.4M unique series, which at $0.05/series/mo comes to $2.17M per month, or about $26M/year, for a single counter. Fix: cap cardinality at ingest, use exemplars for high-cardinality drill-down, and alert on metrics_monthly_count from Datadog's own usage API before finance does.
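The cardinality arithmetic is worth running before a tag ships. A minimal sketch follows: `seriesCount` and `monthlyCostUsd` are hypothetical helpers, the price is passed in cents to keep the math exact, and the result is an upper bound that assumes every customer hits every route.

```typescript
// Worst-case series count: the product of each tag's distinct-value count.
function seriesCount(tagCardinalities: number[]): number {
  return tagCardinalities.reduce((product, n) => product * n, 1);
}

// Price is given in cents per series per month to avoid floating-point drift.
function monthlyCostUsd(series: number, centsPerSeries: number): number {
  return (series * centsPerSeries) / 100;
}

const series = seriesCount([140000, 310]); // customer_id x route
const monthlyUsd = monthlyCostUsd(series, 5); // $0.05 per series per month
const annualUsd = monthlyUsd * 12;

console.log(series, monthlyUsd, annualUsd); // 43400000 2170000 26040000
```

Real bills come in below this ceiling because most customers touch only a handful of routes, but the ceiling is the number to put in front of whoever approves the tag.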

OTel Collector OOMs under burst load. The default Collector config has no memory limiter. Under burst load it accepts more spans than it can buffer and gets OOM-killed by the kernel, losing in-flight telemetry right when you need it most. Fix: always configure the memory_limiter processor first in every pipeline and set its limit to 75% of the container's memory limit.
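If you template Collector configs from deployment manifests, the 75% rule is easy to derive rather than hard-code. A small sketch, assuming a hypothetical `memoryLimiterSettings` helper whose output keys mirror the `check_interval` and `limit_mib` fields used in the Collector config earlier:

```typescript
// Derive memory_limiter settings from the container's memory limit.
// The helper name and the 0.75 default are this sketch's assumptions.
interface MemoryLimiterSettings {
  check_interval: string;
  limit_mib: number;
}

function memoryLimiterSettings(
  containerLimitMiB: number,
  fraction: number = 0.75,
): MemoryLimiterSettings {
  if (containerLimitMiB <= 0 || fraction <= 0 || fraction >= 1) {
    throw new Error("container limit must be positive and fraction between 0 and 1");
  }
  // Round down so the limiter always trips before the kernel's OOM killer does.
  return { check_interval: "1s", limit_mib: Math.floor(containerLimitMiB * fraction) };
}

const settings = memoryLimiterSettings(2048);
console.log(settings.limit_mib); // 1536
```

Deriving the value this way keeps the limiter in step when someone resizes the container, which is the usual way the two numbers drift apart.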

dd-trace auto-instrumentation double-reports. Enable dd-trace in a Node.js app that also has OTel auto-instrumentation, and you get two spans per HTTP request. This doubles your APM bill and makes your traces look like the app is doing everything twice. Fix: disable one of them. Put a test in CI that asserts only one instrumentation library is loaded.
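One cheap way to enforce this in CI is a static check on the service's declared dependencies. The sketch below is an approximation, since it inspects what is declared rather than what is actually loaded at runtime. The package list and the `findTracerConflicts` and `assertSingleTracer` names are hypothetical, so extend the list to match whatever your services really install.

```typescript
// Packages that each install their own auto-instrumentation hooks.
// This list is an assumption of the sketch, not an exhaustive registry.
const TRACER_PACKAGES: string[] = [
  "dd-trace",
  "@opentelemetry/auto-instrumentations-node",
];

// Return every auto-instrumenting package found in a dependencies map,
// e.g. the "dependencies" object parsed from package.json.
function findTracerConflicts(dependencies: Record<string, string>): string[] {
  return TRACER_PACKAGES.filter((name) => name in dependencies);
}

// Throw when more than one auto-instrumenting library is declared.
function assertSingleTracer(dependencies: Record<string, string>): void {
  const found = findTracerConflicts(dependencies);
  if (found.length > 1) {
    throw new Error(`multiple tracers declared: ${found.join(", ")}`);
  }
}

const conflicting = findTracerConflicts({
  "dd-trace": "^5.0.0",
  "@opentelemetry/auto-instrumentations-node": "^0.50.0",
  express: "^4.19.0",
});
console.log(conflicting.length); // 2
```

During a planned migration you will want an allowlist for services that deliberately carry both libraries behind a toggle. Without one, the check blocks exactly the services you are trying to move.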

Tail sampling drops the traces you actually need. Naive sampling ("keep 10% of traces") has a 90% chance of throwing away the one trace that contains the slow database query you are hunting. Fix: policy-based sampling with error-always + latency-threshold + baseline-probabilistic, as in the Collector config above. Never deploy probabilistic-only sampling in production.

Datadog log quota blown by a debug flag. Someone flips LOG_LEVEL=debug in production "for five minutes" to investigate something. They forget. 36 hours later, the monthly log quota is gone and you are paying $0.10/GB overage on 400 GB/day. Fix: make LOG_LEVEL part of the feature-flag system with an automatic 30-minute expiry, and alert on log-volume anomalies.
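The automatic expiry is the part teams skip, so here is a minimal sketch of it. It assumes the override arrives from your flag system with a timestamp. The `LogLevelOverride` shape and the `effectiveLogLevel` function are hypothetical, and the 30-minute TTL matches the fix described above.

```typescript
type LogLevel = "debug" | "info" | "warn" | "error";

// Hypothetical shape of an override as stored in a feature-flag system.
interface LogLevelOverride {
  level: LogLevel;
  setAtMs: number; // epoch milliseconds when the override was applied
}

const OVERRIDE_TTL_MS = 30 * 60 * 1000; // 30 minutes

// Resolve the level to log at: an override wins only until it expires.
function effectiveLogLevel(
  defaultLevel: LogLevel,
  override: LogLevelOverride | undefined,
  nowMs: number,
): LogLevel {
  if (override === undefined) return defaultLevel;
  const expired = nowMs - override.setAtMs >= OVERRIDE_TTL_MS;
  return expired ? defaultLevel : override.level;
}

const setAt = 1_700_000_000_000;
const debugOverride: LogLevelOverride = { level: "debug", setAtMs: setAt };

const duringWindow = effectiveLogLevel("info", debugOverride, setAt + 5 * 60 * 1000);
const afterWindow = effectiveLogLevel("info", debugOverride, setAt + 36 * 60 * 60 * 1000);
console.log(duringWindow, afterWindow); // debug info
```

Because the expiry is evaluated on every read, nobody has to remember to flip the flag back. The forgotten override from the story above would have reverted to info after 30 minutes instead of running for 36 hours.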

Frequently Asked Questions

Can I use OpenTelemetry with Datadog?

Yes. Datadog natively supports OTLP ingestion for traces and metrics. You instrument with OTel SDKs, send data to the OTel Collector, and export to Datadog's OTLP endpoint. This gives you vendor-neutral instrumentation while using Datadog's platform. Some Datadog-specific features like Continuous Profiler work best with dd-trace, but core APM, dashboards, and alerting work well with OTel-sourced data.

Is OpenTelemetry really free?

The software is free and open source. The infrastructure to run it is not. You need compute for the OTel Collector (typically 2-4 vCPUs and 4-8 GB RAM for a mid-size deployment), a storage backend (Prometheus, Tempo, Loki -- either self-hosted or via Grafana Cloud), and engineering time to operate the pipeline. For small deployments, Grafana Cloud's free tier covers basic needs. At scale, the infrastructure and engineering costs are real but consistently lower than Datadog licensing.

What does Datadog cost for 100 hosts?

Datadog Pro list pricing for 100 hosts with Infrastructure Monitoring ($15/host), APM ($31/host), and Log Management (estimated 100 GB/day at $0.10/GB ingested) starts at about $4,900 per month for those three line items alone (100 x $46 in host fees plus roughly 3,000 GB x $0.10). Log indexing and retention are billed separately from ingestion, and with them a realistic bill runs approximately $15,000-20,000 per month before custom metrics, Synthetics, or other add-ons. Enterprise pricing includes additional features at higher per-host rates. Custom metric pricing ($0.05 per custom metric per month) is the cost that surprises most teams. Negotiate annual contracts for 20-40% discounts on list price.

How does tail sampling in the OTel Collector reduce costs?

Tail sampling evaluates complete traces before deciding whether to store them. You configure policies to keep 100% of error traces and slow traces (which you always want for debugging) while sampling a small percentage (e.g., 5-10%) of successful, fast traces. This typically reduces trace storage volume by 80-95% with minimal loss of debugging capability. The OTel Collector's tail_sampling processor handles this natively. Datadog offers similar ingestion controls, but since you pay per indexed span, the savings mechanism differs.
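You can estimate the reduction for your own traffic with one line of arithmetic: everything matched by an always-keep policy is stored, plus the baseline share of the rest. In the sketch below, `retainedFraction` is a hypothetical helper and the 2% always-kept share is an assumed input, so measure your own error and slow-request rates before relying on the result.

```typescript
// Fraction of traces stored after tail sampling.
// alwaysKeptFraction: share of traces matching an always-keep policy (0-1).
// baselinePercentage: probabilistic sampling rate for the rest (0-100).
function retainedFraction(alwaysKeptFraction: number, baselinePercentage: number): number {
  const baseline = baselinePercentage / 100;
  return alwaysKeptFraction + (1 - alwaysKeptFraction) * baseline;
}

// Assumed workload: 2% of traces are errors or slow, with a 5% baseline.
const kept = retainedFraction(0.02, 5);
const reductionPercent = (1 - kept) * 100;
console.log(kept.toFixed(3), reductionPercent.toFixed(1)); // 0.069 93.1
```

The always-kept share dominates the result. During an incident, when a large fraction of traces are errors, retained volume climbs sharply, so size storage and the Collector's memory for that case rather than for the steady state.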

How long does it take to migrate from Datadog to OpenTelemetry?

For a 10-service deployment, expect 4-6 weeks. For 50+ services, plan 3-6 months. The instrumentation swap (replacing dd-trace with OTel SDKs) is straightforward -- typically a day per service. The bottleneck is rebuilding dashboards, alerts, SLOs, and operational runbooks in the new stack. Parallel-run both systems during migration to validate data parity. The OTel Collector's multi-exporter capability makes this dual-write pattern easy.

Does Datadog support OpenTelemetry natively?

Datadog added native OTLP ingestion in 2023 and has steadily improved compatibility. The Datadog Agent can act as an OTLP receiver, and Datadog's backend maps OTel spans and metrics to its internal data model. However, some translations are imperfect -- OTel resource attributes may not map cleanly to Datadog tags, and metric naming conventions differ. Test your specific use cases. The Datadog exporter in the OTel Collector (contrib distribution) provides the best compatibility.

When should I avoid OpenTelemetry?

Avoid building an OTel-based stack if you have no platform engineering capacity, fewer than 10 services, or need production-ready observability within days rather than weeks. OTel's flexibility comes with operational complexity -- running the Collector at high availability, managing storage backends, configuring Grafana datasources, and troubleshooting pipeline issues all require engineering investment. If your team's strength is product development and you have budget for Datadog, the managed platform may be the right tradeoff.

Conclusion

OpenTelemetry and Datadog are not interchangeable alternatives -- they operate at different layers of the observability stack. OTel is an instrumentation standard and telemetry pipeline. Datadog is a complete managed platform. The right choice depends on your team size, service count, budget constraints, and how much operational complexity you're willing to absorb.

For most teams, the answer evolves over time. Start with Datadog if you need observability fast and have the budget. Instrument with OTel from day one if you can, using the hybrid approach to keep your options open. As you grow past 30-50 services, reassess -- the cost gap between Datadog and an OTel-based stack widens with every host you add, and those savings compound month after month.
