OpenTelemetry vs Datadog: Open Standard or Managed Platform?
Compare OpenTelemetry and Datadog across total cost of ownership, instrumentation, vendor lock-in, and architecture. TCO at 10, 50, and 200 services, OTel Collector pipeline config, hybrid approach, and a phased migration guide.

The Renewal That Forced the Question
The Datadog renewal quote landed on a Tuesday: $847K/year, up from $512K. Twelve months of "we will revisit the custom metrics tier" had caught up with us. We were a 72-service, Series-B Node.js + Go platform, and every service had dutifully tagged its spans with customer_id, feature_flag, and a dozen other high-cardinality dimensions. Datadog's invoice was the bill for those tags. One line item -- "Custom Metrics" -- was $311K of the $847K.
We had a choice: negotiate the contract down (realistic ceiling, based on three vendor bake-offs: ~15% discount), or migrate to OpenTelemetry + Grafana and take a defensible position on our telemetry architecture for the next five years. We took option two. Six months later, the same observability surface area cost $71K/year in infrastructure plus one SRE's part-time attention. This guide is the decision framework we should have used two years earlier -- when the Datadog agent was "just a small cost" and every team was free-form tagging their way into a $300K annual Custom Metrics line.
OTel Is Plumbing. Datadog Is the House. You Pay for Both.
Before any comparison is useful, internalize that these two things operate at different layers. Treating them as "competitors" is a category error that drives most bad architecture decisions in this space.
OpenTelemetry is an instrumentation API, a set of SDKs, a wire protocol (OTLP), and a Collector. Full stop. It does not store data. It does not draw dashboards. It does not page you at 3 AM. It is the part of your observability stack that gets data from your services to somewhere. Swap the exporter in one config file and the same instrumentation suddenly writes to a different backend. That portability is the entire point.
Datadog is a vertically integrated SaaS that spans from the agent on your host all the way to the incident timeline your on-call engineer reads at 3 AM. Agents, ingestion, storage, indexing, dashboards, monitors, SLOs, APM, profiler, RUM, synthetics -- all under one UI, one login, one contract, one bill. That integration is why incident response feels fast in Datadog. It is also why the bill compounds.
The practical implication: you do not have to pick one layer and commit forever. You can instrument with OTel and ship to Datadog. You can instrument with dd-trace and later migrate via the OTel Collector's dual-write pipeline. The question is not "which tool" -- it is "where in the stack am I willing to be locked in, and for how long is that a good trade?" That question has very different answers at 10 services versus 200.
Architecture Comparison
The two approaches differ at every layer of the stack, and those differences determine who deploys each component, who operates it, and how easily it can be replaced later.
| Aspect | OpenTelemetry + OSS Stack | Datadog |
|---|---|---|
| Instrumentation | OTel SDKs (vendor-neutral APIs) | dd-trace libraries (proprietary) or OTel SDKs |
| Collection | OTel Collector (self-managed) | Datadog Agent (self-managed or serverless) |
| Storage | Prometheus, Tempo, Loki, ClickHouse (self-managed or cloud) | Datadog-managed (fully hosted) |
| Visualization | Grafana (self-hosted or Grafana Cloud) | Datadog dashboards (built-in) |
| Alerting | Alertmanager, Grafana Alerting | Datadog Monitors (built-in) |
| Data format | OTLP (open standard) | Proprietary + OTLP ingestion support |
| Operational burden | High -- you run the infrastructure | Low -- Datadog manages it |
Total Cost of Ownership at Three Scales
Cost is where this decision gets concrete. I've modeled TCO at three scales based on real-world deployments, including infrastructure, licensing, and engineering time to operate the stack. These numbers assume a containerized environment on AWS with average telemetry volume per service.
10-Service Startup
| Cost Component | OTel + Grafana Cloud | Datadog Pro |
|---|---|---|
| Platform/licensing | $0 (free tier covers it) | ~$690/mo (23 hosts x $15 infra + APM) |
| Infrastructure (Collector, storage) | ~$150/mo (small Collector + Grafana free tier) | $0 (Datadog-managed) |
| Engineering time (setup + maintenance) | ~40 hours initial, 4 hrs/mo ongoing | ~8 hours initial, 1 hr/mo ongoing |
| Estimated monthly TCO | ~$400-600 | ~$700-900 |
At this scale, Datadog is competitive. The engineering time savings nearly offset the licensing cost, and you get a polished experience from day one. For a startup with limited ops capacity, Datadog often wins here.
50-Service Mid-Stage Company
| Cost Component | OTel + Grafana Stack | Datadog Pro |
|---|---|---|
| Platform/licensing | ~$800/mo (Grafana Cloud Pro) | ~$5,500/mo (hosts + APM + log ingestion) |
| Infrastructure | ~$600/mo (Collector cluster, storage) | $0 |
| Engineering time | ~80 hours initial, 12 hrs/mo ongoing | ~20 hours initial, 4 hrs/mo ongoing |
| Estimated monthly TCO | ~$2,500-3,500 | ~$6,000-8,000 |
At 50 services, OTel + Grafana pulls clearly ahead: the monthly TCO is roughly half of Datadog's or less. The engineering overhead is real but manageable for a team that has a dedicated platform or SRE function, and the cost delta funds a meaningful share of an SRE salary.
200-Service Enterprise
| Cost Component | OTel + Grafana Stack | Datadog Enterprise |
|---|---|---|
| Platform/licensing | ~$4,000/mo (Grafana Cloud Advanced) | ~$50,000+/mo (hosts + APM + logs + custom metrics) |
| Infrastructure | ~$3,000/mo (HA Collector, object storage) | $0 |
| Engineering time | 1-2 dedicated SREs | 0.5 SRE for agent management |
| Estimated monthly TCO | ~$12,000-18,000 | ~$50,000-80,000 |
At enterprise scale, the gap is dramatic. Datadog's per-host pricing model compounds relentlessly. Custom metrics pricing alone can add five figures monthly. This is where large organizations either negotiate aggressively with Datadog or migrate to an OTel-based stack.
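To make the TCO rows easier to re-run with your own numbers, here is a minimal sketch that adds licensing, infrastructure, and ongoing engineering time for the 50-service table. The hourly rate of $100 is an assumption for illustration, not a figure from the tables, and one-time setup hours are left out, so the totals land near the low end of the ranges above rather than matching them exactly.

```typescript
// Monthly TCO sketch. The hourly rate below is an assumption, not source data.
interface StackCost {
  name: string;
  licensingPerMonth: number;        // USD, "Platform/licensing" row
  infrastructurePerMonth: number;   // USD, "Infrastructure" row
  engineeringHoursPerMonth: number; // ongoing hours from "Engineering time" row
}

const ASSUMED_HOURLY_RATE_USD = 100; // hypothetical loaded engineering rate

function monthlyTco(
  cost: StackCost,
  hourlyRate: number = ASSUMED_HOURLY_RATE_USD,
): number {
  return (
    cost.licensingPerMonth +
    cost.infrastructurePerMonth +
    cost.engineeringHoursPerMonth * hourlyRate
  );
}

// Figures copied from the 50-service table.
const otel50: StackCost = {
  name: "OTel + Grafana",
  licensingPerMonth: 800,
  infrastructurePerMonth: 600,
  engineeringHoursPerMonth: 12,
};
const datadog50: StackCost = {
  name: "Datadog Pro",
  licensingPerMonth: 5500,
  infrastructurePerMonth: 0,
  engineeringHoursPerMonth: 4,
};

for (const stack of [otel50, datadog50]) {
  console.log(`${stack.name}: ~$${monthlyTco(stack)}/mo`);
}
```

Changing the hourly rate is the quickest way to see how sensitive the comparison is to what your engineers actually cost.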
Instrumentation: OTel SDKs vs. dd-trace
Both approaches offer auto-instrumentation for common frameworks and manual instrumentation APIs for custom business logic. Here is how they compare in a Node.js application.
OpenTelemetry Instrumentation
```typescript
// tracing.ts -- loaded before application code
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const sdk = new NodeSDK({
  serviceName: 'payment-service',
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces',
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-fs': { enabled: false },
    }),
  ],
});

sdk.start();
```
Datadog dd-trace Instrumentation
```typescript
// tracing.ts -- loaded before application code
import tracer from 'dd-trace';

tracer.init({
  service: 'payment-service',
  env: 'production',
  version: '1.4.2',
  logInjection: true,
  runtimeMetrics: true,
  profiling: true,
});
```
The Datadog setup is undeniably simpler. Fewer packages, less configuration, and features like profiling and runtime metrics are built in. OTel requires more explicit configuration but gives you portability -- that same instrumentation code works with Jaeger, Tempo, Honeycomb, or any OTLP-compatible backend.
The OTel Collector: Your Telemetry Pipeline
The OpenTelemetry Collector is the architectural component that makes OTel powerful. It sits between your services and your backends, acting as a vendor-neutral telemetry router that can process, filter, sample, enrich, and fan out data to multiple destinations simultaneously.
```yaml
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch:
    timeout: 5s
    send_batch_size: 2048
  memory_limiter:
    check_interval: 1s
    limit_mib: 1024
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors-always
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-requests
        type: latency
        latency: { threshold_ms: 2000 }
      - name: baseline
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }
  attributes:
    actions:
      - key: deployment.environment
        value: production
        action: upsert
exporters:
  otlphttp/grafana:
    endpoint: https://otlp-gateway-prod-us-east.grafana.net/otlp
    headers:
      Authorization: "Basic ${GRAFANA_OTLP_TOKEN}"
  datadog:
    api:
      key: ${DD_API_KEY}
      site: datadoghq.com
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch, attributes]
      exporters: [otlphttp/grafana, datadog]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp/grafana]
```
This configuration demonstrates the Collector's killer feature: multi-destination export. You can send traces to both Grafana Cloud and Datadog simultaneously, making migration incremental rather than all-or-nothing. The tail sampling processor keeps 100% of errors and slow traces while sampling 5% of routine traffic, drastically reducing storage costs.
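The three sampling policies combine as a logical OR: a trace is kept if any one of them matches. The sketch below models that decision in plain TypeScript to make the behavior easy to reason about. It is an illustration, not the Collector's implementation: the `TraceSummary` shape and both function names are hypothetical, the real processor buffers spans for `decision_wait` before deciding, and its probabilistic policy is based on the trace ID rather than a random draw.

```typescript
// Simplified model of the errors-always + slow-requests + baseline policies.
interface TraceSummary {
  hasError: boolean;  // at least one span finished with status ERROR
  durationMs: number; // end-to-end duration of the whole trace
}

const LATENCY_THRESHOLD_MS = 2000; // mirrors threshold_ms in the config
const BASELINE_PERCENT = 5;        // mirrors sampling_percentage in the config

// A trace is kept as soon as any single policy says "sample".
function keepTrace(
  trace: TraceSummary,
  random: () => number = Math.random,
): boolean {
  if (trace.hasError) return true;                          // errors-always
  if (trace.durationMs > LATENCY_THRESHOLD_MS) return true; // slow-requests
  return random() * 100 < BASELINE_PERCENT;                 // baseline
}

// Expected share of traces kept, assuming error traces and slow traces
// do not overlap (an assumption that slightly overestimates the result).
function expectedKeptFraction(errorRate: number, slowRate: number): number {
  const routine = 1 - errorRate - slowRate;
  return errorRate + slowRate + routine * (BASELINE_PERCENT / 100);
}

console.log(expectedKeptFraction(0.01, 0.02).toFixed(4));
```

With an illustrative 1% error rate and 2% slow-request rate, roughly 8% of traces survive sampling. The baseline percentage dominates the result whenever errors and slow requests are rare, so that is the knob to tune first.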
Grafana Flexibility vs. Datadog Polish
On the visualization side, the tradeoff is customizability versus out-of-the-box experience.
| Feature | Grafana | Datadog |
|---|---|---|
| Dashboard building | Extremely flexible, any data source | Polished templates, guided setup |
| Data sources | 100+ plugins (Prometheus, Loki, Tempo, Postgres, etc.) | Datadog metrics/traces/logs only |
| Alerting | Multi-source, Alertmanager or Grafana-native | Integrated monitors with anomaly detection |
| Trace-to-log correlation | Manual config (Tempo + Loki linking) | Automatic, zero config |
| APM service map | Requires Tempo + service graph connector | Built-in, auto-generated |
| Learning curve | Steeper -- PromQL, LogQL, TraceQL | Lower -- unified query interface |
| Notebooks/collaboration | Basic annotations | Full notebooks, incident timelines |
Datadog's strength is correlation. Click a spike on a metric dashboard, pivot to traces for that time window, drill into a specific trace, jump to the associated logs -- all without leaving the platform. Grafana can do this too with Tempo, Loki, and Prometheus, but the linking requires configuration and the experience is less smooth. For teams that value speed-to-insight during incidents, Datadog's polish is real.
Vendor Lock-In: The Hidden Cost
Vendor lock-in is the argument most cited for OTel, and it deserves a nuanced discussion rather than hand-waving.
Datadog lock-in is real and multifaceted:
- Instrumentation lock-in: dd-trace libraries use proprietary span formats and tags. Migrating means re-instrumenting every service.
- Dashboard lock-in: Datadog dashboards, monitors, and SLOs are defined in Datadog's proprietary format. You can export their JSON definitions, but Grafana and other tools cannot import them directly, so they have to be rebuilt.
- Custom metrics lock-in: DogStatsD metric naming conventions differ from Prometheus/OTel conventions. Migration requires renaming and re-alerting.
- Workflow lock-in: Incident management, runbooks, and on-call workflows built in Datadog must be rebuilt elsewhere.
OTel avoids instrumentation lock-in by design:
- OTLP is an open standard supported by every major backend.
- Switching from Tempo to Honeycomb means changing one exporter config in the Collector.
- Your application code never changes when you swap backends.
- Grafana dashboards can be version-controlled as JSON and migrated between instances.
That said, OTel does not eliminate all lock-in. If you build heavily on Grafana Cloud's specific features (Adaptive Metrics, for example), you carry some platform dependency. The difference is that the instrumentation layer -- the part that touches every service -- remains portable.
The Hybrid Approach: OTel Instrumentation with Datadog Backend
You don't have to choose one or the other at every layer. The most pragmatic approach for many teams is a hybrid: instrument with OpenTelemetry, send to Datadog.
```yaml
# Hybrid: OTel Collector sending to Datadog
exporters:
  datadog:
    api:
      key: ${DD_API_KEY}
      site: datadoghq.com
    traces:
      span_name_as_resource_name: true
    metrics:
      resource_attributes_as_tags: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [datadog]
```
This gives you Datadog's dashboards, APM, and alerting while keeping your instrumentation vendor-neutral. If you later decide to move off Datadog, you change the Collector's exporter -- not your application code. Datadog supports OTLP ingestion natively, so compatibility is solid.
Caveats of the hybrid approach:
- Some Datadog-specific features (Continuous Profiler, Error Tracking deep integration) work better with dd-trace.
- OTel metric naming conventions may not map perfectly to Datadog's expectations. Test your dashboards.
- You still pay Datadog's pricing -- the hybrid approach saves you from instrumentation lock-in, not from licensing costs.
Migration Guide: Datadog to OTel + Grafana
If you're moving off Datadog, here is the phased approach that minimizes risk:
- Phase 1 -- Deploy the OTel Collector alongside the Datadog Agent. Configure it to receive OTLP and export to both Datadog and your target backend (e.g., Grafana Cloud). This lets you validate data parity without disrupting existing dashboards.
- Phase 2 -- Migrate instrumentation service by service. Replace dd-trace with OTel SDKs in non-critical services first. Verify traces and metrics appear correctly in both backends. Use feature flags to toggle between instrumentation libraries during the transition.
- Phase 3 -- Rebuild dashboards and alerts. Recreate your most critical Datadog dashboards in Grafana. Start with SLO dashboards and on-call views. This is the most time-consuming step -- budget 2-4 weeks for a 50-service deployment.
- Phase 4 -- Cut over and decommission. Once all services emit OTel telemetry and all critical dashboards exist in Grafana, remove the Datadog exporter from the Collector and cancel the contract. Keep Datadog read-only access for 30 days to handle any gaps.
Migration reality check: Plan for 3-6 months for a 50+ service deployment. The instrumentation swap is the easy part. Rebuilding institutional knowledge embedded in Datadog dashboards, monitors, and runbooks takes longer than anyone estimates. Do not underestimate phase 3.
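Validating data parity during the dual-write phases is easier with a repeatable check than with eyeballing two UIs. A minimal sketch, assuming you can pull per-service span counts for the same time window from both backends: the `SpanCounts` shape, the `findParityGaps` helper, and the 2% tolerance are all hypothetical, and fetching the counts from each backend is left out.

```typescript
// Span counts per service for one time window, one map per backend.
type SpanCounts = Record<string, number>;

interface ParityGap {
  service: string;
  datadog: number;
  target: number;
  relativeDiff: number; // |a - b| divided by the larger of the two counts
}

// Returns every service whose counts differ by more than the tolerance,
// including services that appear in only one backend.
function findParityGaps(
  datadog: SpanCounts,
  target: SpanCounts,
  tolerance: number = 0.02, // assumed acceptable drift, tune to taste
): ParityGap[] {
  const services = Object.keys({ ...datadog, ...target });
  const gaps: ParityGap[] = [];
  for (const service of services) {
    const a = datadog[service] ?? 0;
    const b = target[service] ?? 0;
    const larger = Math.max(a, b);
    const relativeDiff = larger === 0 ? 0 : Math.abs(a - b) / larger;
    if (relativeDiff > tolerance) {
      gaps.push({ service, datadog: a, target: b, relativeDiff });
    }
  }
  return gaps;
}

// Illustrative counts, not real data.
const gaps = findParityGaps(
  { checkout: 1000, search: 500 },
  { checkout: 995, search: 400 },
);
console.log(gaps.map((gap) => gap.service));
```

Because both exporters sit behind the same Collector pipeline, persistent gaps usually point at an exporter or ingestion problem rather than at the instrumentation itself.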
Decision Framework
Use this framework to decide which approach fits your team:
| Choose | When |
|---|---|
| Datadog | Small team (fewer than 5 engineers), fewer than 20 services, no dedicated SRE, need observability fast, budget is not the primary constraint |
| OTel + Grafana | Platform/SRE team available, 30+ services, cost-sensitive, multi-cloud or hybrid environments, vendor independence is a strategic priority |
| Hybrid (OTel + Datadog) | Currently on Datadog and want to reduce future lock-in, planning eventual migration, need Datadog features today but want portable instrumentation |
Failure Modes: What Actually Breaks in Production
Datadog custom-metrics bill shock. Every statsd tag combination is a billable custom metric. A team I worked with tagged their request counter with customer_id (140K customers) and route (310 routes) -- that is up to 43.4M unique series, which at $0.05/series/mo comes to $2.17M/month (about $26M/year) at list price for a single counter. Fix: cap cardinality at ingest, use exemplars for high-cardinality drill-down, and alert on metrics_monthly_count from Datadog's own usage API before finance does.
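The cardinality arithmetic is worth running before a new tag ships. A minimal sketch, using the tag counts and the $0.05 per series per month price quoted above; the helper names are hypothetical, and the result is a worst case that assumes every tag combination actually occurs.

```typescript
interface TagCardinality {
  tag: string;
  distinctValues: number;
}

// Worst-case series count is the product of every tag's distinct values.
function worstCaseSeries(tags: TagCardinality[]): number {
  return tags.reduce((product, t) => product * t.distinctValues, 1);
}

// Price is passed in cents so the multiplication stays in whole numbers.
function monthlyCostUsd(series: number, centsPerSeriesPerMonth: number): number {
  return (series * centsPerSeriesPerMonth) / 100;
}

const series = worstCaseSeries([
  { tag: "customer_id", distinctValues: 140000 },
  { tag: "route", distinctValues: 310 },
]);
const monthly = monthlyCostUsd(series, 5); // 5 cents = $0.05 per series per month
const yearly = monthly * 12;

console.log({ series, monthly, yearly });
```

For the counter above this gives 43,400,000 series, $2,170,000 per month, and $26,040,000 per year at list price. Real bills are lower when not every customer hits every route, but the multiplication shows why one extra high-cardinality tag matters more than dozens of low-cardinality ones.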
OTel Collector OOMs under burst load. The default Collector config has no memory limiter. Under burst load it accepts more spans than it can buffer and gets OOM-killed by the kernel, which loses in-flight telemetry right when you need it most. Fix: always configure the memory_limiter processor first in every pipeline and set its limit to roughly 75% of the container's memory limit.
dd-trace auto-instrumentation double-reports. Enable dd-trace in a Node.js app that also has OTel auto-instrumentation, and you get two spans per HTTP request. This doubles your APM bill and makes your traces look like the app is doing everything twice. Fix: disable one of them. Put a test in CI that asserts only one instrumentation library is loaded.
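One way to approximate that CI assertion is to inspect the service's declared dependencies. The sketch below is an assumption-heavy simplification: it checks what is installed rather than what is loaded at runtime, the `findTracerConflict` helper and the version strings are hypothetical, and in a real pipeline the dependency map would come from the service's package.json.

```typescript
// Packages treated as tracing auto-instrumentation in this sketch.
const DATADOG_TRACERS = ["dd-trace"];
const OTEL_TRACERS = [
  "@opentelemetry/sdk-node",
  "@opentelemetry/auto-instrumentations-node",
];

type DependencyMap = Record<string, string>;

// Returns the offending package names when both vendors' tracers are
// present, or an empty array when at most one vendor is installed.
function findTracerConflict(deps: DependencyMap): string[] {
  const datadog = DATADOG_TRACERS.filter((name) => name in deps);
  const otel = OTEL_TRACERS.filter((name) => name in deps);
  return datadog.length > 0 && otel.length > 0 ? [...datadog, ...otel] : [];
}

// Hypothetical "dependencies" section of a package.json.
const deps: DependencyMap = {
  "dd-trace": "^5.0.0",
  "@opentelemetry/sdk-node": "^0.52.0",
  express: "^4.19.0",
};

const conflict = findTracerConflict(deps);
if (conflict.length > 0) {
  console.error(`Two tracing libraries installed: ${conflict.join(", ")}`);
}
```

During a phased migration some services will legitimately carry both packages behind a feature flag, so a check like this probably needs an explicit allowlist for services in transition.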
Tail sampling drops the traces you actually need. Naive sampling ("keep 10% of traces") discards 90% of traces at random, and the one trace containing the slow database query you are hunting is probably among them. Fix: policy-based sampling with error-always + latency-threshold + baseline-probabilistic, as in the Collector config above. Never deploy probabilistic-only sampling in production.
Datadog log quota blown by a debug flag. Someone flips LOG_LEVEL=debug in production "for five minutes" to investigate something. They forget. 36 hours later, the monthly log quota is gone and you are paying $0.10/GB overage on 400 GB/day. Fix: make LOG_LEVEL part of the feature-flag system with an automatic 30-minute expiry, and alert on log-volume anomalies.
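The automatic expiry can be as small as a timestamp stored next to the override. A minimal sketch, assuming the override is kept somewhere the service can read (a feature-flag store, for example): the `LogLevelOverride` shape and `effectiveLogLevel` helper are hypothetical, and the 30-minute TTL comes from the fix described above.

```typescript
type LogLevel = "debug" | "info" | "warn" | "error";

interface LogLevelOverride {
  level: LogLevel;
  setAtMs: number; // epoch milliseconds when the override was enabled
}

const OVERRIDE_TTL_MS = 30 * 60 * 1000; // 30-minute expiry

// Falls back to the default level once the override is older than the TTL,
// so a forgotten debug flag cannot stay on indefinitely.
function effectiveLogLevel(
  defaultLevel: LogLevel,
  override: LogLevelOverride | undefined,
  nowMs: number,
): LogLevel {
  if (override === undefined) return defaultLevel;
  const expired = nowMs - override.setAtMs >= OVERRIDE_TTL_MS;
  return expired ? defaultLevel : override.level;
}

const setAt = Date.UTC(2024, 0, 1, 12, 0);
const override: LogLevelOverride = { level: "debug", setAtMs: setAt };

const fiveMinutesLater = effectiveLogLevel("info", override, setAt + 5 * 60 * 1000);
const thirtySixHoursLater = effectiveLogLevel("info", override, setAt + 36 * 60 * 60 * 1000);
console.log(fiveMinutesLater, thirtySixHoursLater);
```

Evaluating the level on each read, rather than scheduling a timer to reset it, means the expiry still holds after a restart or redeploy.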
Frequently Asked Questions
Can I use OpenTelemetry with Datadog?
Yes. Datadog natively supports OTLP ingestion for traces and metrics. You instrument with OTel SDKs, send data to the OTel Collector, and export to Datadog's OTLP endpoint. This gives you vendor-neutral instrumentation while using Datadog's platform. Some Datadog-specific features like Continuous Profiler work best with dd-trace, but core APM, dashboards, and alerting work well with OTel-sourced data.
Is OpenTelemetry really free?
The software is free and open source. The infrastructure to run it is not. You need compute for the OTel Collector (typically 2-4 vCPUs and 4-8 GB RAM for a mid-size deployment), a storage backend (Prometheus, Tempo, Loki -- either self-hosted or via Grafana Cloud), and engineering time to operate the pipeline. For small deployments, Grafana Cloud's free tier covers basic needs. At scale, the infrastructure and engineering costs are real but consistently lower than Datadog licensing.
What does Datadog cost for 100 hosts?
Datadog Pro pricing for 100 hosts with Infrastructure Monitoring ($15/host), APM ($31/host), and Log Management (estimated 100 GB/day at $0.10/GB) runs approximately $15,000-20,000 per month before custom metrics, Synthetics, or other add-ons. Enterprise pricing includes additional features at higher per-host rates. Custom metric pricing (roughly $0.05 per custom metric per month, counted per unique tag combination) is the cost that surprises most teams. Negotiate annual contracts for 20-40% discounts on list price.
How does tail sampling in the OTel Collector reduce costs?
Tail sampling evaluates complete traces before deciding whether to store them. You configure policies to keep 100% of error traces and slow traces (which you always want for debugging) while sampling a small percentage (e.g., 5-10%) of successful, fast traces. This typically reduces trace storage volume by 80-95% with minimal loss of debugging capability. The OTel Collector's tail_sampling processor handles this natively. Datadog offers similar ingestion controls, but since you pay per indexed span, the savings mechanism differs.
How long does it take to migrate from Datadog to OpenTelemetry?
For a 10-service deployment, expect 4-6 weeks. For 50+ services, plan 3-6 months. The instrumentation swap (replacing dd-trace with OTel SDKs) is straightforward -- typically a day per service. The bottleneck is rebuilding dashboards, alerts, SLOs, and operational runbooks in the new stack. Parallel-run both systems during migration to validate data parity. The OTel Collector's multi-exporter capability makes this dual-write pattern easy.
Does Datadog support OpenTelemetry natively?
Datadog added native OTLP ingestion in 2023 and has steadily improved compatibility. The Datadog Agent can act as an OTLP receiver, and Datadog's backend maps OTel spans and metrics to its internal data model. However, some translations are imperfect -- OTel resource attributes may not map cleanly to Datadog tags, and metric naming conventions differ. Test your specific use cases. The Datadog exporter in the OTel Collector (contrib distribution) provides the best compatibility.
When should I avoid OpenTelemetry?
Avoid building an OTel-based stack if you have no platform engineering capacity, fewer than 10 services, or need production-ready observability within days rather than weeks. OTel's flexibility comes with operational complexity -- running the Collector at high availability, managing storage backends, configuring Grafana datasources, and troubleshooting pipeline issues all require engineering investment. If your team's strength is product development and you have budget for Datadog, the managed platform may be the right tradeoff.
Conclusion
OpenTelemetry and Datadog are not interchangeable alternatives -- they operate at different layers of the observability stack. OTel is an instrumentation standard and telemetry pipeline. Datadog is a complete managed platform. The right choice depends on your team size, service count, budget constraints, and how much operational complexity you're willing to absorb.
For most teams, the answer evolves over time. Start with Datadog if you need observability fast and have the budget. Instrument with OTel from day one if you can, using the hybrid approach to keep your options open. As you grow past 30-50 services, reassess -- the cost gap between Datadog and an OTel-based stack widens with every host you add, and those savings compound month after month.
Written by
Abhishek Patel
Infrastructure engineer with 10+ years building production systems on AWS, GCP, and bare metal. Writes practical guides on cloud architecture, containers, networking, and Linux for developers who want to understand how things actually work under the hood.