Observability

Centralized Log Management: Loki vs the ELK Stack vs CloudWatch

Compare Grafana Loki, the ELK Stack, and AWS CloudWatch Logs for centralized log management. Understand the architecture, query languages, cost tradeoffs, and which solution fits your team and infrastructure.

Abhishek Patel · 10 min read


Your Logs Are Scattered -- And That's Costing You

When production breaks at 3 AM, nobody wants to SSH into six different servers to grep through log files. Centralized log management aggregates logs from every service, container, and host into a single queryable system. It's the difference between resolving an incident in minutes and spending an hour just finding the right log line.

The three dominant approaches in 2026 are the ELK Stack (Elasticsearch, Logstash, Kibana), Grafana Loki, and AWS CloudWatch Logs. Each makes fundamentally different tradeoffs around cost, query power, and operational complexity. I've run all three in production, and the right choice depends heavily on your existing infrastructure, budget, and team size.

What Is Centralized Log Management?

Definition: Centralized log management is the practice of collecting, shipping, storing, and querying logs from all applications and infrastructure in a single platform. It enables full-text search, structured queries, alerting on log patterns, and correlation with metrics and traces for incident investigation.

Without centralization, logs live on individual hosts or in container stdout. Containers get restarted, hosts get terminated, and those logs vanish. Centralization solves three problems: durability (logs survive host failures), searchability (query across all services at once), and correlation (link logs to metrics and traces).

The ELK Stack: Powerful but Heavy

The ELK Stack -- Elasticsearch, Logstash, Kibana -- has been the default log management solution for over a decade. Elasticsearch provides full-text indexing and search. Filebeat ships logs from each host, Logstash parses and enriches them in transit, and Kibana provides visualization and dashboards.

Architecture

Applications --> Filebeat --> Logstash --> Elasticsearch --> Kibana
                   (ship)     (parse)       (index/store)    (query/visualize)

Strengths

  • Full-text search. Elasticsearch indexes every word in every log line. You can search for arbitrary strings, regex patterns, and complex boolean queries across terabytes of data.
  • Rich query language. KQL (Kibana Query Language) and the Elasticsearch Query DSL are powerful. Aggregations, histograms, and statistical queries are built in.
  • Mature ecosystem. Hundreds of Filebeat modules for common log formats. Logstash has parsers for everything.
  • Kibana dashboards. Build log-based visualizations, anomaly detection, and alerting rules.
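To make the aggregation support concrete, here is a hedged sketch of a Query DSL request that counts 5xx log entries per service over the last hour. The index pattern and field names (`http.status_code`, `service.name`) are assumptions -- yours depend on how your shipper maps fields:

```json
GET logs-production-*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "range": { "@timestamp": { "gte": "now-1h" } } },
        { "range": { "http.status_code": { "gte": 500 } } }
      ]
    }
  },
  "aggs": {
    "errors_by_service": {
      "terms": { "field": "service.name" }
    }
  }
}
```

Setting `"size": 0` skips returning individual documents, so the cluster does only the aggregation work.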

Weaknesses

  • Resource hungry. Elasticsearch needs significant memory (JVM heap) and fast storage. A production cluster for moderate log volume needs at least 3 nodes with 16 GB RAM each.
  • Operational complexity. Shard management, index lifecycle policies, cluster upgrades, and JVM tuning require dedicated expertise.
  • Cost at scale. Indexing every field in every log line means storage grows fast. Expect 30-50% overhead from the inverted index on top of raw log size.
# filebeat.yml -- shipping container logs to Elasticsearch
filebeat.inputs:
  - type: container
    paths:
      - /var/log/containers/*.log
    processors:
      - add_kubernetes_metadata:
          host: ${NODE_NAME}
          matchers:
            - logs_path:
                logs_path: "/var/log/containers/"

output.elasticsearch:
  hosts: ["https://elasticsearch:9200"]
  username: "elastic"
  password: "${ES_PASSWORD}"
  indices:
    - index: "logs-production-%{+yyyy.MM.dd}"
      when.contains:
        kubernetes.namespace: "production"
    - index: "logs-staging-%{+yyyy.MM.dd}"
      when.contains:
        kubernetes.namespace: "staging"

Grafana Loki: Label-Based and Cheap

Loki takes a radically different approach. Instead of indexing log content, it indexes only labels (metadata like service name, namespace, and pod). The actual log text is stored compressed and unindexed. This makes Loki dramatically cheaper to run but means full-text search requires scanning log chunks.

Architecture

Applications --> Promtail/Alloy --> Loki --> Grafana
                   (ship + label)   (store)   (query/visualize)

Strengths

  • Low cost. No full-text index means storage is 10-20x cheaper than Elasticsearch for the same log volume. Loki stores compressed chunks in object storage (S3, GCS).
  • Kubernetes native. Promtail (or Grafana Alloy) auto-discovers pods and applies labels from Kubernetes metadata.
  • LogQL. Loki's query language is inspired by PromQL. If your team knows Prometheus, LogQL feels familiar.
  • Grafana integration. Logs live alongside metrics and traces in a single Grafana pane. Click from a metric spike to the correlated logs instantly.
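A minimal Promtail scrape config shows how those labels come straight from Kubernetes service discovery. This is a sketch under common defaults -- the Loki URL and the chosen label set are assumptions:

```yaml
# promtail config sketch -- labels are derived from Kubernetes metadata
clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # keep label cardinality low: namespace, app, and pod only
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
```

Note what is deliberately absent: no request IDs, user IDs, or other high-cardinality values as labels, per the constraint discussed below.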

Weaknesses

  • Limited full-text search. Searching for a specific string requires scanning all chunks matching your label selector. Broad queries over large time ranges can be slow.
  • Label cardinality constraints. Too many unique label values degrade performance. You can't use high-cardinality fields like request IDs as labels.
  • Fewer parsing features. Loki parses at query time, not at ingestion. Complex parsing (like Logstash grok patterns) needs to happen in LogQL or in the shipping agent.
# LogQL examples

# Find error logs from the order service in the last hour
{namespace="production", app="order-service"} |= "error"

# Parse JSON logs and filter by status code
{app="api-gateway"} | json | status >= 500

# Count errors per service over 5-minute windows
sum by (app) (count_over_time({namespace="production"} |= "error" [5m]))

# Extract latency from logs and compute p99
quantile_over_time(0.99, {app="api-gateway"} | json | unwrap duration_ms [5m])

AWS CloudWatch Logs: Native but Locked In

If you're running on AWS, CloudWatch Logs is already collecting your Lambda, ECS, and EKS logs. There's zero setup for AWS services -- logs flow automatically. The question is whether CloudWatch's query capabilities and pricing work for your needs.

Strengths

  • Zero infrastructure. No clusters to manage. AWS handles scaling, storage, and availability.
  • Deep AWS integration. Lambda, ECS, EKS, API Gateway, RDS -- logs appear automatically.
  • CloudWatch Logs Insights. A SQL-like query language that's surprisingly capable for ad-hoc analysis.
  • Metric filters. Create CloudWatch metrics from log patterns without external tooling.
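As a sketch, a metric filter that counts ERROR lines can be created with the AWS CLI. The log group, metric name, and namespace here are placeholders:

```shell
# Create a CloudWatch metric that increments on every ERROR log line
aws logs put-metric-filter \
  --log-group-name /ecs/order-service \
  --filter-name ErrorCount \
  --filter-pattern "ERROR" \
  --metric-transformations \
      metricName=OrderServiceErrors,metricNamespace=App/Logs,metricValue=1
```

Once the metric exists, a standard CloudWatch alarm on it gives you log-pattern alerting with no extra infrastructure.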

Weaknesses

  • Expensive at scale. Ingestion costs $0.50/GB and storage costs $0.03/GB/month. At 100 GB/day, you're paying $1,500/month for ingestion alone.
  • Limited cross-account querying. Querying logs across multiple AWS accounts requires extra setup (cross-account log groups or a centralized logging account).
  • Vendor lock-in. Your log queries, dashboards, and alerts are AWS-specific. Moving to another platform means rebuilding everything.
  • Slower queries. CloudWatch Logs Insights scans data on query. Complex queries over large time ranges are noticeably slower than Elasticsearch.
-- CloudWatch Logs Insights query examples

-- Find the 10 slowest requests in the last hour
fields @timestamp, @message, duration_ms
| filter duration_ms > 1000
| sort duration_ms desc
| limit 10

-- Error count by service over time
filter @message like /ERROR/
| stats count(*) as errors by bin(5m), service_name

-- Parse JSON and aggregate
parse @message '{"level":"*","service":"*","duration":*}' as level, service, duration
| filter level = "error"
| stats avg(duration), max(duration), count(*) by service

Head-to-Head Comparison

| Feature | ELK Stack | Grafana Loki | CloudWatch Logs |
|---|---|---|---|
| Full-text search | Excellent (indexed) | Functional (scan-based) | Good (scan-based) |
| Query language | KQL / ES Query DSL | LogQL | Logs Insights (SQL-like) |
| Storage cost (1 TB/mo) | $150-300 (self-hosted) | $25-50 (object storage) | $500 (ingestion) + $30 (storage) |
| Operational burden | High (cluster management) | Medium (simpler architecture) | None (fully managed) |
| Kubernetes integration | Good (Filebeat) | Excellent (Promtail/Alloy) | Good (Fluent Bit) |
| Correlation with metrics | Limited (separate tool) | Excellent (Grafana native) | Good (CloudWatch Metrics) |
| Multi-cloud | Yes | Yes | AWS only |
| Retention flexibility | Full control (ILM policies) | Full control | Per log group (1 day to 10 years) |

Log Shipping Best Practices

Regardless of which backend you choose, these practices apply universally:

  1. Use structured JSON logging. Every log line should be a JSON object with consistent fields: timestamp, level, service, message, and trace_id. This makes parsing trivial on any backend.
  2. Ship from stdout/stderr. Don't write log files inside containers. Write to stdout, and let your log shipper (Filebeat, Promtail, Fluent Bit) collect from the container runtime.
  3. Add Kubernetes metadata. Namespace, pod name, container name, and node name should be attached automatically by your shipping agent. This metadata powers label-based filtering.
  4. Set retention policies by severity. Keep ERROR and WARN logs for 90 days. Keep INFO for 30 days. Drop DEBUG in production or keep it for 7 days max. This single strategy can cut your log storage costs by 60%.
  5. Sample verbose logs. If one service produces 80% of your log volume, sample its DEBUG logs at 10% or add rate limiting in the shipping agent.

Pro tip: Include a trace_id field in every structured log entry. This bridges logs and traces, letting you jump from a log line directly to the distributed trace that produced it. It's the single most valuable correlation field you can add.
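As a minimal sketch of practices 1 and the pro tip together, here is a JSON formatter for Python's standard logging module that emits the recommended fields, including a trace_id pulled off the log record. The field names match the article's recommendation; the formatter itself is illustrative, and real services would typically source trace_id from their tracing library's context:

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Emit each log record as one JSON line with consistent fields."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
            # trace_id is attached via `extra=`; empty if the caller didn't set one
            "trace_id": getattr(record, "trace_id", ""),
        }
        return json.dumps(entry)

logger = logging.getLogger("order-service")
handler = logging.StreamHandler()  # write to stderr/stdout, per practice #2
handler.setFormatter(JsonFormatter(service="order-service"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("payment failed", extra={"trace_id": "abc123"})
```

Because every line is a self-contained JSON object, any of the three backends can parse it: Elasticsearch at ingestion, Loki with `| json` at query time, CloudWatch with automatic field discovery.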

How to Choose

The decision usually comes down to three factors:

  1. Existing stack. Already using Grafana and Prometheus? Loki is the natural fit. Already on AWS with no plans to leave? CloudWatch is the path of least resistance. Already running Elasticsearch for search? Adding log management is straightforward.
  2. Query requirements. If you need to search for arbitrary strings across months of logs regularly, Elasticsearch's full-text index justifies the cost. If 90% of your queries start with "show me logs from this service in the last hour," Loki is more than enough.
  3. Budget. Loki with object storage is the cheapest option by a wide margin. CloudWatch is the most expensive at scale. ELK falls in between, with costs driven primarily by compute and storage for the Elasticsearch cluster.

Frequently Asked Questions

Can I use Loki for compliance or audit logging?

Yes, but with caveats. Loki stores logs durably in object storage and supports configurable retention. For compliance, ensure your object storage bucket has versioning and deletion protection enabled. The main limitation is query performance over very large time ranges -- searching six months of logs is slower than Elasticsearch. For audit trails that are rarely queried, Loki works well. For frequently searched compliance logs, Elasticsearch is stronger.

How do I migrate from ELK to Loki?

Run both systems in parallel. Configure your log shipper (Filebeat or Fluent Bit) to send logs to both Elasticsearch and Loki simultaneously. Move teams to Grafana dashboards one at a time, validating that LogQL queries produce equivalent results. Once all teams are on Loki, decommission Elasticsearch. Expect the migration to take 2-4 weeks for a medium-sized organization.
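In Fluent Bit, dual-shipping is just two [OUTPUT] sections matching the same stream. A sketch -- hostnames, index name, and labels are assumptions for illustration:

```ini
# fluent-bit.conf sketch -- ship the same stream to both backends during migration
[OUTPUT]
    Name   es
    Match  kube.*
    Host   elasticsearch
    Port   9200
    Index  logs-production

[OUTPUT]
    Name   loki
    Match  kube.*
    Host   loki
    Port   3100
    Labels job=fluent-bit
```

Because both outputs read from the same buffered stream, adding the second backend doesn't require touching application code.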

What is the best log shipper for Kubernetes?

For Loki, use Grafana Alloy (the successor to Promtail). For Elasticsearch, use Filebeat or Fluent Bit. For CloudWatch, use Fluent Bit with the CloudWatch output plugin. Fluent Bit is the most versatile -- it supports all three backends and has a small resource footprint. If you're undecided on your backend, starting with Fluent Bit gives you the most flexibility.

How much log volume is normal for a production application?

A typical microservice generates 1-10 GB of logs per day at moderate traffic. The variation comes from log level configuration -- a service logging every request at INFO produces 10x the volume of one logging only WARN and above. Total log volume for a 50-service Kubernetes cluster usually falls between 50-500 GB per day. If you're above this, review your log levels and consider sampling.

Should I parse logs at ingestion or query time?

Parse at ingestion for fields you query frequently -- severity level, service name, HTTP status code. This creates structured fields that are fast to filter on. Parse at query time for ad-hoc analysis -- extracting a specific field from a JSON blob you rarely search. ELK parses everything at ingestion (via Logstash). Loki parses at query time by default. The hybrid approach -- parse the most important fields at ingestion, leave the rest for query time -- is usually the best tradeoff.

How do I reduce log storage costs?

Five strategies that make the biggest impact: tier retention by log level (ERROR 90 days, DEBUG 7 days), drop or sample high-volume low-value logs at the shipper, compress aggressively (Loki and Elasticsearch both support this), use object storage tiers (S3 Infrequent Access for older logs), and enforce structured logging so you can filter precisely instead of storing everything "just in case."
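In Loki, the tiered-retention strategy maps to per-stream overrides in limits_config, enforced by the compactor. A sketch, assuming your streams carry a low-cardinality level label:

```yaml
# Loki retention sketch: compactor deletes, limits_config sets the periods
compactor:
  retention_enabled: true

limits_config:
  retention_period: 720h   # default: 30 days (covers INFO)
  retention_stream:
    - selector: '{level="error"}'
      priority: 2
      period: 2160h        # errors kept 90 days
    - selector: '{level="debug"}'
      priority: 1
      period: 168h         # debug dropped after 7 days
```

Elasticsearch achieves the same effect with ILM policies per index, and CloudWatch with per-log-group retention settings.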

Conclusion

There's no universally correct answer. ELK gives you the most powerful search but demands the most operational investment. Loki gives you the best cost-to-value ratio for teams already in the Grafana ecosystem. CloudWatch gives you zero-ops convenience at a premium price on AWS.

Whatever you choose, get structured JSON logging right first. A clean, consistent log format makes every backend work better. Add trace IDs from day one. Set retention policies before you have a cost problem, not after. And remember -- the goal isn't to store every log forever. It's to have the right logs available when something breaks at 3 AM.


Written by

Abhishek Patel

Infrastructure engineer with 10+ years building production systems on AWS, GCP, and bare metal. Writes practical guides on cloud architecture, containers, networking, and Linux for developers who want to understand how things actually work under the hood.
