Observability

Centralized Log Management: Loki vs the ELK Stack vs CloudWatch

Compare Grafana Loki, the ELK Stack, and AWS CloudWatch Logs for centralized log management. Understand the architecture, query languages, cost tradeoffs, and which solution fits your team and infrastructure.

Abhishek Patel · 10 min read


Your Logs Are Scattered -- And That's Costing You

When production breaks at 3 AM, nobody wants to SSH into six different servers to grep through log files. Centralized log management aggregates logs from every service, container, and host into a single queryable system. It's the difference between resolving an incident in minutes and spending an hour just finding the right log line.

The three dominant approaches in 2026 are the ELK Stack (Elasticsearch, Logstash, Kibana), Grafana Loki, and AWS CloudWatch Logs. Each makes fundamentally different tradeoffs around cost, query power, and operational complexity. I've run all three in production, and the right choice depends heavily on your existing infrastructure, budget, and team size.

What Is Centralized Log Management?

Definition: Centralized log management is the practice of collecting, shipping, storing, and querying logs from all applications and infrastructure in a single platform. It enables full-text search, structured queries, alerting on log patterns, and correlation with metrics and traces for incident investigation.

Without centralization, logs live on individual hosts or in container stdout. Containers get restarted, hosts get terminated, and those logs vanish. Centralization solves three problems: durability (logs survive host failures), searchability (query across all services at once), and correlation (link logs to metrics and traces).

The ELK Stack: Powerful but Heavy

The ELK Stack -- Elasticsearch, Logstash, Kibana -- has been the default log management solution for over a decade. Elasticsearch provides full-text indexing and search. Filebeat ships logs from each host, Logstash parses and enriches them in transit, and Kibana provides visualization and dashboards.

Architecture

Applications --> Filebeat --> Logstash --> Elasticsearch --> Kibana
                   (ship)     (parse)       (index/store)    (query/visualize)

Strengths

  • Full-text search. Elasticsearch indexes every word in every log line. You can search for arbitrary strings, regex patterns, and complex boolean queries across terabytes of data.
  • Rich query language. KQL (Kibana Query Language) and the Elasticsearch Query DSL are powerful. Aggregations, histograms, and statistical queries are built in.
  • Mature ecosystem. Hundreds of Filebeat modules for common log formats. Logstash has parsers for everything.
  • Kibana dashboards. Build log-based visualizations, anomaly detection, and alerting rules.
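To make the aggregation support concrete, here is a hedged sketch of a Query DSL request that counts 5xx log entries per service over the last hour. The index pattern and field names (`http.status_code`, `service.name`) are assumptions -- yours depend on how your shipper maps fields:

```json
GET logs-production-*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "range": { "@timestamp": { "gte": "now-1h" } } },
        { "range": { "http.status_code": { "gte": 500 } } }
      ]
    }
  },
  "aggs": {
    "errors_by_service": {
      "terms": { "field": "service.name" }
    }
  }
}
```

Setting `"size": 0` skips returning individual documents, so the cluster does only the aggregation work.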

Weaknesses

  • Resource hungry. Elasticsearch needs significant memory (JVM heap) and fast storage. A production cluster for moderate log volume needs at least 3 nodes with 16 GB RAM each.
  • Operational complexity. Shard management, index lifecycle policies, cluster upgrades, and JVM tuning require dedicated expertise.
  • Cost at scale. Indexing every field in every log line means storage grows fast. Expect 30-50% overhead from the inverted index on top of raw log size.
# filebeat.yml -- shipping container logs to Elasticsearch
filebeat.inputs:
  - type: container
    paths:
      - /var/log/containers/*.log
    processors:
      - add_kubernetes_metadata:
          host: ${NODE_NAME}
          matchers:
            - logs_path:
                logs_path: "/var/log/containers/"

output.elasticsearch:
  hosts: ["https://elasticsearch:9200"]
  username: "elastic"
  password: "${ES_PASSWORD}"
  indices:
    - index: "logs-production-%{+yyyy.MM.dd}"
      when.contains:
        kubernetes.namespace: "production"
    - index: "logs-staging-%{+yyyy.MM.dd}"
      when.contains:
        kubernetes.namespace: "staging"

Grafana Loki: Label-Based and Cheap

Loki takes a radically different approach. Instead of indexing log content, it indexes only labels (metadata like service name, namespace, and pod). The actual log text is stored compressed and unindexed. This makes Loki dramatically cheaper to run but means full-text search requires scanning log chunks.

Architecture

Applications --> Promtail/Alloy --> Loki --> Grafana
                   (ship + label)   (store)   (query/visualize)

Strengths

  • Low cost. No full-text index means storage is 10-20x cheaper than Elasticsearch for the same log volume. Loki stores compressed chunks in object storage (S3, GCS).
  • Kubernetes native. Promtail (or Grafana Alloy) auto-discovers pods and applies labels from Kubernetes metadata.
  • LogQL. Loki's query language is inspired by PromQL. If your team knows Prometheus, LogQL feels familiar.
  • Grafana integration. Logs live alongside metrics and traces in a single Grafana pane. Click from a metric spike to the correlated logs instantly.
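A minimal Promtail scrape config shows how those labels come straight from Kubernetes service discovery. This is a sketch under common defaults -- the Loki URL and the chosen label set are assumptions:

```yaml
# promtail config sketch -- labels are derived from Kubernetes metadata
clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # keep label cardinality low: namespace, app, and pod only
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
```

Note what is deliberately absent: no request IDs, user IDs, or other high-cardinality values as labels, per the constraint discussed below.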

Weaknesses

  • Limited full-text search. Searching for a specific string requires scanning all chunks matching your label selector. Broad queries over large time ranges can be slow.
  • Label cardinality constraints. Too many unique label values degrade performance. You can't use high-cardinality fields like request IDs as labels.
  • Fewer parsing features. Loki parses at query time, not at ingestion. Complex parsing (like Logstash grok patterns) needs to happen in LogQL or in the shipping agent.
# LogQL examples

# Find error logs from the order service in the last hour
{namespace="production", app="order-service"} |= "error"

# Parse JSON logs and filter by status code
{app="api-gateway"} | json | status >= 500

# Count errors per service over 5-minute windows
sum by (app) (count_over_time({namespace="production"} |= "error" [5m]))

# Extract latency from logs and compute p99
quantile_over_time(0.99, {app="api-gateway"} | json | unwrap duration_ms [5m])

AWS CloudWatch Logs: Native but Locked In

If you're running on AWS, CloudWatch Logs is already collecting your Lambda, ECS, and EKS logs. There's zero setup for AWS services -- logs flow automatically. The question is whether CloudWatch's query capabilities and pricing work for your needs.

Strengths

  • Zero infrastructure. No clusters to manage. AWS handles scaling, storage, and availability.
  • Deep AWS integration. Lambda, ECS, EKS, API Gateway, RDS -- logs appear automatically.
  • CloudWatch Logs Insights. A SQL-like query language that's surprisingly capable for ad-hoc analysis.
  • Metric filters. Create CloudWatch metrics from log patterns without external tooling.
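As a sketch, a metric filter that counts ERROR lines can be created with the AWS CLI. The log group, metric name, and namespace here are placeholders:

```shell
# Create a CloudWatch metric that increments on every ERROR log line
aws logs put-metric-filter \
  --log-group-name /ecs/order-service \
  --filter-name ErrorCount \
  --filter-pattern "ERROR" \
  --metric-transformations \
      metricName=OrderServiceErrors,metricNamespace=App/Logs,metricValue=1
```

Once the metric exists, a standard CloudWatch alarm on it gives you log-pattern alerting with no extra infrastructure.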

Weaknesses

  • Expensive at scale. Ingestion costs $0.50/GB and storage costs $0.03/GB/month. At 100 GB/day, you're paying $1,500/month for ingestion alone.
  • Limited cross-account querying. Querying logs across multiple AWS accounts requires extra setup (cross-account log groups or a centralized logging account).
  • Vendor lock-in. Your log queries, dashboards, and alerts are AWS-specific. Moving to another platform means rebuilding everything.
  • Slower queries. CloudWatch Logs Insights scans data on query. Complex queries over large time ranges are noticeably slower than Elasticsearch.
-- CloudWatch Logs Insights query examples

-- Find the 10 slowest requests in the last hour
fields @timestamp, @message, duration_ms
| filter duration_ms > 1000
| sort duration_ms desc
| limit 10

-- Error count by service over time
filter @message like /ERROR/
| stats count(*) as errors by bin(5m), service_name

-- Parse JSON and aggregate
parse @message '{"level":"*","service":"*","duration":*}' as level, service, duration
| filter level = "error"
| stats avg(duration), max(duration), count(*) by service

Head-to-Head Comparison

| Feature | ELK Stack | Grafana Loki | CloudWatch Logs |
|---|---|---|---|
| Full-text search | Excellent (indexed) | Functional (scan-based) | Good (scan-based) |
| Query language | KQL / ES Query DSL | LogQL | Logs Insights (SQL-like) |
| Storage cost (1 TB/mo) | $150-300 (self-hosted) | $25-50 (object storage) | $500 (ingestion) + $30 (storage) |
| Operational burden | High (cluster management) | Medium (simpler architecture) | None (fully managed) |
| Kubernetes integration | Good (Filebeat) | Excellent (Promtail/Alloy) | Good (Fluent Bit) |
| Correlation with metrics | Limited (separate tool) | Excellent (Grafana native) | Good (CloudWatch Metrics) |
| Multi-cloud | Yes | Yes | AWS only |
| Retention flexibility | Full control (ILM policies) | Full control | Per log group (1 day to 10 years) |

Log Shipping Best Practices

Regardless of which backend you choose, these practices apply universally:

  1. Use structured JSON logging. Every log line should be a JSON object with consistent fields: timestamp, level, service, message, and trace_id. This makes parsing trivial on any backend.
  2. Ship from stdout/stderr. Don't write log files inside containers. Write to stdout, and let your log shipper (Filebeat, Promtail, Fluent Bit) collect from the container runtime.
  3. Add Kubernetes metadata. Namespace, pod name, container name, and node name should be attached automatically by your shipping agent. This metadata powers label-based filtering.
  4. Set retention policies by severity. Keep ERROR and WARN logs for 90 days. Keep INFO for 30 days. Drop DEBUG in production or keep it for 7 days max. This single strategy can cut your log storage costs by 60%.
  5. Sample verbose logs. If one service produces 80% of your log volume, sample its DEBUG logs at 10% or add rate limiting in the shipping agent.

Pro tip: Include a trace_id field in every structured log entry. This bridges logs and traces, letting you jump from a log line directly to the distributed trace that produced it. It's the single most valuable correlation field you can add.
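As a minimal sketch of practices 1 and the pro tip together, here is a JSON formatter for Python's standard logging module that emits the recommended fields, including a trace_id pulled off the log record. The field names match the article's recommendation; the formatter itself is illustrative, and real services would typically source trace_id from their tracing library's context:

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Emit each log record as one JSON line with consistent fields."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
            # trace_id is attached via `extra=`; empty if the caller didn't set one
            "trace_id": getattr(record, "trace_id", ""),
        }
        return json.dumps(entry)

logger = logging.getLogger("order-service")
handler = logging.StreamHandler()  # write to stderr/stdout, per practice #2
handler.setFormatter(JsonFormatter(service="order-service"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("payment failed", extra={"trace_id": "abc123"})
```

Because every line is a self-contained JSON object, any of the three backends can parse it: Elasticsearch at ingestion, Loki with `| json` at query time, CloudWatch with automatic field discovery.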

How to Choose

The decision usually comes down to three factors:

  1. Existing stack. Already using Grafana and Prometheus? Loki is the natural fit. Already on AWS with no plans to leave? CloudWatch is the path of least resistance. Already running Elasticsearch for search? Adding log management is straightforward.
  2. Query requirements. If you need to search for arbitrary strings across months of logs regularly, Elasticsearch's full-text index justifies the cost. If 90% of your queries start with "show me logs from this service in the last hour," Loki is more than enough.
  3. Budget. Loki with object storage is the cheapest option by a wide margin. CloudWatch is the most expensive at scale. ELK falls in between, with costs driven primarily by compute and storage for the Elasticsearch cluster.

Frequently Asked Questions

Can I use Loki for compliance or audit logging?

Yes, but with caveats. Loki stores logs durably in object storage and supports configurable retention. For compliance, ensure your object storage bucket has versioning and deletion protection enabled. The main limitation is query performance over very large time ranges -- searching six months of logs is slower than Elasticsearch. For audit trails that are rarely queried, Loki works well. For frequently searched compliance logs, Elasticsearch is stronger.

How do I migrate from ELK to Loki?

Run both systems in parallel. Configure your log shipper (Filebeat or Fluent Bit) to send logs to both Elasticsearch and Loki simultaneously. Move teams to Grafana dashboards one at a time, validating that LogQL queries produce equivalent results. Once all teams are on Loki, decommission Elasticsearch. Expect the migration to take 2-4 weeks for a medium-sized organization.
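In Fluent Bit, dual-shipping is just two [OUTPUT] sections matching the same stream. A sketch -- hostnames, index name, and labels are assumptions for illustration:

```ini
# fluent-bit.conf sketch -- ship the same stream to both backends during migration
[OUTPUT]
    Name   es
    Match  kube.*
    Host   elasticsearch
    Port   9200
    Index  logs-production

[OUTPUT]
    Name   loki
    Match  kube.*
    Host   loki
    Port   3100
    Labels job=fluent-bit
```

Because both outputs read from the same buffered stream, adding the second backend doesn't require touching application code.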

What is the best log shipper for Kubernetes?

For Loki, use Grafana Alloy (the successor to Promtail). For Elasticsearch, use Filebeat or Fluent Bit. For CloudWatch, use Fluent Bit with the CloudWatch output plugin. Fluent Bit is the most versatile -- it supports all three backends and has a small resource footprint. If you're undecided on your backend, starting with Fluent Bit gives you the most flexibility.

How much log volume is normal for a production application?

A typical microservice generates 1-10 GB of logs per day at moderate traffic. The variation comes from log level configuration -- a service logging every request at INFO produces 10x the volume of one logging only WARN and above. Total log volume for a 50-service Kubernetes cluster usually falls between 50-500 GB per day. If you're above this, review your log levels and consider sampling.

Should I parse logs at ingestion or query time?

Parse at ingestion for fields you query frequently -- severity level, service name, HTTP status code. This creates structured fields that are fast to filter on. Parse at query time for ad-hoc analysis -- extracting a specific field from a JSON blob you rarely search. ELK parses everything at ingestion (via Logstash). Loki parses at query time by default. The hybrid approach -- parse the most important fields at ingestion, leave the rest for query time -- is usually the best tradeoff.

How do I reduce log storage costs?

Five strategies that make the biggest impact: tier retention by log level (ERROR 90 days, DEBUG 7 days), drop or sample high-volume low-value logs at the shipper, compress aggressively (Loki and Elasticsearch both support this), use object storage tiers (S3 Infrequent Access for older logs), and enforce structured logging so you can filter precisely instead of storing everything "just in case."
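In Loki, the tiered-retention strategy maps to per-stream overrides in limits_config, enforced by the compactor. A sketch, assuming your streams carry a low-cardinality level label:

```yaml
# Loki retention sketch: compactor deletes, limits_config sets the periods
compactor:
  retention_enabled: true

limits_config:
  retention_period: 720h   # default: 30 days (covers INFO)
  retention_stream:
    - selector: '{level="error"}'
      priority: 2
      period: 2160h        # errors kept 90 days
    - selector: '{level="debug"}'
      priority: 1
      period: 168h         # debug dropped after 7 days
```

Elasticsearch achieves the same effect with ILM policies per index, and CloudWatch with per-log-group retention settings.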

Conclusion

There's no universally correct answer. ELK gives you the most powerful search but demands the most operational investment. Loki gives you the best cost-to-value ratio for teams already in the Grafana ecosystem. CloudWatch gives you zero-ops convenience at a premium price on AWS.

Whatever you choose, get structured JSON logging right first. A clean, consistent log format makes every backend work better. Add trace IDs from day one. Set retention policies before you have a cost problem, not after. And remember -- the goal isn't to store every log forever. It's to have the right logs available when something breaks at 3 AM.


Written by

Abhishek Patel

Infrastructure engineer with 10+ years building production systems on AWS, GCP, and bare metal. Writes practical guides on cloud architecture, containers, networking, and Linux for developers who want to understand how things actually work under the hood.
