FinOps for Engineers: Where Cloud Bills Hide Money (2026)

The Bills That Actually Grow Without Anyone Noticing

Most cloud bills don't grow because of "the architecture." They grow because of small operational hygiene gaps that compound month over month: orphaned EBS volumes, NAT Gateway data charges, idle RDS replicas, S3 buckets without lifecycle policies, CloudWatch logs at 90-day retention that should be 14, dev-environment ALBs nobody turned off. Engineers can fix every item in this article without management theater, executive buy-in, or a FinOps consultant. This is the practitioner playbook for finding and eliminating the cloud waste that's hidden in plain sight.

Last updated: April 2026 — verified AWS Compute Optimizer, Trusted Advisor, Infracost, and AWS pricing for the line items below.

Where Cloud Bills Actually Hide Money

1. Orphaned EBS Volumes (the #1 most common waste)

When you terminate an EC2 instance, an attached EBS volume is deleted only if its DeleteOnTermination flag is set. Root volumes default to true, but additional data volumes default to false, so every "I'll just terminate this" leaves a 100 GB gp3 volume behind at $8/month. Multiplied by years of churn across hundreds of instances, this is often $500-3,000/month in pure waste.

How to find them:

aws ec2 describe-volumes \
  --filters Name=status,Values=available \
  --query 'Volumes[*].[VolumeId,Size,CreateTime,Tags]' \
  --output table

Any volume in available state is unattached. For each, check creation date and tags — anything older than 30 days with no DoNotDelete tag is almost certainly safe to delete after a final snapshot.
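The age-and-tag rule above can be sketched as a small filter over the JSON that the command returns with --output json. The find_orphans helper, the 30-day default, and the reading of DoNotDelete as a tag key are illustrative assumptions, not part of any AWS tooling; the sketch also assumes ISO 8601 timestamps in CreateTime.

```python
import json
from datetime import datetime, timedelta, timezone


def find_orphans(describe_volumes_json, now=None, max_age_days=30):
    """Return IDs of unattached volumes older than max_age_days with no DoNotDelete tag.

    Assumes input shaped like `aws ec2 describe-volumes --output json`.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    orphans = []
    for vol in json.loads(describe_volumes_json)["Volumes"]:
        if vol.get("State") != "available":
            continue  # attached, or mid-creation/deletion
        tag_keys = {tag["Key"] for tag in vol.get("Tags", [])}
        if "DoNotDelete" in tag_keys:
            continue  # explicitly protected (tag key is an assumption)
        created = datetime.fromisoformat(vol["CreateTime"].replace("Z", "+00:00"))
        if created < cutoff:
            orphans.append(vol["VolumeId"])
    return orphans
```

The function only reports candidates. Snapshotting and deleting stay a deliberate manual step.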

2. NAT Gateway Charges (the data-egress trap)

NAT Gateways charge $0.045/hour for the gateway itself (~$33/month per AZ) AND $0.045 per GB processed. The processing charge is what surprises teams: a service pulling data from S3 through a NAT Gateway pays $0.045/GB for traffic that a gateway endpoint would carry for free, and a cross-region pull adds inter-region transfer charges on top. For high-throughput workloads, this is often $500-5,000/month.

The fix:

VPC endpoints for AWS services: S3, DynamoDB, ECR, Secrets Manager. These replace NAT-routed traffic with direct VPC endpoint traffic: free for gateway endpoints (S3, DynamoDB) and ~$0.01/GB plus a small hourly fee per endpoint per AZ for interface endpoints (ECR, Secrets Manager), vs $0.045/GB through NAT.
NAT Instance instead of NAT Gateway for low-traffic VPCs: A t4g.nano NAT instance ~$3/month vs $33/month for the gateway, plus no per-GB charge (just NAT-instance bandwidth). Worth it for dev/staging VPCs.
Audit cross-AZ NAT traffic: One NAT Gateway per AZ vs one shared NAT — sharing across AZs incurs cross-AZ data charges that can be hidden in the NAT bill.
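To size the VPC-endpoint fix before touching any routing, a rough estimate from the per-GB rates quoted above is usually enough. This is a minimal sketch: the monthly_savings helper and the traffic figures in the usage note are hypothetical, and it ignores the hourly fee on interface endpoints and any cross-AZ charges.

```python
NAT_PER_GB = 0.045        # NAT Gateway data processing rate quoted above
INTERFACE_PER_GB = 0.01   # interface endpoint data processing rate quoted above
GATEWAY_PER_GB = 0.0      # gateway endpoints (S3, DynamoDB) carry no per-GB charge

GATEWAY_SERVICES = {"s3", "dynamodb"}


def monthly_savings(gb_by_service):
    """Estimate monthly savings from moving NAT-routed AWS traffic to VPC endpoints.

    gb_by_service maps a lowercase service name to GB/month currently sent through NAT.
    """
    savings = 0.0
    for service, gb in gb_by_service.items():
        endpoint_rate = GATEWAY_PER_GB if service in GATEWAY_SERVICES else INTERFACE_PER_GB
        savings += gb * (NAT_PER_GB - endpoint_rate)
    return round(savings, 2)
```

For example, monthly_savings({"s3": 10000, "ecr": 2000}) estimates what 10 TB of S3 traffic and 2 TB of ECR pulls would save per month once they stop flowing through NAT.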

3. Idle RDS Read Replicas

RDS read replicas exist for two reasons: read scaling and DR. Read-scaling demand in particular is often seasonal: a Black Friday read replica might only need to run November-January. The pattern: someone provisions a replica for a launch, the launch ends, the replica keeps running. A db.r6g.xlarge replica is ~$200/month; teams routinely have 3-5 of these idling at any given time. Cost: $600-1,000/month.

How to find them: Check the CloudWatch DatabaseConnections metric. Any read replica averaging fewer than 5 connections over 30 days is suspect. Then check the application code and configuration for the replica's endpoint; if nothing routes to it, delete the replica.
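The under-5-connections rule is easy to apply once the metric is exported. A minimal sketch, assuming you have already pulled one daily average of DatabaseConnections per replica (for example from CloudWatch); the idle_replicas helper and its input shape are hypothetical.

```python
from statistics import mean


def idle_replicas(daily_avg_connections, threshold=5.0, min_days=30):
    """Return replica IDs whose mean daily connection count is under the threshold.

    daily_avg_connections maps replica ID -> list of daily averages, oldest first.
    Replicas with fewer than min_days of history are skipped rather than flagged.
    """
    suspects = []
    for replica_id, samples in daily_avg_connections.items():
        if len(samples) < min_days:
            continue  # not enough history to judge
        if mean(samples[-min_days:]) < threshold:
            suspects.append(replica_id)
    return sorted(suspects)
```

A flagged replica is only a suspect: a DR replica can legitimately sit at zero connections, so the endpoint check still applies.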

4. S3 Buckets Without Lifecycle Policies

S3 Standard at $0.023/GB/month is fine for hot data; for objects accessed once and forgotten, you should be on Glacier Instant Retrieval (~$0.004/GB) or Glacier Deep Archive (~$0.00099/GB). The default is Standard, and almost nobody sets lifecycle policies on legacy buckets. A 5 TB bucket of "old logs we'll never look at" costs $115/month on Standard vs $5/month on Deep Archive.

The fix: Every bucket should have a lifecycle policy. Bare minimum:

{
  "Rules": [{
    "ID": "ArchiveOldObjects",
    "Status": "Enabled",
    "Filter": { "Prefix": "" },
    "Transitions": [{
      "Days": 30, "StorageClass": "STANDARD_IA"
    }, {
      "Days": 90, "StorageClass": "GLACIER_IR"
    }, {
      "Days": 180, "StorageClass": "DEEP_ARCHIVE"
    }],
    "NoncurrentVersionExpiration": { "NoncurrentDays": 30 }
  }]
}

For high-volume log buckets, also set AbortIncompleteMultipartUpload with 1-day cleanup — incomplete uploads sit forever otherwise.
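The multipart cleanup rule can be generated rather than hand-written. A minimal sketch that builds the rule in the shape the S3 lifecycle API expects; the function name and rule ID are illustrative assumptions.

```python
import json


def log_bucket_lifecycle(abort_after_days=1):
    """Build a lifecycle configuration that aborts stale multipart uploads."""
    return {
        "Rules": [{
            "ID": "AbortStaleMultipartUploads",   # illustrative rule name
            "Status": "Enabled",
            "Filter": {"Prefix": ""},             # empty prefix: whole bucket
            "AbortIncompleteMultipartUpload": {
                "DaysAfterInitiation": abort_after_days
            },
        }]
    }


if __name__ == "__main__":
    print(json.dumps(log_bucket_lifecycle(), indent=2))
```

Save the output to a file and apply it with aws s3api put-bucket-lifecycle-configuration. That call replaces the bucket's whole lifecycle configuration, so merge this rule into any existing rules first.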

5. CloudWatch Logs at 90-Day Retention

The default CloudWatch Logs retention is "Never expire." Most teams change it to 90 days and consider the job done. For most logs, 14 days is plenty (debugging recent incidents) and 30 days is enough for SLO calculations. CloudWatch Logs storage is $0.03/GB-month — for a chatty service generating 100 GB/month, 90-day retention is $9 vs 14-day at $1.40. Across hundreds of log groups, this is $200-2,000/month.

The fix: Audit log groups, set retention to 14 days for application logs, 30 days for security audit logs, 1 year for compliance-required logs. For long-term retention, archive to S3 and use Athena for queries — meaningfully cheaper.
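The retention arithmetic is worth making explicit, because the saving comes from how much data sits in storage at steady state. A minimal sketch using the $0.03/GB-month storage rate quoted above; the helper name is hypothetical, and it ignores ingestion charges and compression.

```python
def steady_state_storage_cost(ingest_gb_per_month, retention_days, price_per_gb_month=0.03):
    """Monthly CloudWatch Logs storage cost once the retention window is full.

    Stored volume is approximated as (monthly ingest / 30) * retention days.
    """
    stored_gb = ingest_gb_per_month * retention_days / 30
    return round(stored_gb * price_per_gb_month, 2)
```

For the 100 GB/month service above, compare steady_state_storage_cost(100, 90) with steady_state_storage_cost(100, 14), then sum across log groups to size the total saving.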

6. Dev-Environment Load Balancers Nobody Turned Off

Each ALB / NLB has a minimum monthly charge (~$16/month) plus per-LCU usage. Dev environments often have ALBs that ran for 5 minutes during a feature test and then sat idle for the rest of the year. A team with 30 microservices each having a dev ALB is ~$480/month for ALBs nobody uses.

The fix: Time-bound dev environments. Spin up on demand, tear down at end of day. Tools such as Karpenter (for compute) and Terraform Cloud workspaces (for environment management) help automate this. For environments that must persist, share an ALB with host-based routing instead of one ALB per service.

7. Lambda Functions With Wildly Wrong Memory

Lambda pricing scales with memory (you pay for memory × duration). A function with 3 GB memory that only uses 256 MB is wasting 90%+ of its bill. The catch: Lambda's CPU also scales with memory, so increasing memory can decrease duration enough to net-decrease cost. AWS Lambda Power Tuning finds the optimal point.

The fix: For each high-volume Lambda, run Power Tuning. Right-size memory based on actual usage. Most teams find 30-50% savings on Lambda spend with one weekend of tuning.
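The memory-versus-duration trade-off is easier to reason about with the billing formula written out. A minimal sketch, assuming the x86 on-demand rate of $0.0000166667 per GB-second and ignoring per-request charges; the durations in the usage note are hypothetical, not measurements.

```python
PRICE_PER_GB_SECOND = 0.0000166667  # assumed x86 on-demand rate; check your region


def monthly_compute_cost(memory_mb, avg_duration_ms, invocations):
    """Lambda compute cost: configured memory (GB) x billed duration (s) x invocations."""
    gb_seconds = (memory_mb / 1024) * (avg_duration_ms / 1000) * invocations
    return round(gb_seconds * PRICE_PER_GB_SECOND, 2)
```

If doubling memory from 512 MB to 1024 MB cut average duration from 800 ms to 350 ms, monthly_compute_cost(1024, 350, 10_000_000) would come out lower than monthly_compute_cost(512, 800, 10_000_000). Power Tuning finds that kind of crossover empirically.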

8. Reserved Instances / Savings Plans Nobody Bought

For predictable baseline workloads (RDS, EC2 instances that run 24/7, ElastiCache), Savings Plans / Reserved Instances offer 30-40% discount vs on-demand. Most teams under-purchase or never purchase. Compute Savings Plans are flexible enough that the "what if usage drops" risk is minimal: they apply to EC2 in any instance family and any region, and also to Fargate and Lambda. RDS and ElastiCache are not covered by Compute Savings Plans and need their own reserved or database-specific commitments.

The fix: Audit AWS Cost Explorer's "Savings Plans" recommendations. Buy 1-year Compute Savings Plans for the bottom 80% of your sustained spend (the floor that's been steady for 3+ months). Annual savings: 25-35% on covered spend.
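One possible reading of the 80%-of-the-floor rule, as a sketch: take the lowest daily on-demand compute spend over the trailing window as the floor, cover 80% of it, and convert to the hourly commitment a Savings Plan is purchased in. The helper name, the 90-day window, and the 30% assumed discount are all assumptions. The commitment is expressed at discounted rates, which is why the discount appears in the formula.

```python
def suggested_hourly_commitment(daily_on_demand_spend, coverage=0.8,
                                assumed_discount=0.30, window_days=90):
    """Rough hourly Savings Plan commitment from daily on-demand compute spend.

    daily_on_demand_spend is a list of daily dollar totals, oldest first.
    """
    if len(daily_on_demand_spend) < window_days:
        raise ValueError("need at least window_days of spend history")
    floor_per_day = min(daily_on_demand_spend[-window_days:])
    covered_on_demand_per_hour = floor_per_day * coverage / 24
    # The commitment is paid at Savings Plan rates, i.e. after the discount.
    return round(covered_on_demand_per_hour * (1 - assumed_discount), 2)
```

Treat the result as a sanity check against the Cost Explorer recommendation, not a replacement for it.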

The Tagging Discipline That Survives Org Changes

Cost-allocation tags are the foundation of every FinOps action. Without tags, you can't answer "how much does Service X cost?" let alone "how much does Customer Y cost us?" Most orgs start with overly-elaborate tagging schemes and then drift; the simplest scheme that works:

Tag	Required?	Values	Purpose
Environment	Yes	prod / staging / dev / test	Filter dev/test from prod cost reporting
Service	Yes	service-name (kebab-case)	Cost-per-service attribution
Team	Yes	team-name	Cost-per-team for showback
CostCenter	Conditional	cost-center-id	Finance accounting (if you do chargeback)
ManagedBy	Recommended	terraform / cdk / manual	Identify untracked resources

Hard-enforce these via AWS Config / Service Control Policies. Untagged resources should be flagged daily and auto-deleted after 7-14 days for non-production environments. Once tagging is enforced, every cost report becomes meaningful.
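The daily flagging step amounts to a small validation pass over each resource's tags. A minimal sketch of that check for the three required tags in the table; the tag_violations helper and its message format are hypothetical, and fetching the tags is left out.

```python
REQUIRED_TAGS = ("Environment", "Service", "Team")
ALLOWED_ENVIRONMENTS = {"prod", "staging", "dev", "test"}


def tag_violations(tags):
    """Return a list of problems for one resource's tags (empty list means compliant).

    tags is a plain dict of tag key -> value.
    """
    problems = [f"missing tag: {key}" for key in REQUIRED_TAGS if not tags.get(key)]
    env = tags.get("Environment")
    if env and env not in ALLOWED_ENVIRONMENTS:
        problems.append(f"invalid Environment: {env}")
    return problems
```

Running this over a resource inventory gives the daily "flag" list; whether to auto-delete from it is a policy decision best kept to non-production accounts.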

The Rightsizing Workflow

AWS Compute Optimizer + Trusted Advisor + manual review, run weekly:

AWS Compute Optimizer: Identifies over-provisioned EC2, EBS, Lambda, RDS instances. Flags "your r5.2xlarge has been at 8% CPU for 14 days, recommend r5.large." Free, ships actionable recommendations.
AWS Trusted Advisor: Lists low-utilization EC2, idle load balancers, underutilized EBS, unused EIPs. The AWS Business Support tier (from ~$100/month, scaling with your AWS spend) unlocks the full Trusted Advisor checks; without Business Support, you see a subset.
Manual review: Some workloads have spikes that auto-tools miss. Review the recommendations weekly, accept the safe ones, defer the risky ones for explicit testing.
Track savings: After each rightsizing, note expected vs actual savings the next month. Builds trust in the workflow and surfaces edge cases where recommendations are wrong.

Infracost in CI: Catching Cost Regressions in PRs

The highest-leverage move that almost nobody makes: integrate Infracost into your CI to show the estimated cost diff on every Terraform PR. The PR comment becomes:

Infracost diff:
+ Monthly cost will increase by $238 (+47%)

  ~ aws_db_instance.replica
    ~ instance_class: db.r5.large -> db.r5.xlarge
      +$180/mo

  + aws_nat_gateway.private_egress
    +$33/mo + data processing charges

  ~ aws_ecs_service.api
    ~ desired_count: 4 -> 8
      +$25/mo

Reviewers see cost impact alongside code changes. Most cost regressions get caught at PR review instead of next month's bill review. Setup is a single GitHub Action, free for the first 1000 runs/month, low overhead.

The 30-Minute Weekly Bill Review

Set a weekly recurring 30-min meeting (or solo time block) for "look at the bill." Format:

Top 10 cost drivers: Pull "service" cost breakdown from Cost Explorer. Note any service whose cost grew over 10% week-over-week.
Anomaly detection: AWS Cost Anomaly Detection flags unusual spend; review the week's anomalies and explain or fix each.
Open recommendations: Compute Optimizer / Trusted Advisor pending recommendations. Approve safe ones.
Outliers: Top 5 most-expensive untagged or low-utilization resources. Either tag and explain or kill.
Track trend: Total monthly forecast vs last month's. If it's growing faster than headcount/revenue, dig in.

Most teams skip this and discover cost issues at the quarterly finance review, by which time the cost has compounded for months. Weekly reviews catch issues 4-12x faster.
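The "grew over 10% week-over-week" check from the review format is mechanical enough to script. A minimal sketch, assuming you have exported per-service totals for two consecutive weeks from Cost Explorer; the fast_growers helper and the dict shape are hypothetical.

```python
def fast_growers(last_week, this_week, threshold=0.10):
    """Return (service, growth) pairs whose cost grew more than threshold week-over-week.

    Both arguments map service name -> weekly cost in dollars. A service with
    no spend last week is reported with growth None (new spend).
    """
    flagged = []
    for service, cost in this_week.items():
        previous = last_week.get(service, 0.0)
        if previous == 0:
            if cost > 0:
                flagged.append((service, None))  # new line item this week
            continue
        growth = (cost - previous) / previous
        if growth > threshold:
            flagged.append((service, round(growth, 3)))
    return sorted(flagged, key=lambda item: item[0])
```

The output is the agenda for the first review item: each flagged service gets either an explanation or a fix.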

Where to Look for the Hidden Stuff

Service	Common waste pattern	How to find
EBS	Orphaned volumes, oversized gp2 (use gp3)	`aws ec2 describe-volumes` + Compute Optimizer
EC2	Idle, oversized, on-demand instead of Spot/RIs	Compute Optimizer, Trusted Advisor
RDS	Idle replicas, over-provisioned compute, default storage allocation	CloudWatch DatabaseConnections
S3	No lifecycle, no Intelligent-Tiering, incomplete uploads	S3 Storage Lens
Lambda	Memory misconfigured, idle provisioned concurrency	Lambda Power Tuning
NAT Gateway	Cross-AZ traffic, no VPC endpoints for AWS services	VPC Flow Logs aggregation
CloudWatch Logs	"Never expire" retention, untagged log groups	`aws logs describe-log-groups`
ALB / NLB	Idle dev-env load balancers	CloudWatch RequestCount metric
ECR	Old image tags accumulating	ECR lifecycle policies
Elastic IPs	Allocated but not attached (public IPv4 addresses are billed hourly, attached or not)	`aws ec2 describe-addresses`

Frequently Asked Questions

What is the biggest source of cloud cost waste?

Across most orgs, the top 5 are: orphaned EBS volumes (15-30% of waste), over-provisioned compute (20-35%), idle databases / replicas (10-15%), CloudWatch logs at default retention (5-10%), and NAT Gateway data processing (5-15%). The pattern is operational hygiene, not architecture — engineers can fix all of these without organizational buy-in.

How do I find orphaned EBS volumes?

Run aws ec2 describe-volumes --filters Name=status,Values=available. Any volume in "available" state is unattached. For each, check creation date and tags. Anything older than 30 days with no DoNotDelete tag is almost certainly waste — take a final snapshot for safety, then delete. Many teams find $500-3,000/month in pure orphan-EBS waste.

What's the best CloudWatch Logs retention setting?

14 days for application logs, 30 days for security audit logs, 1 year for compliance-required logs (SOC 2, HIPAA), and never the "Never expire" default. For long-term retention, archive to S3 with lifecycle policies and use Athena for queries — much cheaper than keeping logs in CloudWatch. Audit your log groups and update retention; this commonly saves $200-2,000/month.

When should I buy AWS Savings Plans?

For your sustained baseline spend — the bottom 80% of usage that's been steady for 3+ months. 1-year Compute Savings Plans give 25-35% savings on covered spend with high flexibility (apply to any compute family/region). Avoid 3-year plans unless usage is highly predictable; the inflexibility usually doesn't justify the marginal extra savings. Cost Explorer's Savings Plans recommendations are usually accurate.

How do I integrate Infracost into CI?

Add the Infracost GitHub Action to your Terraform repo. It runs terraform plan, calculates the cost diff vs main branch, and posts a comment on the PR showing the monthly cost change line by line. Free for the first 1000 runs/month. Setup is roughly 30 minutes for a single repo. The leverage is enormous — most cost regressions get caught at PR review rather than at next month's bill review.

What tags should I require on cloud resources?

Five tags are sufficient: Environment (prod/staging/dev/test), Service (service-name), Team (team-name), CostCenter (if you do chargeback), and ManagedBy (terraform/cdk/manual). Hard-enforce them via AWS Config rules or Service Control Policies; untagged resources should be flagged daily and, in non-production environments, auto-deleted after 7-14 days. Once tagging is enforced, every cost report becomes meaningful.

Bottom Line

Cloud cost optimization for engineers isn't strategic — it's operational hygiene applied weekly. Run Compute Optimizer and Trusted Advisor weekly. Set lifecycle policies on every S3 bucket. Audit CloudWatch Logs retention. Hunt orphaned EBS volumes monthly. Tag rigorously and enforce. Run Infracost in CI. Buy Savings Plans for steady baseline. Each item is small; together they typically save 20-40% of an unmanaged AWS bill in the first quarter without changing any architecture. For broader strategic cost-optimization work see cloud cost optimization; for K8s-specific tooling see Kubernetes cost visibility.
