Practical AWS Cost Optimization for Large-Scale B2C Services
Summary
| Aspect | Details |
|---|---|
| Challenge | Costs became a black box as traffic grew. Over-provisioning for spike tolerance and abandoned, unnecessary resources were impacting business profits. |
| Actions | Optimized through architectural reviews, such as introducing SQS for load smoothing (separating synchronous/asynchronous paths) and reducing S3 requests by aggregating logs with Fluent Bit. |
| Operations | Established automated anomaly detection with AWS Budgets and resource inventory management using an EndDate tag. Fostered a culture of considering cost efficiency from the initial design phase. |
| Results & Outcomes | Reduced monthly AWS costs by 30%. Achieved infrastructure cost transparency and built a scalable, low-cost foundation aligned with business growth. |
Challenges & Background
As our B2C service grew, we faced the challenge of rising infrastructure costs that began to strain our business profits, despite the welcome increase in traffic. At the time, we had three major problems.
Discrepancy Between B2C Service Growth and Costs
Infrastructure costs swelled in proportion to the increase in user numbers, significantly exceeding the cost projections in our initial business plan. In particular, to handle access spikes during campaigns, we were forced to maintain resources provisioned for peak times, leading to a chronically inefficient cost structure.
Lack of Management Transparency
We were using AWS Organizations to run multiple projects, but we were unable to immediately identify which services or features were the main drivers of cost increases. Cost management was purely reactive, consisting of "checking the bill at the end of the month." Our operations were always a step behind, starting investigations only after an unexpectedly high bill arrived.
Abandoned Resources
In the rush to prioritize development speed, we often found forgotten test environments and resources for completed campaigns left running. These so-called "rogue accounts" and "rogue resources" added up, driving up overall costs.
Action Plan
To address these challenges, we decided to create a "system" for continuous cost optimization, rather than performing one-off reduction tasks.
Automated Detection Through "Mechanisms"
Instead of dedicating human resources to monitor dashboards daily, we aimed to establish a system for automatically detecting anomalies using AWS Budgets. Our goal was to be able to initiate a first response within 24 hours of an anomaly by setting up alerts not only for budget progress but also for sharp day-over-day fluctuations.
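For illustration, here is a minimal boto3 sketch of the kind of budget alert this mechanism relies on. The account ID, budget name, limit, and SNS topic ARN are all hypothetical placeholders, and the day-over-day fluctuation check mentioned above would live in a separate scheduled job (for example, one that queries Cost Explorer), which is not shown here.

```python
import boto3

budgets = boto3.client("budgets")

# Hypothetical values: replace with your own account, budget size, and alert topic.
ACCOUNT_ID = "123456789012"
ALERT_TOPIC_ARN = "arn:aws:sns:ap-northeast-1:123456789012:cost-alerts"

budgets.create_budget(
    AccountId=ACCOUNT_ID,
    Budget={
        "BudgetName": "monthly-service-cost",
        "BudgetLimit": {"Amount": "10000", "Unit": "USD"},
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
    },
    NotificationsWithSubscribers=[
        {
            # Fire when actual spend crosses 80% of the monthly budget.
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": 80.0,
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [{"SubscriptionType": "SNS", "Address": ALERT_TOPIC_ARN}],
        }
    ],
)
```

Routing that SNS topic into a chat channel (for example via AWS Chatbot or a small Lambda) is what turns a billing anomaly into something an on-call engineer can act on within a day.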
Architectural Optimization
We shifted from the fixed mindset of "provisioning resources for peak demand" to a configuration where "costs vary in response to demand." Specifically, we focused on improvements at the application architecture level, such as introducing queuing with SQS, which will be described later.
Enforcing Governance
To clarify "who is using what, for what purpose, and until when," we prioritized the creation of a tagging policy. In particular, we made the EndDate tag mandatory and established an operational flow to visualize and delete resources past their expiration date, thereby structurally preventing the abandonment of unnecessary resources.
Verification (Investigation & Measurement)
Before implementing cost reductions, we first conducted a detailed breakdown of our current cost structure to identify the areas with the greatest potential for savings.
Visualizing the Cost Structure
First, we analyzed the cost trends over the past six months using AWS Cost Explorer.
- Identification by service: We confirmed that the majority of our charges were concentrated in EC2 (including RDS), S3, DynamoDB, and Data Transfer.
- Utilizing AWS Resource Groups: Since tagging was insufficient, we used Resource Groups to provisionally group resources by project, visualizing which business units were driving costs.
- Consulting AWS Trusted Advisor: We identified "low utilization Amazon EC2 instances" and "idle DB instances" to create a list of resources that could be immediately stopped or downsized.
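As a reference point, the service-level breakdown above can be reproduced with a short Cost Explorer query; the date range below is illustrative, and the "top 5" cut-off is just a convenience for reading the output.

```python
import boto3

ce = boto3.client("ce")  # Cost Explorer

# Monthly unblended cost for a six-month window, grouped by service (dates are illustrative).
resp = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-01-01", "End": "2024-07-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
)

for month in resp["ResultsByTime"]:
    print(month["TimePeriod"]["Start"])
    groups = sorted(
        month["Groups"],
        key=lambda g: float(g["Metrics"]["UnblendedCost"]["Amount"]),
        reverse=True,
    )
    for g in groups[:5]:  # top 5 cost drivers per month
        amount = float(g["Metrics"]["UnblendedCost"]["Amount"])
        print(f"  {g['Keys'][0]}: {amount:,.2f} USD")
```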
Detailed Analysis: Logs and API Calls
We cross-referenced CloudWatch Logs ingestion volumes and S3 request charges to investigate resources that, while cheap per unit, were overwhelming in sheer volume.
- Abnormal log output volume: We discovered that a specific service was outputting a massive amount of DEBUG level logs even in the production environment, which was driving up CloudWatch Logs ingestion fees.
- S3 request counts: We identified buckets where the cost of PUT/LIST requests was higher than the cost of storage (GB). We confirmed application behavior that was frequently writing tiny files of a few KB.
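A rough sketch of how the log-ingestion side of this check can be scripted is shown below; the one-week window and the "10 GB per week" threshold are arbitrary assumptions used only to surface suspiciously chatty log groups.

```python
import boto3
from datetime import datetime, timedelta, timezone

logs = boto3.client("logs")
cloudwatch = boto3.client("cloudwatch")

end = datetime.now(timezone.utc)
start = end - timedelta(days=7)  # look at the last week of ingestion

paginator = logs.get_paginator("describe_log_groups")
for page in paginator.paginate():
    for group in page["logGroups"]:
        name = group["logGroupName"]
        # Daily sum of IncomingBytes for this log group.
        stats = cloudwatch.get_metric_statistics(
            Namespace="AWS/Logs",
            MetricName="IncomingBytes",
            Dimensions=[{"Name": "LogGroupName", "Value": name}],
            StartTime=start,
            EndTime=end,
            Period=86400,
            Statistics=["Sum"],
        )
        total_gb = sum(p["Sum"] for p in stats["Datapoints"]) / (1024 ** 3)
        if total_gb > 10:  # arbitrary threshold for "suspiciously chatty" groups
            print(f"{name}: {total_gb:.1f} GB ingested in 7 days")
```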
Analyzing DynamoDB Access Patterns
For DynamoDB, we sampled CloudWatch metrics and the "Storage details" of each table.
- Skewed access frequency: We found that only a few percent of the total data had been accessed within the last 30 days, with the vast majority being "archive data that needs to be retained but is not referenced."
- Verifying item sizes: We identified cases where binary data (Blobs) of several megabytes were stored directly as attribute values, wastefully consuming Read/Write Capacity Units (RCU/WCU).
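The table-level part of this sampling can be done from metadata alone, roughly as sketched below. Note that DynamoDB refreshes TableSizeBytes and ItemCount only about every six hours, and the size thresholds here are arbitrary examples rather than the criteria we actually used.

```python
import boto3

dynamodb = boto3.client("dynamodb")

paginator = dynamodb.get_paginator("list_tables")
for page in paginator.paginate():
    for table_name in page["TableNames"]:
        table = dynamodb.describe_table(TableName=table_name)["Table"]
        size_gb = table["TableSizeBytes"] / (1024 ** 3)
        item_count = table["ItemCount"] or 1  # guard against empty tables
        avg_item_kb = table["TableSizeBytes"] / item_count / 1024
        table_class = table.get("TableClassSummary", {}).get("TableClass", "STANDARD")

        # Flag large tables still on the Standard class and unusually fat items.
        if size_gb > 50 and table_class == "STANDARD":
            print(f"{table_name}: {size_gb:.0f} GB on STANDARD, candidate for Standard-IA")
        if avg_item_kb > 100:
            print(f"{table_name}: avg item ~{avg_item_kb:.0f} KB, consider offloading blobs to S3")
```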
Implementation Details
Based on the cost drivers identified in the verification phase, we implemented specific optimizations, including architectural changes.
Compute: Spike Mitigation and Resource Rightsizing
To handle sudden traffic spikes during campaigns, we changed our architecture to smooth out the load on the database (RDS) rather than sizing it for peak demand.
- Introducing SQS for queuing: For non-critical write operations that could tolerate a few seconds to a few minutes of processing delay (e.g., access logs, activity histories), we implemented buffering with SQS. By "offloading" the peak load to be processed later, we eliminated the need to size our RDS instance for the absolute peak, allowing us to downsize to a cheaper instance.
Diagrams: Before (design based on peak load) and After (optimization through load smoothing). A minimal code sketch of this pattern follows this list.
- Rightsizing log levels: To reduce CloudWatch Logs costs, we changed the log level in the production environment from DEBUG to INFO. We also strengthened filtering to retain only necessary audit and error logs.
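Here is a minimal sketch of the SQS offloading pattern from the first bullet: the web tier enqueues non-critical writes, and a worker drains the queue at a steady rate. The queue URL is hypothetical, and write_to_rds stands in for whatever persistence function the worker actually calls.

```python
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.ap-northeast-1.amazonaws.com/123456789012/activity-log-queue"  # hypothetical

def record_activity(user_id: str, action: str) -> None:
    """Web tier: enqueue the write instead of hitting RDS synchronously."""
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"user_id": user_id, "action": action}),
    )

def drain_queue(write_to_rds) -> None:
    """Worker: consume at a steady rate sized for average load, not peak."""
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
        )
        for msg in resp.get("Messages", []):
            write_to_rds(json.loads(msg["Body"]))  # caller-supplied persistence function
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```

Because the worker's throughput, rather than the front end's peak, now determines the write rate hitting RDS, the instance can be sized for average load.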
Storage & DB: Optimizing Placement Based on Data Characteristics
We optimized storage locations and classes by focusing on the "freshness" and "size" of the data.
- DynamoDB cost optimization:
  - For tables holding infrequently accessed historical data, we migrated the table class to Standard-IA (Infrequent Access). This reduced storage costs by about 60%.
  - We changed the design to offload large blob data (several MB) to Amazon S3 instead of storing it directly in DynamoDB. By storing only the S3 object key (path information) in DynamoDB, we dramatically reduced WCU/RCU consumption.
- S3 lifecycle management:
  - Sidecar buffering with Fluent Bit: Instead of having the application write logs directly to S3, we configured a Fluent Bit sidecar to receive the logs, aggregate them in memory into MB-sized chunks, and then upload them. This drastically reduced the number of PUT requests to S3.
  - Full adoption of Intelligent-Tiering: For stored data, we enabled S3 Intelligent-Tiering, delegating class transitions based on access frequency to AWS. This optimized the balance between retrieval cost and storage cost without adding management overhead.
  - Automated transition to Glacier: For logs requiring long-term retention for compliance, we defined a lifecycle policy: objects older than 60 days are automatically moved to Glacier Instant Retrieval, minimizing the storage unit price while maintaining retrieval speed.
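The Intelligent-Tiering and Glacier transitions above boil down to a single lifecycle configuration. A minimal sketch follows; the bucket name, prefixes, and day counts are illustrative, with the 60-day Glacier Instant Retrieval transition mirroring the policy described above.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and prefixes; adjust rules to your own data layout.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-service-logs",
    LifecycleConfiguration={
        "Rules": [
            {
                # General data: hand class management over to Intelligent-Tiering right away.
                "ID": "intelligent-tiering-for-data",
                "Filter": {"Prefix": "data/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 0, "StorageClass": "INTELLIGENT_TIERING"}],
            },
            {
                # Compliance logs: move to Glacier Instant Retrieval after 60 days.
                "ID": "logs-to-glacier-ir",
                "Filter": {"Prefix": "logs/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 60, "StorageClass": "GLACIER_IR"}],
            },
        ]
    },
)
```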
Contracts & Operations: Applying Governance and Discount Options
In parallel with technical changes, we introduced operational rules to prevent the creation of wasteful resources.
- Enforcing the EndDate tag: We made the EndDate tag mandatory for all resources in development and test environments. We established an operational flow to list expired resources weekly and delete those that are no longer needed.
- Account consolidation: We migrated scattered "rogue accounts" to a managed account under Organizations. By consolidating them into a single bill (Consolidated Billing), we enabled volume discounts for services like S3 to apply across the entire organization.
- Purchasing discount plans: For resources with a stable baseline usage, we applied Savings Plans and Reserved Instances (RI). This locked in cost savings compared to on-demand pricing.
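The weekly EndDate audit from the first bullet can be scripted against the Resource Groups Tagging API, roughly as sketched below. The assumption that tag values are in YYYY-MM-DD format is ours, and the script only reports candidates; actual deletion still goes through the weekly review flow.

```python
import boto3
from datetime import date

tagging = boto3.client("resourcegroupstaggingapi")
today = date.today()

paginator = tagging.get_paginator("get_resources")
# Only look at resources that carry an EndDate tag at all.
for page in paginator.paginate(TagFilters=[{"Key": "EndDate"}]):
    for resource in page["ResourceTagMappingList"]:
        tags = {t["Key"]: t["Value"] for t in resource["Tags"]}
        try:
            end_date = date.fromisoformat(tags["EndDate"])  # assumes YYYY-MM-DD values
        except ValueError:
            print(f"UNPARSEABLE EndDate on {resource['ResourceARN']}: {tags['EndDate']}")
            continue
        if end_date < today:
            # Report only; deletion is confirmed by a human in the weekly review.
            print(f"EXPIRED since {end_date}: {resource['ResourceARN']}")
```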
Quantitative Results
After implementing this series of measures, we observed the following changes in our infrastructure costs and operational metrics.
- Reduction in monthly AWS costs: Compared to before the measures, we reduced the entire service's monthly cost by approximately 30%. The elimination of oversized RDS specs for peak times and the review of DynamoDB table classes were major contributors.
- Optimization of storage unit price: The application of S3 Intelligent-Tiering and lifecycle policies improved the cost per gigabyte of storage. Specifically, for long-term log storage costs, we achieved about a 70% cost reduction through the automatic transition to Glacier.
- Earlier anomaly detection: By integrating AWS Budgets with Slack notifications, we established a system that allows us to detect and begin responding to unexpected cost increases (e.g., API loops, misconfigurations) within 24 hours of their occurrence.
- Reduction in request costs: By switching to "MB-sized batch uploads," the number of PUT requests to S3 decreased significantly. For our log storage buckets, we reduced the request-based charges, which had become more expensive than the storage (GB) charges, by 80%.
Qualitative Results
In addition to numerical savings, there were positive changes in the development team's operations.
- Establishing "cost as part of the design" awareness: A culture was fostered where the cost implications of a design are considered in the early stages of development, not just performance and availability. Specifically, developers began to proactively design systems that offload large payloads to S3 instead of writing them directly to a database (DynamoDB) and to implement load smoothing using asynchronous processing (SQS).
- Strengthened governance: Account consolidation and the application of tag policies have structurally prevented the emergence of "rogue resources." This also had the side benefit of enabling centralized security management.