Recovery from MySQL Failure and Stabilization of Write Performance

  • mysql
  • aurora
  • cloudwatch
  • aws
  • chartjs
  • typescript
  • python
Published on 2021/06/13

Summary

  • Issue: The production DB failed during business hours; retry traffic spiked and overall response degraded. Visualization was also insufficient, so the risk of recurrence could not be detected in advance.
  • Initial response: Set up Aurora automatic failover and read-replica distribution immediately, and established continuous monitoring of slow queries, CloudWatch metrics, and error logs.
  • Investigation: Correlated CloudWatch metrics with event logs and identified the batch process's DELETE → INSERT pattern as the source of memory and I/O pressure.
  • Action: Partitioned the table by month and reworked the batch to TRUNCATE + multi-row INSERT. Operationally, combined phased migration, instant rollback, and weekly reporting.
  • Results / Outcomes: Suppressed the risk of recurring unplanned failovers and stabilized batch time and load fluctuation, improving overall system availability and decision-making speed.

Incident overview

The production DB server terminated abnormally during the day (business hours). Responses across the entire application stopped for several minutes.

It was automatically restarted immediately afterward, but there was a temporary spike in load due to transaction retries.

Initial response (ensuring availability)

Introduction of read replicas

Distributed read traffic to Aurora read replicas to reduce load on the primary.
By configuring the system so that the application can fail over to read replicas in the event of a failure, the business impact of any recurrence was minimized.

Additional failover settings

Enabled Aurora’s automatic failover feature.
Configured automatic switchover to a read replica within several tens of seconds when the primary fails, minimizing system downtime.
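One way to verify this failover configuration is to read each cluster member's promotion tier via boto3 (Aurora promotes the healthy replica with the lowest tier first). A minimal sketch; the cluster identifier passed in is hypothetical, and the pure sorting helper can be reused offline:

```python
def promotion_order(members):
    """Order cluster members by PromotionTier; Aurora promotes the
    lowest-tier healthy replica first on failover."""
    return sorted(members,
                  key=lambda m: (m['PromotionTier'], m['DBInstanceIdentifier']))

def fetch_cluster_members(cluster_id):
    """Fetch member identifiers and tiers (requires AWS credentials;
    the cluster id is a placeholder)."""
    import boto3
    rds = boto3.client('rds')
    resp = rds.describe_db_instances(
        Filters=[{'Name': 'db-cluster-id', 'Values': [cluster_id]}]
    )
    return [{'DBInstanceIdentifier': i['DBInstanceIdentifier'],
             'PromotionTier': i['PromotionTier']}
            for i in resp['DBInstances']]
```

Calling promotion_order(fetch_cluster_members('my-aurora-cluster')) would list the expected failover order for a cluster named my-aurora-cluster (a placeholder name).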

Strengthening metrics and log collection

Continuously collected the following via CloudWatch Metrics:

  • CPU utilization, memory usage, I/O wait, number of connections, query latency
  • Aurora-specific metrics such as DatabaseConnections, Deadlocks, FreeableMemory, DiskQueueDepth

Additionally, enabled slow_query_log and performance_schema to allow bottleneck analysis at the SQL level.
This improved the accuracy of bottleneck identification for reproducing the failure and preventing recurrence.
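On Aurora MySQL, slow query logging is switched on through the instance's DB parameter group rather than at runtime. A hedged sketch of those parameter changes via boto3; the parameter group name and the one-second long_query_time threshold are assumptions, not values from the original setup:

```python
# Parameter changes that surface slow queries; the values are illustrative,
# not the thresholds used in the original incident response.
SLOW_QUERY_PARAMS = [
    {'ParameterName': 'slow_query_log', 'ParameterValue': '1',
     'ApplyMethod': 'immediate'},
    {'ParameterName': 'long_query_time', 'ParameterValue': '1',
     'ApplyMethod': 'immediate'},
    {'ParameterName': 'log_output', 'ParameterValue': 'FILE',
     'ApplyMethod': 'immediate'},
]

def enable_slow_query_log(parameter_group_name):
    """Apply the parameters to the instance-level DB parameter group
    (the group name is a placeholder; requires AWS credentials)."""
    import boto3
    rds = boto3.client('rds')
    rds.modify_db_parameter_group(
        DBParameterGroupName=parameter_group_name,
        Parameters=SLOW_QUERY_PARAMS,
    )
```

Note that performance_schema is a static parameter, so enabling it needs ApplyMethod 'pending-reboot' plus an instance restart.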

Investigation and measurement phase

Checking failover occurrences

  • Retrieved Aurora event logs via CloudWatch Logs and identified failover occurrence times for the past month.
  • Used a boto3 script to automatically scan log streams and detect failover timestamps.
  • Confirmed that occurrences were concentrated during business hours and that this was a reproducible load event.
import boto3
import datetime

logs = boto3.client('logs')

# Look back 30 days from now (the window analyzed for past failovers)
end = int(datetime.datetime.utcnow().timestamp())
start = end - 60 * 60 * 24 * 30

# Filter the error log group for the RDS failover event code
resp = logs.filter_log_events(
    logGroupName='/aws/rds/instance/my-aurora-instance/error',
    startTime=start * 1000,   # filter_log_events expects milliseconds
    endTime=end * 1000,
    filterPattern='RDS-EVENT-0169'
)
# For long windows, follow resp.get('nextToken') to page through results
for e in resp['events']:
    print(e['timestamp'], e['message'])

Analysis of resource metrics

  • Checked resource status immediately before failover using CloudWatch Metrics.
  • Observed a sharp drop in FreeableMemory and identified memory exhaustion as a possible trigger for failover.
  • At the same time, CPU utilization and I/O wait also increased, suggesting concentrated I/O load.
cw = boto3.client('cloudwatch')

# FreeableMemory over the 10 minutes leading up to the event,
# one datapoint per minute
metrics = cw.get_metric_statistics(
    Namespace='AWS/RDS',
    MetricName='FreeableMemory',
    Dimensions=[{'Name': 'DBInstanceIdentifier', 'Value': 'my-aurora-instance'}],
    StartTime=datetime.datetime.utcnow() - datetime.timedelta(minutes=10),
    EndTime=datetime.datetime.utcnow(),
    Period=60,
    Statistics=['Average', 'Minimum']
)
print(metrics['Datapoints'])

Collecting slow query logs and correlation analysis

  • Since Aurora automatically forwards slow_query_log to CloudWatch Logs, extracted slow queries from the same log group using boto3.
  • Analyzed queries issued in the few minutes immediately before failover, extracted execution time, table name, and DML type, and compared them with metrics.
  • Analysis confirmed that multiple DELETE and INSERT queries were being issued simultaneously against the same table.
  • In particular, a batch process that fully replaced advertising data at the beginning of the month was running during the relevant time window.
# The slowquery log group sits alongside the error log group;
# "Query_time:" marks the header of each slow-log entry
resp = logs.filter_log_events(
    logGroupName='/aws/rds/instance/my-aurora-instance/slowquery',
    startTime=start * 1000,
    endTime=end * 1000,
    filterPattern='"Query_time:"'
)
for e in resp['events']:
    print(e['timestamp'], e['message'])
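The extraction of execution time, table name, and DML type from each slow log entry can be sketched as follows; the sample entry mimics MySQL's slow log format, and real entries may carry additional fields:

```python
import re

# One slowquery event body, mimicking MySQL's slow log format; real
# entries may include extra fields such as timestamps and thread ids.
SAMPLE = (
    "# Query_time: 12.345678  Lock_time: 0.000123 "
    "Rows_sent: 0  Rows_examined: 1048576\n"
    "SET timestamp=1623541200;\n"
    "DELETE FROM ad_logs WHERE log_date >= '2021-06-01';"
)

def parse_slow_entry(message):
    """Pull execution time, DML type, and target table out of one entry."""
    query_time = float(re.search(r'Query_time:\s*([\d.]+)', message).group(1))
    m = re.search(r'\b(DELETE\s+FROM|INSERT\s+INTO|UPDATE)\s+`?(\w+)`?',
                  message, re.IGNORECASE)
    return {
        'query_time': query_time,
        'dml': m.group(1).split()[0].upper() if m else None,
        'table': m.group(2) if m else None,
    }
```

Feeding each e['message'] from filter_log_events through this parser yields the per-query records that were compared against the metrics.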

Identifying the root cause

  • The relevant batch process rebuilt data in a full DELETE → full INSERT pattern,
    consuming large amounts of Aurora’s buffer pool and temporary space.
  • This exhausted memory and caused Aurora to automatically trigger failover.

Verification phase

Identifying slow queries and analyzing causes

  • Identified the advertising log aggregation batch process that was running during the day when the failure occurred.
  • This process repeatedly performed DELETE → INSERT DML operations on large volumes of data for the period from the 1st of each month to the previous day.
  • The large volume of DML caused a rapid increase in memory usage for temporary space, inducing I/O contention and performance delays.
  • Log analysis confirmed that free memory was under pressure during processing.

Improvement policy and design

  • Proposed and implemented monthly partitioning in MySQL.
    • Split each month’s advertising logs into independent partitions and changed the structure so that only the most recent month, which has the highest access frequency, is scanned.
    • Replaced the conventional DELETE → INSERT with TRUNCATE → INSERT (replacing DML with DDL).
    • Prevented memory overuse and transaction log bloat.
    • Objectives:
      • Improve index efficiency
      • Improve I/O locality
      • Reduce DML overhead
CREATE TABLE ad_logs (
  id BIGINT AUTO_INCREMENT,
  ad_id VARCHAR(32) NOT NULL,
  click_count INT NOT NULL,
  impression_count INT NOT NULL,
  log_date DATE NOT NULL,
  -- MySQL requires every unique key (including the primary key)
  -- to contain the partitioning column
  PRIMARY KEY (id, log_date)
)
ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(log_date)) (
  PARTITION p202509 VALUES LESS THAN (TO_DAYS('2025-10-01')),
  PARTITION p202510 VALUES LESS THAN (TO_DAYS('2025-11-01')),
  PARTITION pmax VALUES LESS THAN MAXVALUE
);
  • INSERT processing optimization
    • Changed the batch INSERT for reloading data from row-by-row to multi-row INSERT.
    • Reduced the number of writes per transaction, decreasing network round trips to Aurora and log I/O.
    • In the test environment, confirmed speed improvements compared to single-row INSERT.
    • In some tests, also tried direct loading via LOAD DATA LOCAL INFILE and demonstrated performance gains for bulk loads.
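The multi-row INSERT rewrite can be sketched as a helper that batches rows into chunked statements for a DB-API client such as PyMySQL; the chunk size of 1000 is an assumption, not the production value:

```python
def multi_row_insert(table, columns, rows, chunk_size=1000):
    """Yield (sql, params) pairs, each inserting up to chunk_size rows
    in one statement; chunk_size=1000 is illustrative."""
    cols = ', '.join(columns)
    row_placeholder = '(' + ', '.join(['%s'] * len(columns)) + ')'
    for i in range(0, len(rows), chunk_size):
        chunk = rows[i:i + chunk_size]
        sql = (f'INSERT INTO {table} ({cols}) VALUES '
               + ', '.join([row_placeholder] * len(chunk)))
        # Flatten the chunk so placeholders and parameters line up
        params = [value for row in chunk for value in row]
        yield sql, params
```

Each pair goes to cursor.execute(sql, params) on the DB-API client; keeping chunks bounded also caps per-transaction log volume.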

Verification in the staging environment

Sampled production data and verified the effects of partitioning and INSERT optimization. The comparison targets were “before application (DELETE+INSERT pattern)” and “after application (TRUNCATE+multi-row INSERT pattern).”

Verification items

  • Response time for full scans
  • Memory usage and I/O wait time
  • Impact on backup and replication

Verification results

  • The time-consuming DELETE processing completed in under one second with TRUNCATE.
  • Multi-row INSERT shortened the time required for reloading.
  • Significantly reduced memory consumption during DML operations and lowered the incidence of I/O waits.
  • Eliminated the risk of Aurora failover and confirmed stable operation.

Migration plan and recurrence prevention

Migration targets and prioritization

  • Since there were dozens of target tables, defined priorities along three axes: data size, update frequency, and business impact.
  • Performed phased migration starting with large tables where I/O load was concentrated, such as monthly aggregations and ad delivery logs.
  • Prior to migration, inventoried each table’s partitioning status, index structure, and schema compatibility,
    clarified dependencies, and then finalized the migration order.
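The three-axis prioritization can be expressed as a simple weighted score; the weights below are illustrative, not the ones actually used:

```python
def migration_priority(tables, weights=(0.5, 0.3, 0.2)):
    """Rank tables by a weighted score of data size, update frequency,
    and business impact (each normalized to 0-1); weights are illustrative."""
    w_size, w_freq, w_impact = weights

    def score(t):
        return (w_size * t['size']
                + w_freq * t['update_freq']
                + w_impact * t['impact'])

    # Highest score migrates first
    return sorted(tables, key=score, reverse=True)
```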

Phased migration policy

  • Batch-level migration: first targeted tables referenced and updated by batch processes,
    and gradually switched from the DELETE → INSERT pattern to the TRUNCATE → multi-row INSERT pattern.
  • Phased deployment: rolled out to staging and then production with failover detection enabled.
  • Rollback plan: took Aurora snapshots so that the old configuration could be restored immediately if anomalies were detected. Also backed up parameter groups so that any DB settings changed during migration could be reverted as needed.

Risk management during switchover

  • Limited switchover timing to late-night hours when access load is low.
  • Applied settings to immediately perform automatic recovery (failback) if failover is detected.
  • Immediately after migration, monitored slowquery and error logs at 5-minute intervals with CloudWatch Logs Insights to detect query performance anomalies and resource pressure early.
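The 5-minute-interval slowquery check can be run as a CloudWatch Logs Insights query from boto3. A sketch under stated assumptions; the query string and polling loop illustrate how such monitoring might be wired up, not the exact production query:

```python
import time

# Logs Insights query counting slow-query entries per 5-minute bucket;
# the query string is an assumption about how this check might look.
QUERY = (
    'fields @timestamp, @message '
    '| filter @message like /Query_time:/ '
    '| stats count(*) as slow_count by bin(5m)'
)

def run_insights_query(log_group, start_epoch, end_epoch):
    """Start the query and poll until it finishes (requires AWS credentials)."""
    import boto3
    logs = boto3.client('logs')
    q = logs.start_query(logGroupName=log_group, startTime=start_epoch,
                         endTime=end_epoch, queryString=QUERY)
    while True:
        result = logs.get_query_results(queryId=q['queryId'])
        if result['status'] in ('Complete', 'Failed', 'Cancelled'):
            return result
        time.sleep(1)
```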

Continuous monitoring and recurrence prevention framework

  • Automation of CloudWatch metrics monitoring
    • Set FreeableMemory, CPUUtilization, DiskQueueDepth, and Deadlocks as continuous monitoring targets.
    • When thresholds are exceeded, automatically send Slack notifications and take snapshots at the same time.
  • Weekly report operations
    • Periodically collected key metrics and slow query counts with a boto3 script.
    • Automatically graphed abnormal trends and quantitatively tracked performance degradation.
  • Monthly maintenance cycle
    • Regularly added partitions, dropped old data, and rebuilt indexes.
    • Continuously analyzed batch execution plans and tuned queries.
  • These mechanisms established a system-level framework to ensure recurrence prevention.
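The threshold alerting above can be sketched as a CloudWatch alarm on FreeableMemory that notifies an SNS topic (hypothetically wired to Slack); the 2 GiB threshold and the topic are assumptions, and snapshot-on-alarm could be handled by a Lambda subscribed to the same topic:

```python
DEFAULT_THRESHOLD_BYTES = 2 * 1024 ** 3  # ~2 GiB; illustrative threshold

def create_memory_alarm(instance_id, sns_topic_arn,
                        threshold_bytes=DEFAULT_THRESHOLD_BYTES):
    """Alarm when the per-minute minimum of FreeableMemory stays below the
    threshold for 5 consecutive minutes; the SNS topic (assumed to post to
    Slack) receives the notification. Requires AWS credentials."""
    import boto3
    cw = boto3.client('cloudwatch')
    cw.put_metric_alarm(
        AlarmName=f'{instance_id}-freeable-memory-low',
        Namespace='AWS/RDS',
        MetricName='FreeableMemory',
        Dimensions=[{'Name': 'DBInstanceIdentifier', 'Value': instance_id}],
        Statistic='Minimum',
        Period=60,
        EvaluationPeriods=5,
        Threshold=float(threshold_bytes),
        ComparisonOperator='LessThanThreshold',
        AlarmActions=[sns_topic_arn],
    )
```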

Results and outcomes

Quantitative outcomes

Below are candidate indicators for quantifying the outcomes of the measures above (read replicas / automatic failover / strengthened metrics / partitioning and TRUNCATE + multi-row INSERT), along with what each shows and how to measure it.

  • Number of failovers (monthly average)
    What it shows: how often server switchover troubles occur; a decrease is evidence that stability and recovery mechanisms are in place.
    How to check: review AWS event logs and alerts, organized by cause (failure / operational error / load); also track time to recovery.
  • Batch processing time
    What it shows: time required for data aggregation and recalculation; shorter times mean recovery and periodic processing are more efficient.
    How to check: automatically record the start and end times of processing and track averages and 95th-percentile times.
  • RTO (time until switchover completes)
    What it shows: speed from failure occurrence to service recovery; stability even during normal business hours is reassuring.
    How to check: measure the gap between the failure timestamp and the time the application becomes usable again.
  • Minimum memory and fluctuation range
    What it shows: whether the server runs with sufficient headroom; extreme dips signal that another incident may be near.
    How to check: use AWS monitoring to compare minimum values and fluctuation ranges across periods.
  • Disk I/O wait time (P95)
    What it shows: how smoothly reads and writes proceed; a slowdown is a sign that processing congestion is near.
    How to check: track it over time in monitoring and compare against the values just before the incident.
  • Share of top slow queries
    What it shows: how much a few slow operations affect the whole; an indicator of continuous improvement of the heaviest queries.
    How to check: rank slow queries and track what percentage of the total the top entries account for.
  • DDL lock impact time (blocking time)
    What it shows: the impact of maintenance work, i.e. whether the window in which users temporarily cannot use the system is shrinking.
    How to check: automatically aggregate the time during which reads and writes were blocked during the work.

Qualitative outcomes

  • Visualized the root cause of failover and established a reproducible analysis framework.
  • By reviewing partition design, batch structure, and monitoring design as a whole, transformed operations so that the operations team can “understand and judge” failure trends numerically.
  • Shared knowledge about performance tuning between development and operations, forming a flow that incorporates load testing, monitoring, and tuning into the overall development process.
  • The data platform team institutionalized monthly risk reviews and established an operational process to detect and address failover precursors in advance.

Reflections and learnings

Technical perspective

  • Failover was not just a failure phenomenon, but the result of the overall system load design exceeding its critical point.
    Through correlation analysis of metrics and slow query logs, it was possible to identify “DML design,” rather than “CPU or memory,” as the root cause.

Operational and organizational perspective

  • By organizing incident response as a loop of observation, hypothesis, and verification rather than “solving individual events,” the entire team was able to work on recurrence prevention using a common analytical framework.
  • By integrating CloudWatch, boto3, and Aurora logs into an observability platform, the team moved away from person-dependent troubleshooting. This became the first step toward automating and visualizing the incident response process.

Future developments

  • Aim for an architecture that maintains stable throughput even during traffic peaks by making batch processing asynchronous and introducing queuing (SQS, Lambda integration).
  • Through regular performance reviews, evolve from “operations for monitoring” to “monitoring for improvement.”
