Recovery from MySQL Failure and Stabilization of Write Performance

  • mysql
  • aurora
  • cloudwatch
  • aws
  • chartjs
  • typescript
  • python
Published on 2021/06/13

Summary

  • Issue: The production DB failed during business hours; retry traffic spiked and overall response degraded. Visualization was also insufficient, so the risk of recurrence could not be detected in advance.
  • Initial response: Set up Aurora automatic failover and read-replica distribution immediately, and established continuous monitoring of slow queries, CloudWatch metrics, and error logs.
  • Investigation: Correlated CloudWatch metrics with event logs and identified the batch process's DELETE → INSERT pattern as the source of memory and I/O pressure.
  • Action: Partitioned the table by month and reworked the batch to TRUNCATE + multi-row INSERT. Operationally, combined phased migration, instant rollback, and weekly reporting.
  • Results / Outcomes: Suppressed the risk of recurring unplanned failovers and stabilized batch time and load fluctuation, improving overall system availability and decision-making speed.

Incident overview

The production DB server terminated abnormally during the day (business hours). Responses across the entire application stopped for several minutes.

It was automatically restarted immediately afterward, but there was a temporary spike in load due to transaction retries.

Initial response (ensuring availability)

Introduction of read replicas

Distributed read traffic to Aurora read replicas to reduce load on the primary.
By configuring the system so that the application can fail over to read replicas in the event of a failure, the business impact of any recurrence was minimized.

Additional failover settings

Enabled Aurora’s automatic failover feature.
Configured automatic switchover to a read replica within several tens of seconds when the primary fails, minimizing system downtime.
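One way to verify this failover configuration is to read each cluster member's promotion tier via boto3 (Aurora promotes the healthy replica with the lowest tier first). A minimal sketch; the cluster identifier passed in is hypothetical, and the pure sorting helper can be reused offline:

```python
def promotion_order(members):
    """Order cluster members by PromotionTier; Aurora promotes the
    lowest-tier healthy replica first on failover."""
    return sorted(members,
                  key=lambda m: (m['PromotionTier'], m['DBInstanceIdentifier']))

def fetch_cluster_members(cluster_id):
    """Fetch member identifiers and tiers (requires AWS credentials;
    the cluster id is a placeholder)."""
    import boto3
    rds = boto3.client('rds')
    resp = rds.describe_db_instances(
        Filters=[{'Name': 'db-cluster-id', 'Values': [cluster_id]}]
    )
    return [{'DBInstanceIdentifier': i['DBInstanceIdentifier'],
             'PromotionTier': i['PromotionTier']}
            for i in resp['DBInstances']]
```

Calling promotion_order(fetch_cluster_members('my-aurora-cluster')) would list the expected failover order for a cluster named my-aurora-cluster (a placeholder name).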

Strengthening metrics and log collection

Continuously collected the following via CloudWatch Metrics:

  • CPU utilization, memory usage, I/O wait, number of connections, query latency
  • Aurora-specific metrics such as DatabaseConnections, Deadlocks, FreeableMemory, DiskQueueDepth

Additionally, enabled slow_query_log and performance_schema to allow bottleneck analysis at the SQL level.
This improved the accuracy of bottleneck identification for reproducing the failure and preventing recurrence.
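On Aurora MySQL, slow query logging is switched on through the instance's DB parameter group rather than at runtime. A hedged sketch of those parameter changes via boto3; the parameter group name and the one-second long_query_time threshold are assumptions, not values from the original setup:

```python
# Parameter changes that surface slow queries; the values are illustrative,
# not the thresholds used in the original incident response.
SLOW_QUERY_PARAMS = [
    {'ParameterName': 'slow_query_log', 'ParameterValue': '1',
     'ApplyMethod': 'immediate'},
    {'ParameterName': 'long_query_time', 'ParameterValue': '1',
     'ApplyMethod': 'immediate'},
    {'ParameterName': 'log_output', 'ParameterValue': 'FILE',
     'ApplyMethod': 'immediate'},
]

def enable_slow_query_log(parameter_group_name):
    """Apply the parameters to the instance-level DB parameter group
    (the group name is a placeholder; requires AWS credentials)."""
    import boto3
    rds = boto3.client('rds')
    rds.modify_db_parameter_group(
        DBParameterGroupName=parameter_group_name,
        Parameters=SLOW_QUERY_PARAMS,
    )
```

Note that performance_schema is a static parameter, so enabling it needs ApplyMethod 'pending-reboot' plus an instance restart.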

Investigation and measurement phase

Checking failover occurrences

  • Retrieved Aurora event logs via CloudWatch Logs and identified failover occurrence times for the past month.
  • Used a boto3 script to automatically scan log streams and detect failover timestamps.
  • Confirmed that occurrences were concentrated during business hours and that this was a reproducible load event.
import boto3
import datetime

logs = boto3.client('logs')

# Look back 30 days from now (the window analyzed for past failovers)
end = int(datetime.datetime.utcnow().timestamp())
start = end - 60 * 60 * 24 * 30

# Filter the error log group for the RDS failover event code
resp = logs.filter_log_events(
    logGroupName='/aws/rds/instance/my-aurora-instance/error',
    startTime=start * 1000,   # filter_log_events expects milliseconds
    endTime=end * 1000,
    filterPattern='RDS-EVENT-0169'
)
# For long windows, follow resp.get('nextToken') to page through results
for e in resp['events']:
    print(e['timestamp'], e['message'])

Analysis of resource metrics

  • Checked resource status immediately before failover using CloudWatch Metrics.
  • Observed a sharp drop in FreeableMemory and identified memory exhaustion as a possible trigger for failover.
  • At the same time, CPU utilization and I/O wait also increased, suggesting concentrated I/O load.
cw = boto3.client('cloudwatch')

# FreeableMemory over the 10 minutes leading up to the event,
# one datapoint per minute
metrics = cw.get_metric_statistics(
    Namespace='AWS/RDS',
    MetricName='FreeableMemory',
    Dimensions=[{'Name': 'DBInstanceIdentifier', 'Value': 'my-aurora-instance'}],
    StartTime=datetime.datetime.utcnow() - datetime.timedelta(minutes=10),
    EndTime=datetime.datetime.utcnow(),
    Period=60,
    Statistics=['Average', 'Minimum']
)
print(metrics['Datapoints'])

Collecting slow query logs and correlation analysis

  • Since Aurora automatically forwards slow_query_log to CloudWatch Logs, extracted slow queries from the same log group using boto3.
  • Analyzed queries issued in the few minutes immediately before failover, extracted execution time, table name, and DML type, and compared them with metrics.
  • Analysis confirmed that multiple DELETE and INSERT queries were being issued simultaneously against the same table.
  • In particular, a batch process that fully replaced advertising data at the beginning of the month was running during the relevant time window.
# The slowquery log group sits alongside the error log group;
# "Query_time:" marks the header of each slow-log entry
resp = logs.filter_log_events(
    logGroupName='/aws/rds/instance/my-aurora-instance/slowquery',
    startTime=start * 1000,
    endTime=end * 1000,
    filterPattern='"Query_time:"'
)
for e in resp['events']:
    print(e['timestamp'], e['message'])
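The extraction of execution time, table name, and DML type from each slow log entry can be sketched as follows; the sample entry mimics MySQL's slow log format, and real entries may carry additional fields:

```python
import re

# One slowquery event body, mimicking MySQL's slow log format; real
# entries may include extra fields such as timestamps and thread ids.
SAMPLE = (
    "# Query_time: 12.345678  Lock_time: 0.000123 "
    "Rows_sent: 0  Rows_examined: 1048576\n"
    "SET timestamp=1623541200;\n"
    "DELETE FROM ad_logs WHERE log_date >= '2021-06-01';"
)

def parse_slow_entry(message):
    """Pull execution time, DML type, and target table out of one entry."""
    query_time = float(re.search(r'Query_time:\s*([\d.]+)', message).group(1))
    m = re.search(r'\b(DELETE\s+FROM|INSERT\s+INTO|UPDATE)\s+`?(\w+)`?',
                  message, re.IGNORECASE)
    return {
        'query_time': query_time,
        'dml': m.group(1).split()[0].upper() if m else None,
        'table': m.group(2) if m else None,
    }
```

Feeding each e['message'] from filter_log_events through this parser yields the per-query records that were compared against the metrics.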

Identifying the root cause

  • The relevant batch process rebuilt data in a full DELETE → full INSERT pattern,
    consuming large amounts of Aurora’s buffer pool and temporary space.
  • This exhausted memory and caused Aurora to automatically trigger failover.

Verification phase

Identifying slow queries and analyzing causes

  • Identified the advertising log aggregation batch process that was running during the day when the failure occurred.
  • This process repeatedly performed DELETE → INSERT DML operations on large volumes of data for the period from the 1st of each month to the previous day.
  • The large volume of DML caused a rapid increase in memory usage for temporary space, inducing I/O contention and performance delays.
  • Log analysis confirmed that free memory was under pressure during processing.

Improvement policy and design

  • Proposed and implemented monthly partitioning in MySQL.
    • Split each month’s advertising logs into independent partitions and changed the structure so that only the most recent month, which has the highest access frequency, is scanned.
    • Replaced the conventional DELETE → INSERT with TRUNCATE → INSERT (replacing DML with DDL).
    • Prevented memory overuse and transaction log bloat.
    • Objectives:
      • Improve index efficiency
      • Improve I/O locality
      • Reduce DML overhead
CREATE TABLE ad_logs (
  id BIGINT AUTO_INCREMENT,
  ad_id VARCHAR(32) NOT NULL,
  click_count INT NOT NULL,
  impression_count INT NOT NULL,
  log_date DATE NOT NULL,
  -- MySQL requires every unique key (including the primary key)
  -- to contain the partitioning column
  PRIMARY KEY (id, log_date)
)
ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(log_date)) (
  PARTITION p202509 VALUES LESS THAN (TO_DAYS('2025-10-01')),
  PARTITION p202510 VALUES LESS THAN (TO_DAYS('2025-11-01')),
  PARTITION pmax VALUES LESS THAN MAXVALUE
);
  • INSERT processing optimization
    • Changed the batch INSERT for reloading data from row-by-row to multi-row INSERT.
    • Reduced the number of writes per transaction, decreasing network round trips to Aurora and log I/O.
    • In the test environment, confirmed speed improvements compared to single-row INSERT.
    • In some tests, also tried direct loading via LOAD DATA LOCAL INFILE and demonstrated performance gains for bulk loads.
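The multi-row INSERT rewrite can be sketched as a helper that batches rows into chunked statements for a DB-API client such as PyMySQL; the chunk size of 1000 is an assumption, not the production value:

```python
def multi_row_insert(table, columns, rows, chunk_size=1000):
    """Yield (sql, params) pairs, each inserting up to chunk_size rows
    in one statement; chunk_size=1000 is illustrative."""
    cols = ', '.join(columns)
    row_placeholder = '(' + ', '.join(['%s'] * len(columns)) + ')'
    for i in range(0, len(rows), chunk_size):
        chunk = rows[i:i + chunk_size]
        sql = (f'INSERT INTO {table} ({cols}) VALUES '
               + ', '.join([row_placeholder] * len(chunk)))
        # Flatten the chunk so placeholders and parameters line up
        params = [value for row in chunk for value in row]
        yield sql, params
```

Each pair goes to cursor.execute(sql, params) on the DB-API client; keeping chunks bounded also caps per-transaction log volume.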

Verification in the staging environment

Sampled production data and verified the effects of partitioning and INSERT optimization. The comparison targets were “before application (DELETE+INSERT pattern)” and “after application (TRUNCATE+multi-row INSERT pattern).”

Verification items

  • Response time for full scans
  • Memory usage and I/O wait time
  • Impact on backup and replication

Verification results

  • The time-consuming DELETE processing completed in under one second with TRUNCATE.
  • Multi-row INSERT shortened the time required for reloading.
  • Significantly reduced memory consumption during DML operations and lowered the incidence of I/O waits.
  • Eliminated the risk of Aurora failover and confirmed stable operation.

Migration plan and recurrence prevention

Migration targets and prioritization

  • Since there were dozens of target tables, defined priorities along three axes: data size, update frequency, and business impact.
  • Performed phased migration starting with large tables where I/O load was concentrated, such as monthly aggregations and ad delivery logs.
  • Prior to migration, inventoried each table’s partitioning status, index structure, and schema compatibility,
    clarified dependencies, and then finalized the migration order.
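The three-axis prioritization can be expressed as a simple weighted score; the weights below are illustrative, not the ones actually used:

```python
def migration_priority(tables, weights=(0.5, 0.3, 0.2)):
    """Rank tables by a weighted score of data size, update frequency,
    and business impact (each normalized to 0-1); weights are illustrative."""
    w_size, w_freq, w_impact = weights

    def score(t):
        return (w_size * t['size']
                + w_freq * t['update_freq']
                + w_impact * t['impact'])

    # Highest score migrates first
    return sorted(tables, key=score, reverse=True)
```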

Phased migration policy

  • Batch-level migration: first targeted tables referenced and updated by batch processes,
    and gradually switched from the DELETE → INSERT pattern to the TRUNCATE → multi-row INSERT pattern.
  • Phased deployment: rolled out to staging and then production with failover detection enabled.
  • Rollback plan: took Aurora snapshots so that the old configuration could be restored immediately if anomalies were detected. Also backed up parameter groups so that any DB settings changed during migration could be reverted as needed.

Risk management during switchover

  • Limited switchover timing to late-night hours when access load is low.
  • Applied settings to immediately perform automatic recovery (failback) if failover is detected.
  • Immediately after migration, monitored slowquery and error logs at 5-minute intervals with CloudWatch Logs Insights to detect query performance anomalies and resource pressure early.
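The 5-minute-interval slowquery check can be run as a CloudWatch Logs Insights query from boto3. A sketch under stated assumptions; the query string and polling loop illustrate how such monitoring might be wired up, not the exact production query:

```python
import time

# Logs Insights query counting slow-query entries per 5-minute bucket;
# the query string is an assumption about how this check might look.
QUERY = (
    'fields @timestamp, @message '
    '| filter @message like /Query_time:/ '
    '| stats count(*) as slow_count by bin(5m)'
)

def run_insights_query(log_group, start_epoch, end_epoch):
    """Start the query and poll until it finishes (requires AWS credentials)."""
    import boto3
    logs = boto3.client('logs')
    q = logs.start_query(logGroupName=log_group, startTime=start_epoch,
                         endTime=end_epoch, queryString=QUERY)
    while True:
        result = logs.get_query_results(queryId=q['queryId'])
        if result['status'] in ('Complete', 'Failed', 'Cancelled'):
            return result
        time.sleep(1)
```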

Continuous monitoring and recurrence prevention framework

  • Automation of CloudWatch metrics monitoring
    • Set FreeableMemory, CPUUtilization, DiskQueueDepth, and Deadlocks as continuous monitoring targets.
    • When thresholds are exceeded, automatically send Slack notifications and take snapshots at the same time.
  • Weekly report operations
    • Periodically collected key metrics and slow query counts with a boto3 script.
    • Automatically graphed abnormal trends and quantitatively tracked performance degradation.
  • Monthly maintenance cycle
    • Regularly added partitions, dropped old data, and rebuilt indexes.
    • Continuously analyzed batch execution plans and tuned queries.
  • These mechanisms established a system-level framework to ensure recurrence prevention.
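The threshold alerting above can be sketched as a CloudWatch alarm on FreeableMemory that notifies an SNS topic (hypothetically wired to Slack); the 2 GiB threshold and the topic are assumptions, and snapshot-on-alarm could be handled by a Lambda subscribed to the same topic:

```python
DEFAULT_THRESHOLD_BYTES = 2 * 1024 ** 3  # ~2 GiB; illustrative threshold

def create_memory_alarm(instance_id, sns_topic_arn,
                        threshold_bytes=DEFAULT_THRESHOLD_BYTES):
    """Alarm when the per-minute minimum of FreeableMemory stays below the
    threshold for 5 consecutive minutes; the SNS topic (assumed to post to
    Slack) receives the notification. Requires AWS credentials."""
    import boto3
    cw = boto3.client('cloudwatch')
    cw.put_metric_alarm(
        AlarmName=f'{instance_id}-freeable-memory-low',
        Namespace='AWS/RDS',
        MetricName='FreeableMemory',
        Dimensions=[{'Name': 'DBInstanceIdentifier', 'Value': instance_id}],
        Statistic='Minimum',
        Period=60,
        EvaluationPeriods=5,
        Threshold=float(threshold_bytes),
        ComparisonOperator='LessThanThreshold',
        AlarmActions=[sns_topic_arn],
    )
```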

Results and outcomes

Quantitative outcomes

Below are candidate indicators for quantifying the outcomes of the measures above (read replicas / automatic failover / strengthened metrics / partitioning and TRUNCATE + multi-row INSERT), along with what each shows and how to measure it.

  • Number of failovers (monthly average)
    What it shows: how often server switchover troubles occur; a decrease is evidence that stability and recovery mechanisms are in place.
    How to check: review AWS event logs and alerts, organized by cause (failure / operational error / load); also track time to recovery.
  • Batch processing time
    What it shows: time required for data aggregation and recalculation; shorter times mean recovery and periodic processing are more efficient.
    How to check: automatically record the start and end times of processing and track averages and 95th-percentile times.
  • RTO (time until switchover completes)
    What it shows: speed from failure occurrence to service recovery; stability even during normal business hours is reassuring.
    How to check: measure the gap between the failure timestamp and the time the application becomes usable again.
  • Minimum memory and fluctuation range
    What it shows: whether the server runs with sufficient headroom; extreme dips signal that another incident may be near.
    How to check: use AWS monitoring to compare minimum values and fluctuation ranges across periods.
  • Disk I/O wait time (P95)
    What it shows: how smoothly reads and writes proceed; a slowdown is a sign that processing congestion is near.
    How to check: track it over time in monitoring and compare against the values just before the incident.
  • Share of top slow queries
    What it shows: how much a few slow operations affect the whole; an indicator of continuous improvement of the heaviest queries.
    How to check: rank slow queries and track what percentage of the total the top entries account for.
  • DDL lock impact time (blocking time)
    What it shows: the impact of maintenance work, i.e. whether the window in which users temporarily cannot use the system is shrinking.
    How to check: automatically aggregate the time during which reads and writes were blocked during the work.

Qualitative outcomes

  • Visualized the root cause of failover and established a reproducible analysis framework.
  • By reviewing partition design, batch structure, and monitoring design as a whole, transformed operations so that the operations team can “understand and judge” failure trends numerically.
  • Shared knowledge about performance tuning between development and operations, forming a flow that incorporates load testing, monitoring, and tuning into the overall development process.
  • The data platform team institutionalized monthly risk reviews and established an operational process to detect and address failover precursors in advance.

Reflections and learnings

Technical perspective

  • Failover was not just a failure phenomenon, but the result of the overall system load design exceeding its critical point.
    Through correlation analysis of metrics and slow query logs, it was possible to identify “DML design,” rather than “CPU or memory,” as the root cause.

Operational and organizational perspective

  • By organizing incident response as a loop of observation, hypothesis, and verification rather than “solving individual events,” the entire team was able to work on recurrence prevention using a common analytical framework.
  • By integrating CloudWatch, boto3, and Aurora logs into an observability platform, the team moved away from person-dependent troubleshooting. This became the first step toward automating and visualizing the incident response process.

Future developments

  • Aim for an architecture that maintains stable throughput even during traffic peaks by making batch processing asynchronous and introducing queuing (SQS, Lambda integration).
  • Through regular performance reviews, evolve from “operations for monitoring” to “monitoring for improvement.”
