Making Search Index Updates Asynchronous to Improve Peak-Time Resilience
Summary
| Perspective | Content |
|---|---|
| Issue | Inventory and price updates were written to ElasticSearch synchronously, so write processing slowed or timed out when load concentrated. Availability became unstable in spike situations such as sales. |
| Response | Rebuilt the flow as an asynchronous pipeline (Outbox → Pub/Sub → Indexer). In addition, adopted partial upserts so that only minimal diffs are applied, making updates lighter. |
| Outcome | Write latency stayed stable while updates continued to reach ElasticSearch with little delay. Inventory freshness and search reliability now coexist even under peak load. |
| Ripple effects | This asynchronous update method became a standard pattern within the company. Rolling it out horizontally to other APIs and projects produced an update foundation that combines load leveling with ease of extension, continuously reducing operational costs. |
Background / Issues
Service architecture and assumptions
The system is structured like a lodging reservation service, where listings are searched using composite conditions such as “date × number of people × price × amenities × location.”
Inventory and prices change daily, and during weekends or sales both “update events (inventory/price)” and “search requests” increase sharply at the same time.
Existing mechanism and problems at the time
Previously, inventory and price updates synchronously triggered index updates to ElasticSearch.
With this architecture, the following problems became apparent:
- DB locks and ElasticSearch update I/O overlapped, degrading write response times.
- When bulk update events concentrated, timeouts and retries cascaded.
Technical challenges to solve
- Optimization of partial updates: Structure the data so that only changed attributes such as price or inventory can be upserted as diffs.
- Load leveling: Design the system so that it can process large volumes of updates stably, even when many updates occur, such as at the start of a sale or during bulk calendar updates.
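The partial-update idea above can be sketched in a few lines. This is a minimal, hypothetical illustration (an in-memory dict standing in for the search index; function and field names are not from the production code): only the changed attributes are merged into the stored document instead of rebuilding it, which is the behavior ElasticSearch's `_update` API with `doc_as_upsert` provides.

```python
# Minimal sketch of a partial upsert: merge only the changed attributes
# into the stored document, creating the document if it does not exist.
# (Hypothetical in-memory store standing in for a search index.)

def partial_upsert(index: dict, doc_id: str, diff: dict) -> dict:
    """Apply only the changed fields (diff) to the document with doc_id."""
    doc = index.setdefault(doc_id, {})
    doc.update(diff)          # untouched attributes are left as-is
    return doc

index = {"listing-1": {"price": 12000, "stock": 3, "amenities": ["wifi"]}}

# A price change touches only the "price" field; stock and amenities survive.
partial_upsert(index, "listing-1", {"price": 9800})
print(index["listing-1"])   # {'price': 9800, 'stock': 3, 'amenities': ['wifi']}
```

Because the diff carries only what changed, the indexer never has to rebuild the full document, which is what keeps update load and reflection delay small.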
Business objectives
- Maintain user experience: Prevent delays and failures in inventory and price updates, keeping operations smooth.
- Reliability of the search experience: Ensure search results reflect the latest state, preventing sold-out items or incorrect displays.
Approach
Basic policy (core of peak load leveling)
- Eliminate synchronous updates and move to asynchrony
  We revisited the architecture in which the app updated ElasticSearch directly and introduced an Outbox → Pub/Sub → Indexer pipeline. Write processing now responds immediately, with ElasticSearch updates separated into asynchronous processing, so throughput stays stable even at peak times.
- Lightweight updates via partial upsert
  Only changed attributes such as price, inventory, and amenities are upserted as diffs. This avoids full reindexing and minimizes ElasticSearch update load and reflection delay.
- Maintain consistency via idempotency, last-write-wins, and reprocessing
  Each event carries an `event_id` and a `version`/`updated_at` to prevent duplicate application and out-of-order updates. We built in retry, DLQ, and backfill reprocessing so that eventual consistency is guaranteed even through temporary failures.
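The idempotency and last-write-wins guards can be sketched as follows. This is an illustrative sketch only (the event shape, `event_id`, and `version` fields mirror the description above, but the concrete names are hypothetical): duplicates are dropped by `event_id`, and stale out-of-order events are dropped by comparing `version`.

```python
# Sketch of idempotent, last-write-wins event application in the Indexer.
# event_id is the dedup key; version is the ordering key. Shapes are
# illustrative, not the production schema.

applied_event_ids: set[str] = set()
current_versions: dict[str, int] = {}   # entity_id -> last applied version
index: dict[str, dict] = {}             # entity_id -> indexed document

def apply_event(event: dict) -> bool:
    """Apply an update event unless duplicate or stale. Returns True if applied."""
    if event["event_id"] in applied_event_ids:
        return False                     # duplicate delivery: skip (idempotency)
    if event["version"] <= current_versions.get(event["entity_id"], -1):
        applied_event_ids.add(event["event_id"])
        return False                     # out-of-order / stale: last write wins
    index.setdefault(event["entity_id"], {}).update(event["diff"])
    current_versions[event["entity_id"]] = event["version"]
    applied_event_ids.add(event["event_id"])
    return True

# Duplicate and out-of-order deliveries converge to the same final state.
apply_event({"event_id": "e2", "entity_id": "a", "version": 2, "diff": {"price": 9800}})
apply_event({"event_id": "e1", "entity_id": "a", "version": 1, "diff": {"price": 12000}})  # stale
apply_event({"event_id": "e2", "entity_id": "a", "version": 2, "diff": {"price": 9800}})   # dup
print(index["a"])   # {'price': 9800}
```

Under these two guards, any interleaving of retries and redeliveries converges to the state of the highest version per entity, which is what makes the DLQ/backfill reprocessing safe to run repeatedly.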
Operations and monitoring policy
- Phased release
  Introduce the new flow step by step for each update type (e.g., bulk calendar updates), control it with feature flags, and observe load characteristics.
- Strengthened observability
  Continuously monitor freshness from event occurrence to ElasticSearch reflection, queue backlog, and convergence of reprocessing.
- Ensuring scalability
  Design on the assumption that Pub/Sub and the Indexer can scale horizontally and that the ElasticSearch shard configuration can be expanded, to absorb sudden traffic spikes.
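The per-update-type feature flags used for the phased release can be sketched like this (flag names and the routing function are hypothetical): each update type is routed to the new asynchronous path only when its flag is on, so migration can proceed one update type at a time.

```python
# Sketch of flag-controlled phased release: each update type is routed to
# the new asynchronous pipeline only when its feature flag is enabled.
# Flag names and pipeline labels are hypothetical.

ASYNC_FLAGS = {
    "calendar_bulk_update": True,   # migrated first to observe load characteristics
    "inventory_update": False,      # still on the synchronous path
    "listing_update": False,
}

def route_update(update_type: str) -> str:
    """Return which pipeline handles this update type."""
    if ASYNC_FLAGS.get(update_type, False):
        return "outbox->pubsub->indexer"   # new asynchronous pipeline
    return "synchronous-index-update"      # legacy path (safe default)

print(route_update("calendar_bulk_update"))  # outbox->pubsub->indexer
print(route_update("inventory_update"))      # synchronous-index-update
```

Defaulting unknown types to the legacy path keeps the rollout conservative: nothing moves to the new pipeline unless explicitly flagged.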
System architecture (Before)
System architecture (After)
Investigation and measurement phase
Objectives (What to prove)
- Write-side response times remain stably maintained even at peak times.
- Reflections to the search index continue to complete while maintaining a certain level of freshness.
- Any out-of-order, duplicate, or missing events that may occur due to asynchronous processing will still converge to a consistent state, including via reprocessing.
Metrics design (definitions and measurement methods)
- Write latency: Measure inside the app plus APM traces to obtain the latency distribution (p50/p95/p99) of requests.
- Reflection freshness (event → ElasticSearch reflection)
  - Definition: `index_lag = t(indexed_at) - t(event_occurred_at)`
  - Collection: Record `occurred_at` in the Outbox and `indexed_at` when the Indexer completes, then correlate them per entity ID.
- Quality (accuracy)
- Update events are processed without duplication, out-of-order, or omission.
- Even when failures occur, reprocessing brings the system to eventual consistency.
- The event state (received, processed, reflected) is traceable across the entire processing path.
- Operational health: Visualize queue backlog length, consumption rate, DLQ count, reprocessing success rate, and retry count distribution.
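The freshness metric above can be computed with a simple join on entity ID. This is a sketch under assumed record shapes (plain epoch-second floats; the field names `occurred_at` and `indexed_at` follow the definitions above, everything else is illustrative):

```python
# Sketch of the freshness metric: index_lag = indexed_at - occurred_at,
# correlated per entity ID. Timestamps are plain floats (epoch seconds);
# record shapes are illustrative.

def index_lags(outbox_records: list[dict], indexer_records: list[dict]) -> dict[str, float]:
    """Correlate occurred_at (Outbox) with indexed_at (Indexer) per entity ID."""
    occurred = {r["entity_id"]: r["occurred_at"] for r in outbox_records}
    return {
        r["entity_id"]: r["indexed_at"] - occurred[r["entity_id"]]
        for r in indexer_records
        if r["entity_id"] in occurred
    }

outbox = [{"entity_id": "a", "occurred_at": 100.0}, {"entity_id": "b", "occurred_at": 105.0}]
indexed = [{"entity_id": "a", "indexed_at": 101.5}, {"entity_id": "b", "indexed_at": 112.0}]

lags = index_lags(outbox, indexed)
print(lags)                 # {'a': 1.5, 'b': 7.0}
print(max(lags.values()))   # 7.0 -> worst-case freshness, feeds dashboards/alerts
```

In practice the per-entity lags would be aggregated into a distribution (p50/p95/p99) the same way as write latency, so freshness regressions show up alongside latency regressions.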
Baseline measurement (Before)
- Items collected:
- Write latency and timeout rate when reflecting synchronously
- Index reflection freshness with synchronous reflection (effectively immediate) and failure behavior under spikes
Load model (mimicking realistic usage)
- Write events:
- Moderate update frequency during normal times, increasing several-fold at peak times such as sales or the start of weekends.
- Breakdown: price calendar updates are the majority (about 60%), inventory reservation/release (about 30%), listing information changes (about 10%).
- Search traffic:
- Increases together with writes (weekend/holiday scenarios).
Migration work
Gradual release
- Internally, we began appending events to the Outbox at DB update time, and used those events to have the new asynchronous path (Indexer) update a validation ElasticSearch cluster.
- For a certain period, the old path (synchronous updates) ran in parallel, and we compared and monitored the reflection results of both.
Switching the operations and monitoring setup
- Visualized Pub/Sub, Indexer, and DLQ in monitoring tools, and created dashboards for processing delay (lag), backlog count, and failure rate.
- Also monitored reflection delay on the ElasticSearch side and set up automatic notifications (alert policies) when thresholds were exceeded.
Feeding back verification results
- Identified diffs to determine which attributes (price or inventory) had more discrepancies
- Tuned the update logic accordingly
- After confirming that index reflection accuracy had stabilized, moved to full production use
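The diff identification step can be sketched as a per-attribute comparison between the two indexes (old synchronous vs. validation asynchronous). The document shapes here are hypothetical; the point is counting mismatches per attribute so you can see whether price or inventory drifts more:

```python
# Sketch of the cross-check between the old synchronous index and the
# validation (asynchronous) index: count, per attribute, how many documents
# disagree. Document shapes are illustrative.
from collections import Counter

def attribute_diffs(old_index: dict, new_index: dict) -> Counter:
    """Count mismatching attributes across documents present in both indexes."""
    diffs = Counter()
    for doc_id in old_index.keys() & new_index.keys():
        old_doc, new_doc = old_index[doc_id], new_index[doc_id]
        for attr in old_doc.keys() | new_doc.keys():
            if old_doc.get(attr) != new_doc.get(attr):
                diffs[attr] += 1
    return diffs

old = {"l1": {"price": 9800, "stock": 3}, "l2": {"price": 15000, "stock": 0}}
new = {"l1": {"price": 9800, "stock": 2}, "l2": {"price": 15000, "stock": 0}}

print(attribute_diffs(old, new))   # Counter({'stock': 1})
```

A per-attribute breakdown like this is what lets the tuning effort target the worst-drifting field first instead of reworking the whole update path.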
Release completion criteria
- Stop the synchronous path for all update types and fully migrate to the asynchronous pipeline
Results
Quantitative outcomes
Performance (stabilizing writes)
- Spikes (latency increases under high load) that occurred with synchronous updates were eliminated.
- Even during sales and weekend peaks, write API response times remained stable within a certain range.
- Timeout and retry rates decreased significantly, preserving the user operation experience.
Availability and throughput (peak load leveling)
- By offloading ElasticSearch update processing to the asynchronous path, the app side is no longer burdened by its CPU and I/O load.
- Queue processing throughput can be scaled out dynamically, so temporary traffic concentration is handled without processing delays.
- DLQ and reprocessing jobs run stably, achieving rapid self-recovery.
Accuracy (maintaining consistency)
- Idempotent processing and last-write-wins control via `version` / `updated_at` worked as intended, and misupdates due to out-of-order or duplicate application converged to zero.
- Cross-checks with the old (synchronous) path showed that the difference rate was extremely low and stable.
Qualitative outcomes
Improved development and operations experience
- Increased operational peace of mind
  Separating ElasticSearch updates from the core app significantly reduced alert frequency at peak times. Concerns such as "the DB might be clogged" were alleviated, allowing the team to spend more time on improvements rather than incident response.
- Psychological safety through phased release
  By running the synchronous and asynchronous paths in parallel and visualizing reflection diffs during the switchover, a shared understanding took hold across the organization that "we migrate only after understanding the risks."
Improved user experience and operational efficiency
- Stable operation response
  The perceived immediacy of price and inventory updates was maintained, enabling users to confidently and frequently adjust prices and manage inventory from the admin console.
- Reduced support load
  Fewer update delays and reflection failures meant fewer inquiries to the support team, achieving both lower operational costs and improved customer satisfaction.
Ripple effects on the organization and technical foundation
- Standardization of asynchronous processing design principles within the company
  Triggered by this initiative, the Outbox → Pub/Sub → Worker architecture was rolled out to other update APIs (notifications, email sending, aggregation processing, etc.). The architectural principle of "not holding updates synchronously in the app" is now shared and reused across teams.
Future developments
The Outbox → Pub/Sub → Worker model established here can be applied to asynchronous processing not only for index updates but also for the following:
- Cache updates and CDN purge
- Notifications (email / SMS / push)
- External API integrations (payments, CRM, MA, etc.)
- Ingestion of aggregation/analytics events
- Image processing and thumbnail generation
- Collection of audit logs and activity logs
- Generation of suggestion dictionaries and auxiliary data
- ML feature updates
- Gradual state transitions (e.g., order workflows)
By progressively offloading “areas where synchronous processing can become a source of delay” to queues, we can promote long-term, system-wide load leveling and availability improvements.