Making Search Index Updates Asynchronous to Improve Peak-Time Resilience
Summary
| Perspective | Content |
|---|---|
| Issue | Inventory and price updates were written to ElasticSearch synchronously, so write processing slowed or timed out when load concentrated. Availability became unstable in spike situations such as sales. |
| Response | Rebuilt the flow as an asynchronous pipeline (Outbox → Pub/Sub → Indexer). In addition, adopted partial upserts so that only minimal diffs are applied, making updates lighter. |
| Outcome | Write latency stayed stable while updates continued to reach ElasticSearch with little delay. Inventory freshness and search reliability now coexist even under peak load. |
| Ripple effects | This asynchronous update method became a standard pattern within the company. Rolling it out horizontally to other APIs and projects produced an update foundation that combines load leveling with ease of extension, continuously reducing operational costs. |
Background / Issues
Service architecture and assumptions
The system is structured like a lodging reservation service, where listings are searched using composite conditions such as “date × number of people × price × amenities × location.”
Inventory and prices change daily, and during weekends or sales both “update events (inventory/price)” and “search requests” increase sharply at the same time.
Existing mechanism and problems at the time
Previously, inventory and price updates synchronously triggered index updates to ElasticSearch.
With this architecture, the following problems became apparent:
- DB locks and ElasticSearch update I/O overlapped, degrading write response times.
- When bulk update events concentrated, timeouts and retries cascaded.
Technical challenges to solve
- Optimization of partial updates: Structure the data so that only changed attributes such as price or inventory can be upserted as diffs.
- Load leveling: Design the system so that it can process large volumes of updates stably, even when many updates occur, such as at the start of a sale or during bulk calendar updates.
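The partial-update idea above can be sketched in a few lines. This is a minimal, hypothetical illustration (an in-memory dict standing in for the search index; function and field names are not from the production code): only the changed attributes are merged into the stored document instead of rebuilding it, which is the behavior ElasticSearch's `_update` API with `doc_as_upsert` provides.

```python
# Minimal sketch of a partial upsert: merge only the changed attributes
# into the stored document, creating the document if it does not exist.
# (Hypothetical in-memory store standing in for a search index.)

def partial_upsert(index: dict, doc_id: str, diff: dict) -> dict:
    """Apply only the changed fields (diff) to the document with doc_id."""
    doc = index.setdefault(doc_id, {})
    doc.update(diff)          # untouched attributes are left as-is
    return doc

index = {"listing-1": {"price": 12000, "stock": 3, "amenities": ["wifi"]}}

# A price change touches only the "price" field; stock and amenities survive.
partial_upsert(index, "listing-1", {"price": 9800})
print(index["listing-1"])   # {'price': 9800, 'stock': 3, 'amenities': ['wifi']}
```

Because the diff carries only what changed, the indexer never has to rebuild the full document, which is what keeps update load and reflection delay small.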
Business objectives
- Maintain user experience: Prevent delays and failures in inventory and price updates, keeping operations smooth.
- Reliability of the search experience: Ensure search results reflect the latest state, preventing sold-out items or incorrect displays.
Approach
Basic policy (core of peak load leveling)
- Eliminate synchronous updates and move to asynchrony
  We revisited the architecture in which the app updated ElasticSearch directly and introduced an Outbox → Pub/Sub → Indexer pipeline. Write processing now responds immediately, with ElasticSearch updates separated into asynchronous processing, so throughput stays stable even at peak times.
- Lightweight updates via partial upsert
  Only changed attributes such as price, inventory, and amenities are upserted as diffs. This avoids full reindexing and minimizes ElasticSearch update load and reflection delay.
- Maintain consistency via idempotency, last-write-wins, and reprocessing
  Each event carries an `event_id` and a `version`/`updated_at` to prevent duplicate application and out-of-order updates. We built in retry, DLQ, and backfill reprocessing so that eventual consistency is guaranteed even through temporary failures.
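The idempotency and last-write-wins guards can be sketched as follows. This is an illustrative sketch only (the event shape, `event_id`, and `version` fields mirror the description above, but the concrete names are hypothetical): duplicates are dropped by `event_id`, and stale out-of-order events are dropped by comparing `version`.

```python
# Sketch of idempotent, last-write-wins event application in the Indexer.
# event_id is the dedup key; version is the ordering key. Shapes are
# illustrative, not the production schema.

applied_event_ids: set[str] = set()
current_versions: dict[str, int] = {}   # entity_id -> last applied version
index: dict[str, dict] = {}             # entity_id -> indexed document

def apply_event(event: dict) -> bool:
    """Apply an update event unless duplicate or stale. Returns True if applied."""
    if event["event_id"] in applied_event_ids:
        return False                     # duplicate delivery: skip (idempotency)
    if event["version"] <= current_versions.get(event["entity_id"], -1):
        applied_event_ids.add(event["event_id"])
        return False                     # out-of-order / stale: last write wins
    index.setdefault(event["entity_id"], {}).update(event["diff"])
    current_versions[event["entity_id"]] = event["version"]
    applied_event_ids.add(event["event_id"])
    return True

# Duplicate and out-of-order deliveries converge to the same final state.
apply_event({"event_id": "e2", "entity_id": "a", "version": 2, "diff": {"price": 9800}})
apply_event({"event_id": "e1", "entity_id": "a", "version": 1, "diff": {"price": 12000}})  # stale
apply_event({"event_id": "e2", "entity_id": "a", "version": 2, "diff": {"price": 9800}})   # dup
print(index["a"])   # {'price': 9800}
```

Under these two guards, any interleaving of retries and redeliveries converges to the state of the highest version per entity, which is what makes the DLQ/backfill reprocessing safe to run repeatedly.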
Operations and monitoring policy
- Phased release
  Introduce the new flow step by step for each update type (e.g., bulk calendar updates), control it with feature flags, and observe load characteristics.
- Strengthened observability
  Continuously monitor freshness from event occurrence to ElasticSearch reflection, queue backlog, and convergence of reprocessing.
- Ensuring scalability
  Design on the assumption that Pub/Sub and the Indexer can scale horizontally and that the ElasticSearch shard configuration can be expanded, to absorb sudden traffic spikes.
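The per-update-type feature flags used for the phased release can be sketched like this (flag names and the routing function are hypothetical): each update type is routed to the new asynchronous path only when its flag is on, so migration can proceed one update type at a time.

```python
# Sketch of flag-controlled phased release: each update type is routed to
# the new asynchronous pipeline only when its feature flag is enabled.
# Flag names and pipeline labels are hypothetical.

ASYNC_FLAGS = {
    "calendar_bulk_update": True,   # migrated first to observe load characteristics
    "inventory_update": False,      # still on the synchronous path
    "listing_update": False,
}

def route_update(update_type: str) -> str:
    """Return which pipeline handles this update type."""
    if ASYNC_FLAGS.get(update_type, False):
        return "outbox->pubsub->indexer"   # new asynchronous pipeline
    return "synchronous-index-update"      # legacy path (safe default)

print(route_update("calendar_bulk_update"))  # outbox->pubsub->indexer
print(route_update("inventory_update"))      # synchronous-index-update
```

Defaulting unknown types to the legacy path keeps the rollout conservative: nothing moves to the new pipeline unless explicitly flagged.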
System architecture (Before)
System architecture (After)
Investigation and measurement phase
Objectives (What to prove)
- Write-side response times remain stably maintained even at peak times.
- Reflections to the search index continue to complete while maintaining a certain level of freshness.
- Any out-of-order, duplicate, or missing events that may occur due to asynchronous processing will still converge to a consistent state, including via reprocessing.
Metrics design (definitions and measurement methods)
- Write latency: Measure inside the app plus APM traces to obtain the latency distribution (p50/p95/p99) of requests.
- Reflection freshness (event → ElasticSearch reflection)
  - Definition: `index_lag = t(indexed_at) - t(event_occurred_at)`
  - Collection: Record `occurred_at` in the Outbox and `indexed_at` when the Indexer completes, then correlate them per entity ID.
- Quality (accuracy)
- Update events are processed without duplication, out-of-order, or omission.
- Even when failures occur, reprocessing brings the system to eventual consistency.
- The event state (received, processed, reflected) is traceable across the entire processing path.
- Operational health: Visualize queue backlog length, consumption rate, DLQ count, reprocessing success rate, and retry count distribution.
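The freshness metric above can be computed with a simple join on entity ID. This is a sketch under assumed record shapes (plain epoch-second floats; the field names `occurred_at` and `indexed_at` follow the definitions above, everything else is illustrative):

```python
# Sketch of the freshness metric: index_lag = indexed_at - occurred_at,
# correlated per entity ID. Timestamps are plain floats (epoch seconds);
# record shapes are illustrative.

def index_lags(outbox_records: list[dict], indexer_records: list[dict]) -> dict[str, float]:
    """Correlate occurred_at (Outbox) with indexed_at (Indexer) per entity ID."""
    occurred = {r["entity_id"]: r["occurred_at"] for r in outbox_records}
    return {
        r["entity_id"]: r["indexed_at"] - occurred[r["entity_id"]]
        for r in indexer_records
        if r["entity_id"] in occurred
    }

outbox = [{"entity_id": "a", "occurred_at": 100.0}, {"entity_id": "b", "occurred_at": 105.0}]
indexed = [{"entity_id": "a", "indexed_at": 101.5}, {"entity_id": "b", "indexed_at": 112.0}]

lags = index_lags(outbox, indexed)
print(lags)                 # {'a': 1.5, 'b': 7.0}
print(max(lags.values()))   # 7.0 -> worst-case freshness, feeds dashboards/alerts
```

In practice the per-entity lags would be aggregated into a distribution (p50/p95/p99) the same way as write latency, so freshness regressions show up alongside latency regressions.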
Baseline measurement (Before)
- Items collected:
- Write latency and timeout rate when reflecting synchronously
- Index reflection freshness with synchronous reflection (effectively immediate) and failure behavior under spikes
Load model (mimicking realistic usage)
- Write events:
- Moderate update frequency during normal times, increasing several-fold at peak times such as sales or the start of weekends.
- Breakdown: price calendar updates are the majority (about 60%), inventory reservation/release (about 30%), listing information changes (about 10%).
- Search traffic:
- Increases together with writes (weekend/holiday scenarios).
Migration work
Gradual release
- Internally, we began appending events to the Outbox at DB update time, and used those events to have the new asynchronous path (Indexer) update a validation ElasticSearch cluster.
- For a certain period, the old path (synchronous updates) ran in parallel, and we compared and monitored the reflection results of both.
Switching the operations and monitoring setup
- Visualized Pub/Sub, Indexer, and DLQ in monitoring tools, and created dashboards for processing delay (lag), backlog count, and failure rate.
- Also monitored reflection delay on the ElasticSearch side and set up automatic notifications (alert policies) when thresholds were exceeded.
Feeding back verification results
- Identified diffs to determine which attributes (price or inventory) had more discrepancies
- Tuned the update logic accordingly
- After confirming that index reflection accuracy had stabilized, moved to full production use
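The diff identification step can be sketched as a per-attribute comparison between the two indexes (old synchronous vs. validation asynchronous). The document shapes here are hypothetical; the point is counting mismatches per attribute so you can see whether price or inventory drifts more:

```python
# Sketch of the cross-check between the old synchronous index and the
# validation (asynchronous) index: count, per attribute, how many documents
# disagree. Document shapes are illustrative.
from collections import Counter

def attribute_diffs(old_index: dict, new_index: dict) -> Counter:
    """Count mismatching attributes across documents present in both indexes."""
    diffs = Counter()
    for doc_id in old_index.keys() & new_index.keys():
        old_doc, new_doc = old_index[doc_id], new_index[doc_id]
        for attr in old_doc.keys() | new_doc.keys():
            if old_doc.get(attr) != new_doc.get(attr):
                diffs[attr] += 1
    return diffs

old = {"l1": {"price": 9800, "stock": 3}, "l2": {"price": 15000, "stock": 0}}
new = {"l1": {"price": 9800, "stock": 2}, "l2": {"price": 15000, "stock": 0}}

print(attribute_diffs(old, new))   # Counter({'stock': 1})
```

A per-attribute breakdown like this is what lets the tuning effort target the worst-drifting field first instead of reworking the whole update path.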
Release completion criteria
- Stop the synchronous path for all update types and fully migrate to the asynchronous pipeline
Results
Quantitative outcomes
Performance (stabilizing writes)
- Spikes (latency increases under high load) that occurred with synchronous updates were eliminated.
- Even during sales and weekend peaks, write API response times remained stable within a certain range.
- Timeout and retry rates decreased significantly, preserving the user operation experience.
Availability and throughput (peak load leveling)
- By offloading ElasticSearch update processing to the asynchronous path, the app side is no longer burdened by its CPU and I/O load.
- Queue processing throughput can be scaled out dynamically, so temporary traffic concentration is handled without processing delays.
- DLQ and reprocessing jobs run stably, achieving rapid self-recovery.
Accuracy (maintaining consistency)
- Idempotent processing and last-write-wins control via `version` / `updated_at` worked as intended, and misupdates due to out-of-order or duplicate application converged to zero.
- Cross-checks with the old (synchronous) path showed that the difference rate was extremely low and stable.
Qualitative outcomes
Improved development and operations experience
- Increased operational peace of mind
  Separating ElasticSearch updates from the core app significantly reduced alert frequency at peak times. Concerns such as "the DB might be clogged" were alleviated, allowing the team to spend more time on improvements rather than incident response.
- Psychological safety through phased release
  By running the synchronous and asynchronous paths in parallel and visualizing reflection diffs during the switchover, a shared understanding took hold across the organization that "we migrate only after understanding the risks."
Improved user experience and operational efficiency
- Stable operation response
  The perceived immediacy of price and inventory updates was maintained, enabling users to confidently and frequently adjust prices and manage inventory from the admin console.
- Reduced support load
  Fewer update delays and reflection failures meant fewer inquiries to the support team, achieving both lower operational costs and improved customer satisfaction.
Ripple effects on the organization and technical foundation
- Standardization of asynchronous processing design principles within the company
  Triggered by this initiative, the Outbox → Pub/Sub → Worker architecture was rolled out to other update APIs (notifications, email sending, aggregation processing, etc.). The architectural principle of "not holding updates synchronously in the app" is now shared and reused across teams.
Future developments
The Outbox → Pub/Sub → Worker model established here can be applied to asynchronous processing not only for index updates but also for the following:
- Cache updates and CDN purge
- Notifications (email / SMS / push)
- External API integrations (payments, CRM, MA, etc.)
- Ingestion of aggregation/analytics events
- Image processing and thumbnail generation
- Collection of audit logs and activity logs
- Generation of suggestion dictionaries and auxiliary data
- ML feature updates
- Gradual state transitions (e.g., order workflows)
By progressively offloading “areas where synchronous processing can become a source of delay” to queues, we can promote long-term, system-wide load leveling and availability improvements.