Maintain inventory consistency and sales opportunities by automatically releasing expired reservations

Published on 2022/09/13

2023/01/18

This post is also available in 日本語.

Summary

Perspective	Details
Issue	Some reservations were never released, leaving inventory that “appeared fully booked.” This caused lost sales opportunities and forced staff to release reservations manually.
Response	Introduced automatic release of expired reservations using Redis TTL (expiration) and background jobs. ・ A unique key prevents duplicate processing of the same reservation. ・ Redis acts only as the trigger; the actual inventory updates are handled by the DB.
Visualization	Turned the number of pending reservations and the elapsed time until release into metrics and monitored them continuously.
Outcome	Unreleased reservations dropped to almost zero. Inventory is now updated quickly and accurately, preventing lost sales opportunities.
Effect	Reservation API responses became more stable and lock waits decreased. Manual release work became almost unnecessary.

Background and issues
In the reservation system, “ghost reservations” (reservations that keep occupying inventory without a completed payment) had become a problem.

When users abandoned items or seats left in their cart, or when an error occurred during payment, the inventory stayed in a “held” state and reduced the number of slots available for sale.
The impact was most pronounced during sales and event peaks, when:
・ The listing showed “fully booked” even though slots were actually available
・ Returning customers could not make a reservation and left
・ The operations team had to release pending reservations manually from the admin screen, creating a high operational load
The result was a triple loss: lost sales opportunities, worse inventory turnover, and lower customer satisfaction.

On the other hand, an overly aggressive automatic release risked mistakenly deleting the reservations of users who were actually in the middle of payment, so the two design challenges were “designing the release timing” and “safe monitoring and detection.”
Investigation and measurement phase
This phase investigated two points in particular:
・ How often ghost reservations occurred and how they affected inventory and sales opportunities.
・ What TTL (reservation hold time) would be appropriate as the basis for automatic release.
First, analysis of the current system’s reservation data showed many cases where reservations without a completed payment lingered for a long time, making inventory appear lower than it actually was.
Two patterns were particularly prominent:
・ Users left during payment and never returned
・ The payment API failed due to a communication error and the payment was never retried
Both were confirmed by cross-checking application logs against payment logs.

In addition, there was no mechanism to automatically detect pending reservations, so staff had to delete them manually from the admin screen, resulting in a high operational burden.
At the same time, to consider the initial TTL setting, we analyzed production data such as the time required from provisional reservation to payment completion and the return rate after abandonment.

We visualized “how long it takes for most users to come back” and used this as the basis for deciding the initial TTL value.
This phase achieved two objectives:
・ Grasped the actual state of ghost reservations and clarified how badly a release mechanism was needed.
・ Obtained the baseline data needed to set an appropriate TTL.
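
As an illustration of turning “how long most users take” into an initial TTL, the sketch below picks a percentile of observed hold-to-payment durations. The function name, the sample durations, the percentile, and the floor value are all hypothetical; the post does not state which statistic or value was actually used.

```python
from statistics import quantiles

def suggest_ttl_seconds(durations_sec, percentile=95, floor_sec=60):
    """Return a hold time that covers `percentile`% of observed checkouts."""
    if len(durations_sec) < 2:
        raise ValueError("need at least two samples")
    if not 1 <= percentile <= 99:
        raise ValueError("percentile must be between 1 and 99")
    # quantiles(n=100) returns the 1st..99th percentile cut points.
    cuts = quantiles(durations_sec, n=100, method="inclusive")
    # Never go below a minimum hold time, however fast most users are.
    return max(floor_sec, round(cuts[percentile - 1]))

# Hypothetical sample: seconds from provisional hold to payment completion.
samples = [45, 60, 75, 90, 120, 150, 180, 240, 300, 600]
ttl = suggest_ttl_seconds(samples, percentile=90)
```

A higher percentile protects more slow payers at the cost of holding inventory longer, which is the trade-off between erroneous release and ghost reservations described above.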
Design and implementation phase
Objective: Design a mechanism that automatically detects and releases ghost reservations, minimizing lost sales opportunities.
Architecture design
・ Separate the Reservation API, DB, Redis, job workers, and monitoring platform into distinct components
・ Combine TTL management with event-driven automatic release



Component	Role	Notes
Reservation API (App layer)	Accepts reservation operations from users. Handles synchronous processing such as securing reservations, starting payments, and cancellations.	Prioritizes immediate response. Called directly from outside.
DB (Persistent layer)	Stores reservation headers/details/inventory in normalized form. Guarantees state transitions with consistent transactions.	MySQL, etc. Center of consistency.
Redis (Cache/TTL layer)	Handles temporary reservation holds (TTL keys), inventory counters, and job queue control.	Used for “expiration detection” and “lightweight locks.”
Job workers (Async layer)	Processes asynchronous tasks such as TTL expiration, payment failures, and expiration releases.	Scheduled with periodic jobs + delayed queues.
Monitoring platform (Observability layer)	Visualizes pending counts, processing time, and erroneous release rate.	Basis for alerts, analysis, and operational improvement.

Pattern ①: Normal flow where reservation and payment complete successfully
User makes a reservation → inventory is confirmed upon payment completion → TTL key is deleted.
Pattern ②: Payment not completed (ghost reservation) → recovered by automatic release job
An unpaid reservation triggers an event when its TTL expires → the job returns the inventory.
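
A minimal sketch of the release step in Pattern ②, assuming the expiry event only carries a reservation ID and the DB decides whether anything is actually released. It uses an in-memory SQLite database purely so the example is self-contained (the post mentions MySQL); the table layout, column names other than hold_expires_at, and the choice to release both HOLD and PAYING rows are assumptions.

```python
import sqlite3

def release_if_expired(conn, reservation_id, now):
    """Return held inventory for one reservation; safe to call repeatedly."""
    with conn:  # one transaction: commit on success, roll back on error
        cur = conn.execute(
            "UPDATE reservations SET status = 'EXPIRED' "
            "WHERE id = ? AND status IN ('HOLD', 'PAYING') "
            "AND hold_expires_at <= ?",
            (reservation_id, now),
        )
        if cur.rowcount != 1:
            return False  # already released, confirmed, or not yet expired
        # Only the call that flipped the status gives the quantity back.
        conn.execute(
            "UPDATE inventory SET available = available + "
            "(SELECT quantity FROM reservations WHERE id = ?) "
            "WHERE item_id = (SELECT item_id FROM reservations WHERE id = ?)",
            (reservation_id, reservation_id),
        )
        return True

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE inventory (
        item_id INTEGER PRIMARY KEY, available INTEGER NOT NULL);
    CREATE TABLE reservations (
        id INTEGER PRIMARY KEY, item_id INTEGER NOT NULL,
        quantity INTEGER NOT NULL, status TEXT NOT NULL,
        hold_expires_at INTEGER NOT NULL);
    INSERT INTO inventory VALUES (1, 0);
    INSERT INTO reservations VALUES (1234, 1, 2, 'HOLD', 1000);
""")

first = release_if_expired(conn, 1234, now=1001)   # expiry event fires
second = release_if_expired(conn, 1234, now=1001)  # same event delivered again
```

Because the status check and the status change happen in one conditional UPDATE, a duplicated or late trigger becomes a no-op rather than a second inventory return.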
Data management
・ Each reservation has an expiration (hold_expires_at), and a unique key per reservation prevents duplicate processing.
・ States follow the flow HOLD → PAYING → CONFIRMED / EXPIRED.
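
The state flow can be sketched as an explicit transition table. The edges below are read off the flow in the text; allowing HOLD to expire directly and leaving out any edge from EXPIRED (for example for a late payment success) are assumptions made for this illustration.

```python
from enum import Enum

class ReservationState(Enum):
    HOLD = "HOLD"
    PAYING = "PAYING"
    CONFIRMED = "CONFIRMED"
    EXPIRED = "EXPIRED"

# Allowed next states for each current state (assumed edges, see lead-in).
ALLOWED = {
    ReservationState.HOLD: {ReservationState.PAYING, ReservationState.EXPIRED},
    ReservationState.PAYING: {ReservationState.CONFIRMED, ReservationState.EXPIRED},
    ReservationState.CONFIRMED: set(),
    ReservationState.EXPIRED: set(),
}

def transition(current, target):
    """Return the new state, or raise if the move is not in the table."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target

state = transition(ReservationState.HOLD, ReservationState.PAYING)
state = transition(state, ReservationState.CONFIRMED)
```

Rejecting moves that are not in the table is what keeps a release job from expiring a reservation that has already been confirmed.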
Monitoring and alert design
Measure pending count, release job latency, lock wait time, and erroneous release rate
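
As a rough sketch of how two of these metrics could be derived from reservation records, assuming each record carries its status, hold_expires_at, and the time the release job actually ran. The record type, the released_at field, and the metric definitions are hypothetical; the post does not say how the metrics were computed or exported.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class HoldRecord:
    status: str                          # HOLD, PAYING, CONFIRMED, or EXPIRED
    hold_expires_at: float               # epoch seconds
    released_at: Optional[float] = None  # set when the release job ran

def pending_count(records):
    """Reservations still holding inventory without a completed payment."""
    return sum(1 for r in records if r.status in ("HOLD", "PAYING"))

def overdue_count(records, now):
    """Pending reservations that are past their expiry but not yet released."""
    return sum(1 for r in records
               if r.status in ("HOLD", "PAYING") and r.hold_expires_at < now)

def max_release_latency(records):
    """Worst-case seconds between expiry and the actual release."""
    latencies = [r.released_at - r.hold_expires_at
                 for r in records if r.released_at is not None]
    return max(latencies, default=0.0)

records = [
    HoldRecord("HOLD", hold_expires_at=900.0),
    HoldRecord("PAYING", hold_expires_at=1100.0),
    HoldRecord("EXPIRED", hold_expires_at=800.0, released_at=803.5),
    HoldRecord("CONFIRMED", hold_expires_at=700.0),
]
```

An overdue count that stays above zero is the signal that the release path is lagging, which is the case an alert would need to catch.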
Verification
Objective
Confirm that the automatic release job works correctly and reliably returns inventory.

Emphasis was placed on avoiding erroneous releases and on ensuring that the overall system keeps running even when delays or lock contention occur.
Verification items
・ Correctness: Whether inventory is updated correctly across state transitions such as reservation, cancellation, and incomplete payment.
・ Consistency: Whether double booking or overselling occurs.
・ Stability: Whether processing delays and lock contention at peak times remain within acceptable ranges.
・ Availability: Whether the system ultimately recovers correctly based on the DB even if Redis or the jobs temporarily stop.
・ Observability: Whether pending counts and delays can be checked on dashboards and notifications are sent when anomalies occur.
Verification methods
・ Executed scenarios for both normal and abnormal cases (abandonment, payment delay, cancellation, fault injection) and checked reservation states against inventory counts.
・ Used load testing tools to reproduce loads close to actual traffic volume, and confirmed that automatic release did not lag and consistency was maintained even under high load.
・ Confirmed that inventory is returned via both the Redis TTL path and the DB scan path.
・ Confirmed empirically that idempotency keys prevent duplicates and that the same event is never processed more than once.
Results
・ Confirmed that no erroneous releases or overselling occurred and that inventory was returned automatically and correctly.
・ Job delays and lock contention did occur, but they had no impact on the overall system, and metrics on the monitoring dashboard remained within normal ranges.
Results and outcomes
Quantitative results
・ Ghost reservations: Long-lasting reservations were almost completely eliminated by automatic release.
・ Inventory consistency: Differences between the screen display and actual inventory almost disappeared.
・ Response stability: Reservation API responses remained stable even at peak times, and processing delays became less likely.
・ Operational effort: Manual inventory release work became almost unnecessary.
Qualitative effects
・ Recovery of sales opportunities: “Apparent full booking” of inventory was eliminated, reducing lost sales.
・ Enhanced visualization: Pending counts and release timing can now be checked on dashboards at any time.
References
Comparison of TTL management methods
This case used Redis for reservation holds, but when Redis cannot be introduced or persistence takes priority, the following two methods are typical alternatives.
Pattern A: DB-driven TTL management (periodic job scan)
Configuration
・ Add hold_expires_at to the reservation table.
・ A batch/job worker periodically searches with WHERE NOW() > hold_expires_at and releases expired reservations.
Advantages
・ No Redis required; consistency can be guaranteed with persistent data only.
・ State does not disappear after restarts or failure recovery.
Disadvantages
・ Release is delayed by up to one batch interval.
・ Scanning large amounts of data increases DB load at peak times.
・ Difficult to “immediately resume sales right after expiration.”
Use cases
・ Cases where reliability matters more than immediacy, such as low-frequency B2B reservations rather than tickets or hotels.
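
A sketch of one scan pass for Pattern A, again using in-memory SQLite only for self-containment. The batch size, the status values in the filter, and the release_fn callback (which stands in for whatever actually returns the inventory) are assumptions for illustration.

```python
import sqlite3

def find_expired(conn, now, batch_size=100):
    """IDs of unpaid reservations whose hold has expired, oldest first."""
    rows = conn.execute(
        "SELECT id FROM reservations "
        "WHERE status IN ('HOLD', 'PAYING') AND hold_expires_at < ? "
        "ORDER BY hold_expires_at LIMIT ?",
        (now, batch_size),
    ).fetchall()
    return [row[0] for row in rows]

def run_scan(conn, now, release_fn, batch_size=100):
    """Run one bounded pass; release_fn(reservation_id) returns True on release."""
    released = 0
    for reservation_id in find_expired(conn, now, batch_size):
        if release_fn(reservation_id):
            released += 1
    return released

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE reservations (
        id INTEGER PRIMARY KEY, status TEXT NOT NULL,
        hold_expires_at INTEGER NOT NULL);
    INSERT INTO reservations VALUES (1, 'HOLD', 900);       -- expired, unpaid
    INSERT INTO reservations VALUES (2, 'HOLD', 1500);      -- still within hold
    INSERT INTO reservations VALUES (3, 'CONFIRMED', 800);  -- already paid
""")

expired_ids = find_expired(conn, now=1000)
```

Limiting each pass to a batch keeps the scan's DB load bounded at peak times, and an index covering status and hold_expires_at would typically be needed for the query to stay cheap on a large table.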
Pattern B: Job-queue-driven delayed execution
Configuration
・ Register a “release job” in the queue at the time of the provisional hold.
・ Example: enqueue(ReleaseJob, delay: 15.minutes, reservation_id: 1234)
・ The message queue (SQS, RabbitMQ, Sidekiq, etc.) fires the job after 15 minutes.
Advantages
・ Does not depend on Redis key expiration (note that Sidekiq itself is backed by Redis) and can leverage the job system’s retry guarantees and persistence.
・ Each release is scheduled individually, so no DB scan is required.
Disadvantages
・ Delay precision of message queues is on the order of seconds, slightly slower than Redis.
・ If many queues are generated, monitoring and retry management become complex.
Use cases
・ Systems already built around a job-driven architecture
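
To make the delayed-execution idea concrete, here is a toy in-process delay queue. It is not how SQS, RabbitMQ, or Sidekiq work internally and has none of their persistence or retry guarantees; the class and method names are hypothetical, and time is passed in explicitly so the behavior is deterministic.

```python
import heapq
import itertools

class DelayedQueue:
    """Toy delay queue: jobs become runnable once their due time has passed."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker for equal due times

    def enqueue(self, job, delay_sec, now, **kwargs):
        due = now + delay_sec
        heapq.heappush(self._heap, (due, next(self._seq), job, kwargs))

    def run_due(self, now):
        """Run every job whose due time is at or before `now`; return the count."""
        ran = 0
        while self._heap and self._heap[0][0] <= now:
            _, _, job, kwargs = heapq.heappop(self._heap)
            job(**kwargs)
            ran += 1
        return ran

released = []

def release_job(reservation_id):
    # A real job would re-check the reservation state in the DB first,
    # because the payment may have completed during the delay.
    released.append(reservation_id)

queue = DelayedQueue()
queue.enqueue(release_job, delay_sec=15 * 60, now=0, reservation_id=1234)
ran_early = queue.run_due(now=899)   # one second before the hold ends
ran_on_time = queue.run_due(now=900)
```

The job is enqueued unconditionally at hold time, so correctness depends on the job itself being a no-op for reservations that were paid or cancelled in the meantime.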
Comparison

Item	Redis TTL method	DB scan method	Delayed job method
Immediacy	◎ (almost real-time)	△ (depends on batch interval)	○ (seconds to tens of seconds)
Persistence	△ (volatile)	◎	◎
Structural simplicity	○	○	△ (depends on job system)
Load characteristics	Low (memory)	High (I/O-intensive)	Medium
Failure recovery	Regeneration required	Persisted	Retry guarantees
Suitable cases	High-frequency, immediacy-oriented	Reliability/history-oriented	Event-driven/distributed environments

Handling payment Webhook failures
Webhooks from payment providers can be delayed, duplicated, delivered out of order, or retried. The patterns below summarize how consistency is maintained in each case.
Duplicate delivery
Payment success arrives shortly after TTL expiration
A payment success that arrives immediately after TTL expiration is treated as an “update within the grace period”: release is skipped and confirmation takes priority.
Success is significantly delayed after TTL expiration
A payment that succeeds well after expiration does not confirm the reservation; instead, it is routed to refund or re-reservation guidance.
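
The two timing cases can be sketched as a single classification by arrival time. The 30-second grace period and the returned labels are hypothetical; the post does not state how long the grace period is.

```python
def classify_payment_success(received_at, hold_expires_at, grace_sec=30):
    """Decide how to treat a payment success based on when it arrived."""
    if received_at <= hold_expires_at:
        return "CONFIRM"                # normal case: still within the hold
    if received_at <= hold_expires_at + grace_sec:
        return "CONFIRM_WITHIN_GRACE"   # skip release, prioritize confirmation
    return "REFUND_OR_REBOOK"           # too late: do not confirm

on_time = classify_payment_success(received_at=990, hold_expires_at=1000)
just_late = classify_payment_success(received_at=1010, hold_expires_at=1000)
far_late = classify_payment_success(received_at=1600, hold_expires_at=1000)
```

The grace period only helps if the release job honors it as well, for example by not releasing a reservation until the same window has passed.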
Out-of-order events (failure arrives first, success later)
In payment systems, payment_failed can actually arrive first and payment_succeeded later, after a delay.

In this case, the later event does not automatically win. The reservation transitions to confirmed only when both of the following conditions are met:
・ payment_succeeded.occurred_at is no later than hold_expires_at
・ The relevant inventory has not yet been secured by another reservation (inventory is still available)
If either condition is not met, the reservation remains EXPIRED and the success event is routed to refund or re-reservation guidance.

This prevents erroneous confirmations and overbooking caused by out-of-order events.
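
The two conditions can be sketched as a guard in the success handler. The Reservation type, the inventory_available flag (assumed to be looked up by the caller inside the same DB transaction), and the return labels are hypothetical names for illustration.

```python
from dataclasses import dataclass

@dataclass
class Reservation:
    status: str             # HOLD, PAYING, CONFIRMED, or EXPIRED
    hold_expires_at: float  # epoch seconds

def on_payment_succeeded(reservation, occurred_at, inventory_available):
    """Confirm only if the payment happened in time and stock is still free."""
    if reservation.status == "CONFIRMED":
        return "NOOP"  # duplicate success event, nothing to do
    in_time = occurred_at <= reservation.hold_expires_at
    if in_time and inventory_available:
        reservation.status = "CONFIRMED"
        return "CONFIRMED"
    # Either condition failed: leave the state as it is and hand the
    # payment over to refund or re-reservation guidance.
    return "REFUND_OR_REBOOK"

# payment_failed was processed first and the hold was released (EXPIRED).
late_but_valid = Reservation(status="EXPIRED", hold_expires_at=1000)
result_ok = on_payment_succeeded(late_but_valid, occurred_at=995,
                                 inventory_available=True)

taken = Reservation(status="EXPIRED", hold_expires_at=1000)
result_taken = on_payment_succeeded(taken, occurred_at=995,
                                    inventory_available=False)
```

The decision uses the time the payment occurred rather than the time the event arrived, so a delayed delivery alone does not cost the user a reservation they paid for in time.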
User cancels before payment
Note: Returning 200 OK to the Webhook means “this event does not need to be resent”; it is not the refund process itself. Refunds and invalidations proceed separately in the background as business processes, via queue registration or DB records.
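
A minimal sketch of that separation, assuming each event carries a unique "id" field: the receiver records the event, queues it for background processing, and acknowledges with 200 regardless of what the business outcome will be. The class name, the event shape, and the in-memory set and queue are hypothetical; a production system would keep both in durable storage.

```python
from collections import deque

class WebhookReceiver:
    """Acknowledge quickly; leave refunds and invalidations to background work."""

    def __init__(self):
        self.seen_event_ids = set()  # stand-in for a unique-key table
        self.queue = deque()         # stand-in for a durable job queue

    def handle(self, event):
        event_id = event["id"]
        if event_id not in self.seen_event_ids:
            self.seen_event_ids.add(event_id)
            self.queue.append(event)  # processed later by a worker
        # 200 tells the provider not to resend; it says nothing about refunds.
        return 200

receiver = WebhookReceiver()
event = {"id": "evt_001", "type": "payment_succeeded", "reservation_id": 1234}
first_status = receiver.handle(event)
second_status = receiver.handle(event)  # provider retried the same event
```

Deduplicating on the event ID at the door is also what makes duplicate deliveries harmless for the rest of the pipeline.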

