Maintain inventory consistency and sales opportunities by automatically releasing expired reservations
Summary
| Perspective | Details |
|---|---|
| Issue | Some reservations were not released and remained, causing a state where inventory “appeared fully booked.” This led to lost sales opportunities and required staff to manually release reservations. |
| Response | Introduced a mechanism to automatically release expired reservations using Redis TTL (expiration) and background jobs. ・ Controlled with a unique key to prevent duplicate processing of the same reservation. ・ Redis acts as the trigger, while the actual inventory updates are handled by the DB. |
| Visualization | Turned the number of pending reservations and the elapsed time until release into metrics, and constantly monitored behavior. |
| Outcome | Unreleased reservations were improved to almost zero. Inventory is now updated quickly and accurately, preventing loss of sales opportunities. |
| Effect | Reservation API responses became more stable and lock waits decreased. Manual release work became almost unnecessary. |
Background and issues
In the reservation system, “reservations that continue to occupy inventory without completed payment,” so‑called “ghost reservations,” had become a problem.
Even when users left items or seats in their cart and abandoned it, or when an error occurred during payment, the inventory remained in a “held” state, reducing the number of slots available for sale.
This impact was particularly pronounced during sales and event sales peaks, where:
- Slots showed as “fully booked” even though some were actually available
- Returning customers could not make reservations and left
- The operations team had to manually release pending reservations via the admin screen, creating a high operational load
As a result, there was a triple loss: lost sales opportunities, deterioration of inventory turnover, and decreased customer satisfaction.
On the other hand, if automatic release was too aggressive, there was a risk of mistakenly deleting reservations of users who were actually in the middle of payment, so “designing the release timing” and “safe monitoring and detection” were challenges.
Investigation and measurement phase
In this phase, two points were investigated in particular:
- To what extent ghost reservations were occurring and how they were affecting inventory and sales opportunities.
- What an appropriate TTL (reservation hold time) would be as the basis for automatic release.
First, analysis of the current system’s reservation data showed that many cases existed where reservations without completed payment remained for a long time, making inventory appear lower than it actually was.
Two particularly prominent patterns were:
- Cases where users left during payment and never returned
- Cases where the payment API failed due to communication errors and remained without being retried
These were confirmed by cross‑checking application logs and payment logs.
In addition, there was no mechanism to automatically detect pending reservations, so staff had to delete them manually from the admin screen, resulting in a high operational burden.
At the same time, to consider the initial TTL setting, we analyzed production data such as the time required from provisional reservation to payment completion and the return rate after abandonment.
We visualized “how long it takes for most users to come back” and used this as a basis for deciding the initial TTL value.
Through this, we achieved two objectives:
- Grasped the actual state of ghost reservations and clarified how much a release mechanism was needed.
- Obtained basic data to set an appropriate TTL.
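As an illustration, the initial TTL can be derived from the observed hold-to-payment durations. The sketch below picks a high percentile of those durations so that the TTL covers most legitimate payment flows while cutting off the abandonment tail; the sample data and function name are hypothetical stand-ins for the production analysis.

```python
from statistics import quantiles

def suggest_ttl_minutes(minutes_to_payment, percentile=95):
    """Return the given percentile of hold-to-payment durations.

    Setting the TTL near a high percentile keeps nearly all paying
    users inside the hold window (illustrative, not the real analysis).
    """
    # quantiles(n=100) yields the 1st..99th percentile cut points.
    return quantiles(sorted(minutes_to_payment), n=100)[percentile - 1]

# Hypothetical sample (minutes): most users pay quickly, a tail lingers.
sample = [2, 3, 3, 4, 5, 5, 6, 7, 8, 9, 10, 12, 14, 18, 25, 40]
ttl_minutes = suggest_ttl_minutes(sample)
```

In practice the chosen percentile trades lost sales (TTL too long) against erroneously released reservations (TTL too short), which is why the return-rate data mattered alongside the payment durations.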
Design and implementation phase
Objective: Design a mechanism to automatically detect and release ghost reservations, minimizing loss of sales opportunities.
- Architecture design
- Separate configuration of Reservation API, DB, Redis, job workers, and monitoring platform
- Combine TTL management with event‑driven automatic release
| Component | Role | Notes |
|---|---|---|
| Reservation API (App layer) | Accepts reservation operations from users. Handles synchronous processing such as securing reservations, starting payments, and cancellations. | Prioritizes immediate response. Called directly from outside. |
| DB (Persistent layer) | Stores reservation headers/details/inventory in normalized form. Guarantees state transitions with consistent transactions. | MySQL, etc. Center of consistency. |
| Redis (Cache/TTL layer) | Handles temporary reservation holds (TTL keys), inventory counters, and job queue control. | Used for “expiration detection” and “lightweight locks.” |
| Job workers (Async layer) | Processes asynchronous tasks such as TTL expiration, payment failures, and expiration releases. | Scheduled with periodic jobs + delayed queues. |
| Monitoring platform (Observability layer) | Visualizes pending counts, processing time, and erroneous release rate. | Axis for alerts, analysis, and operational improvement. |
Pattern ①: Normal flow where reservation and payment complete successfully
User makes a reservation → inventory is confirmed upon payment completion → TTL key is deleted.
Pattern ②: Payment not completed (ghost reservation) → recovered by automatic release job
Reservations that are not paid trigger an event when TTL expires → job returns inventory.
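The two patterns above can be sketched as a small model. This is a minimal simulation, not the production code: the Redis TTL keys are replaced by an in-memory dict, the clock is injected so expiry can be tested deterministically, and all names are illustrative.

```python
import time

class ReservationSystem:
    """Sketch of pattern ① (confirm deletes the TTL key) and
    pattern ② (a sweep job releases expired holds)."""

    def __init__(self, slots, ttl_sec, clock=time.monotonic):
        self.available = slots
        self.ttl = ttl_sec
        self.clock = clock
        self.holds = {}    # reservation_id -> expiry (simulated TTL key)
        self.status = {}   # reservation_id -> HOLD / CONFIRMED / EXPIRED

    def reserve(self, rid):
        if self.available <= 0:
            return False
        self.available -= 1
        self.holds[rid] = self.clock() + self.ttl   # SETEX equivalent
        self.status[rid] = "HOLD"
        return True

    def confirm_payment(self, rid):
        # Pattern ①: payment completed -> delete TTL key, confirm.
        if self.holds.pop(rid, None) is not None:
            self.status[rid] = "CONFIRMED"

    def sweep_expired(self):
        # Pattern ②: TTL expiry triggers the job that returns inventory.
        now = self.clock()
        for rid in [r for r, exp in self.holds.items() if exp <= now]:
            del self.holds[rid]
            self.status[rid] = "EXPIRED"
            self.available += 1

# Fake clock so the test can fast-forward past the TTL.
now = [0.0]
rs = ReservationSystem(slots=1, ttl_sec=900, clock=lambda: now[0])

rs.reserve("r1")
now[0] = 901.0          # TTL elapses without payment (pattern ②)
rs.sweep_expired()      # slot returns to inventory

rs.reserve("r2")
rs.confirm_payment("r2")  # pattern ①: key deleted, reservation confirmed
```

In the real system the "sweep" is driven by Redis expiration events rather than polling, and the inventory update runs as a DB transaction, as described above.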
- Data management
  - Each reservation has an expiration (`hold_expires_at`), and duplicate processing is prevented by idempotency keys.
  - States are managed along the flow `HOLD → PAYING → CONFIRMED / EXPIRED`.
- Monitoring and alert design
  - Measure pending count, release job latency, lock wait time, and erroneous release rate.
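The state flow described above (HOLD → PAYING → CONFIRMED/EXPIRED) can be made explicit as a transition table, so that illegal jumps and replayed events are rejected at the model level. A minimal sketch with illustrative names:

```python
# Allowed edges of the reservation state machine (from the text's flow).
ALLOWED = {
    ("HOLD", "PAYING"),
    ("HOLD", "EXPIRED"),      # abandoned before payment started
    ("PAYING", "CONFIRMED"),
    ("PAYING", "EXPIRED"),    # payment never completed
}

class Reservation:
    def __init__(self, rid):
        self.rid = rid
        self.state = "HOLD"

    def apply(self, target_state):
        """Transition only along allowed edges; anything else is rejected."""
        if (self.state, target_state) not in ALLOWED:
            return False
        self.state = target_state
        return True

r = Reservation("r1")
ok_paying = r.apply("PAYING")        # HOLD -> PAYING: allowed
ok_confirm = r.apply("CONFIRMED")    # PAYING -> CONFIRMED: allowed
ok_illegal = r.apply("HOLD")         # CONFIRMED -> HOLD: rejected
```

Keeping the table explicit means a late or out-of-order event can never move a confirmed reservation backwards, which matters for the Webhook patterns discussed later.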
Verification
Objective
Confirm that the automatic release job works correctly and can reliably return inventory.
Emphasis was placed on avoiding erroneous releases and ensuring that the overall system does not stop even if delays or lock contention occur.
Verification items
- Correctness: Whether inventory is updated correctly across various state transitions such as reservation, cancellation, and incomplete payment.
- Consistency: Whether double booking or overselling occurs.
- Stability: Whether processing delays and lock contention at peak times remain within acceptable ranges.
- Availability: Whether the system ultimately recovers correctly based on the DB even if Redis or jobs temporarily stop.
- Observability: Whether pending counts and delays can be checked on dashboards and notifications are sent in case of anomalies.
Verification methods
- Executed scenarios for both normal and abnormal cases (abandonment, payment delay, cancellation, fault injection) and verified reservation states and inventory counts against each other.
- Used load testing tools to reproduce loads close to actual traffic volume. Confirmed that automatic release did not lag and consistency was maintained even under high load.
- Confirmed that inventory is returned via both Redis TTL and DB scan paths.
- Empirically confirmed duplicate prevention via idempotency keys and verified that multiple processing of the same event did not occur.
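The duplicate-prevention check used in these tests can be illustrated with a small sketch: replaying the same expiry event must return the slot exactly once. All names are hypothetical; the real system records processed event IDs durably, not in a process-local set.

```python
def make_release_handler(inventory):
    """Idempotent release: the same event applied twice is a no-op."""
    processed = set()   # stand-in for a durable idempotency-key store

    def release(reservation_id, event_id):
        if event_id in processed:
            return False            # duplicate delivery: skip
        processed.add(event_id)
        inventory["available"] += 1  # return the slot exactly once
        return True

    return release

inv = {"available": 0}
release = make_release_handler(inv)
first = release("r1", "evt-1")
second = release("r1", "evt-1")   # replayed delivery of the same event
```

The verification scenario is then simply: fire the same event twice, assert one inventory increment.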
Results
- Confirmed that no erroneous releases or overselling occurred and that inventory was automatically and correctly returned.
- Although job delays and lock contention did occur, there was no impact on the overall system, and metrics remained within normal ranges on the monitoring dashboard.
Results and outcomes
Quantitative results
- Ghost reservations: Long‑lasting reservations were almost completely eliminated by automatic release.
- Inventory consistency: Differences between screen display and actual inventory almost disappeared.
- Response stability: Reservation API responses remained stable even at peak times, and processing delays became less likely.
- Operational effort: Manual inventory release work became unnecessary.
Qualitative effects
- Recovery of sales opportunities: “Apparent full booking” of inventory was eliminated, reducing sales loss.
- Enhanced visualization: Pending counts and release timing can now be constantly checked on dashboards.
References
Comparison of TTL management methods
In this case, Redis was used for reservation holding, but if it cannot be introduced or if persistence is prioritized, the following two methods are typical.
Pattern A: DB‑driven TTL management (periodic job scan)
- Configuration
  - Add `hold_expires_at` to the reservation table.
  - A batch/job worker periodically searches with `WHERE NOW() > hold_expires_at` and releases expired reservations.
- Advantages
  - No Redis required; consistency can be guaranteed with persistent data only.
  - State does not disappear after restarts or failure recovery.
- Disadvantages
  - Release delay depends on the batch interval.
  - Scanning large amounts of data increases DB load at peak times.
  - Difficult to resume sales immediately after expiration.
- Use cases
  - Cases where reliability is prioritized over immediacy, such as low‑frequency B2B reservations rather than tickets or hotels.
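Pattern A can be sketched against an in-memory SQLite database standing in for the production MySQL. The `hold_expires_at` column follows the text; the table layout and job function are illustrative.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE reservations (
        id              INTEGER PRIMARY KEY,
        state           TEXT NOT NULL,
        hold_expires_at REAL NOT NULL
    )""")

now = time.time()
conn.executemany(
    "INSERT INTO reservations (state, hold_expires_at) VALUES (?, ?)",
    [("HOLD", now - 60),     # expired a minute ago -> should be released
     ("HOLD", now + 900)])   # still inside its hold window

def release_expired(conn, now):
    """One batch pass: flip expired HOLDs to EXPIRED, report the count."""
    cur = conn.execute(
        "UPDATE reservations SET state = 'EXPIRED' "
        "WHERE state = 'HOLD' AND hold_expires_at < ?", (now,))
    conn.commit()
    return cur.rowcount

released = release_expired(conn, now)
```

The production variant would additionally return the slot to the inventory table in the same transaction and index `(state, hold_expires_at)` so the scan stays cheap.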
Pattern B: Job‑queue‑driven delayed execution
- Configuration
  - Register a “release job” in the queue at the time of the provisional hold.
  - Example: `enqueue(ReleaseJob, delay: 15.minutes, reservation_id: 1234)`
  - A message queue (SQS, RabbitMQ, Sidekiq, etc.) fires the job after 15 minutes.
- Advantages
  - Leverages the job system’s retry guarantees and persistence instead of a bare TTL key.
  - Delayed execution is accurate and requires no DB scan.
- Disadvantages
  - Delay precision of message queues is on the order of seconds, slightly coarser than Redis.
  - With many queued jobs, monitoring and retry management become complex.
- Use cases
  - Job‑driven architectures
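The firing logic of Pattern B can be simulated with a small in-memory scheduler, a stand-in for SQS or Sidekiq used only to show how a job becomes runnable once its delay elapses (illustrative sketch, deterministic clock passed in by the caller):

```python
import heapq

class DelayedQueue:
    """Minimal delayed-job queue: jobs fire once the clock passes
    their scheduled time (a sketch, not a real broker client)."""

    def __init__(self):
        self._heap = []
        self._seq = 0   # tie-breaker so heap never compares payloads

    def enqueue(self, job, delay, now):
        heapq.heappush(self._heap, (now + delay, self._seq, job))
        self._seq += 1

    def due(self, now):
        """Pop and return every job whose fire time has passed."""
        jobs = []
        while self._heap and self._heap[0][0] <= now:
            jobs.append(heapq.heappop(self._heap)[2])
        return jobs

q = DelayedQueue()
q.enqueue(("release", 1234), delay=15 * 60, now=0)
early = q.due(now=60)           # nothing fires yet
on_time = q.due(now=15 * 60)    # the release job becomes runnable
```

A real broker adds the persistence and retry guarantees listed above; the scheduling shape is the same.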
Comparison (◎ = strong, ○ = adequate, △ = weak)
| Item | Redis TTL method | DB scan method | Delayed job method |
|---|---|---|---|
| Immediacy | ◎ (almost real‑time) | △ (depends on batch interval) | ○ (seconds to tens of seconds) |
| Persistence | △ (volatile) | ◎ | ◎ |
| Structural simplicity | ○ | ○ | △ (depends on job system) |
| Load characteristics | Low (memory) | High (I/O‑intensive) | Medium |
| Failure recovery | Regeneration required | Persisted | Retry guarantees |
| Suitable cases | High‑frequency, immediacy‑oriented | Reliability/history‑oriented | Event‑driven/distributed environments |
Handling payment Webhook failures
Webhooks from payment providers can experience delays, duplicates, out‑of‑order delivery, and retries. The patterns below summarize how consistency is maintained in each case.
Duplicate delivery
Duplicate deliveries are absorbed by idempotency keys: processing the same event ID a second time is a no‑op, so retried Webhooks cannot double‑apply a state change.
Payment success arrives shortly after TTL expiration
Payment successes that occur immediately after TTL expiration are treated as “updates within the grace period,” skipping release and prioritizing confirmation processing.
Success is significantly delayed after TTL expiration
Payments that occur after expiration do not confirm the reservation; instead, they are switched to refund or re‑reservation guidance.
Out‑of‑order events (failure arrives first, success later)
In payment systems, it can actually happen that payment_failed arrives first and payment_succeeded arrives later with a delay.
In this case, we do not simply let the later event win; the reservation transitions to CONFIRMED only when both of the following conditions hold:
- `payment_succeeded.occurred_at` is within `hold_expires_at`
- The relevant inventory has not yet been secured by another reservation (inventory still available)
If either condition is not met, the reservation remains EXPIRED and the success event is routed to refund or re‑reservation guidance.
This prevents erroneous confirmations and overbooking due to out‑of‑order events.
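The two confirmation conditions can be captured in a small decision function. The sketch uses illustrative names and plain numeric timestamps:

```python
def decide_late_success(occurred_at, hold_expires_at, inventory_available):
    """Confirm a late payment_succeeded only if it happened inside the
    hold window AND the slot has not been taken by another reservation;
    otherwise route the event to refund / re-reservation guidance."""
    if occurred_at <= hold_expires_at and inventory_available:
        return "CONFIRMED"
    return "REFUND_OR_REBOOK"

in_window = decide_late_success(100, 900, True)      # confirm
too_late = decide_late_success(1000, 900, True)      # past the window
slot_gone = decide_late_success(100, 900, False)     # slot already taken
```

Centralizing the rule in one function makes it easy to test the out‑of‑order scenarios exhaustively.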
User cancels before payment
Note: 200 OK to the Webhook means “this event does not need to be resent,” and is not the refund process itself. Refunds and invalidations proceed separately as business processes in the background, via queue registration or DB records.
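The ack‑then‑process pattern in the note can be sketched as follows, with a plain function standing in for the HTTP handler, a set for the durable idempotency store, and a list for the background job queue (all names hypothetical):

```python
def handle_webhook(event, seen_ids, work_queue):
    """Acknowledge fast, process later: 200 only means 'do not resend'.
    Refunds/invalidations are enqueued, never executed inline."""
    if event["id"] in seen_ids:
        return 200                  # duplicate: already recorded, just ack
    seen_ids.add(event["id"])
    work_queue.append(event)        # real work happens in a background job
    return 200

seen, queue = set(), []
evt = {"id": "evt-9", "type": "payment_failed"}
first_status = handle_webhook(evt, seen, queue)
retry_status = handle_webhook(evt, seen, queue)   # provider retry
```

Returning 200 unconditionally (once the event is safely recorded) stops the provider's retry loop without coupling the response to the slower refund process.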