Speeding Up Search Responses (Eliminating N+1)
Summary
| Perspective | Details |
|---|---|
| Issue | The list API suffered from the N+1 query problem: a huge number of SELECTs was issued on every page fetch. |
| Action | Redesigned the data access layer and fetched related data in bulk using Preload. Optimized complex-condition lookups with batched retrieval using IN clauses. |
| Observation | Introduced structured query logs keyed by request correlation ID, visualizing query count, type, and latency in real time. |
| Result / Outcome | Reduced query count, stabilizing DB load and connection pool utilization. Additionally, established a state where performance regressions can be continuously detected. |
Overview of the Issue
In the ticket management SaaS, we provided a list display API that joined multiple tables such as tickets, assignees (user), status, comments, and customer information.
The operations team reported that “the list display is slow” and “the API response feels heavy,” and monitoring metrics also showed persistent response latency.
In the initial implementation, after fetching the ticket body, related data was loaded individually in a loop.
```go
var tickets []Ticket
db.Find(&tickets)
for i := range tickets {
	db.First(&tickets[i].User, tickets[i].UserID)
	db.First(&tickets[i].Status, tickets[i].StatusID)
	db.First(&tickets[i].Customer, tickets[i].CustomerID)
}
```
This code is a simplified sample illustrating a typical N+1 problem pattern.
In the actual environment, the structure included transaction management, but that is omitted here to clarify the essence of the problem (a loop structure that fetches related data one record at a time).
With this implementation, a large number of SELECT statements were issued per request, causing the so‑called N+1 query problem.
Investigation / Measurement Phase
To identify the causes of API response latency, we observed both the application and database sides.
First, we made Lambda function execution logs more detailed and changed the configuration so that processing time, query count, and SQL type were recorded for each request.
We propagated Lambda’s awsRequestID as a correlation ID across all layers, making it possible to trace logs per request.
We extended the ORM logger to output structured data for each SQL execution, including query type, execution time, and affected rows.
At the end of each request, we always output aggregated summaries such as total_sql, total_select, and total_insert, enabling immediate detection of which requests were issuing an abnormal number of queries.
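The per-request aggregation can be sketched as a small counter carried through the request; the `SQLStats` type and method names below are illustrative assumptions, not the production implementation.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// SQLStats accumulates per-request query counts (hypothetical name),
// sketching the total_sql / total_select summary described above.
type SQLStats struct {
	mu     sync.Mutex
	counts map[string]int
}

func NewSQLStats() *SQLStats { return &SQLStats{counts: map[string]int{}} }

// Record classifies a statement by its leading keyword and counts it.
func (s *SQLStats) Record(sql string) {
	fields := strings.Fields(sql)
	if len(fields) == 0 {
		return
	}
	op := strings.ToLower(fields[0]) // "select", "insert", ...
	s.mu.Lock()
	defer s.mu.Unlock()
	s.counts["total_sql"]++
	s.counts["total_"+op]++
}

// Summary returns a copy of the aggregated fields to emit at request end.
func (s *SQLStats) Summary() map[string]int {
	s.mu.Lock()
	defer s.mu.Unlock()
	out := make(map[string]int, len(s.counts))
	for k, v := range s.counts {
		out[k] = v
	}
	return out
}

func main() {
	st := NewSQLStats()
	st.Record("SELECT * FROM tickets")
	st.Record("SELECT * FROM users WHERE id = 1")
	st.Record("INSERT INTO audit_log VALUES (1)")
	fmt.Println(st.Summary()) // e.g. map[total_insert:1 total_select:2 total_sql:3]
}
```

In the real setup the summary map would be passed to the structured logger at the end of the handler rather than printed.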
On the database side, we enabled the slow query log and attached the application log’s request_id as an SQL comment.
This allowed us to cross-check the structured logs at the app layer with the execution status on the DB side and clearly confirm the N+1 pattern where many SELECTs were chained during list retrieval.
For implementation samples (correlation ID propagation, structured logger, GORM logger extension, SQL comment embedding), see “Reference.”
Comparison / Evaluation Phase
To eliminate N+1 and shorten response time, we compared multiple options and evaluated them from the perspectives of “query reduction / implementation cost / impact scope / operational load / rollback ease.”
Adopted

- Preload (Eager Loading)
  Use the ORM to fetch related data in advance in a single batch, consolidating the list and its required relations into the minimum number of SQL statements.
  Goal: fundamentally eliminate N+1 / limit changes to the data access layer
- Batch retrieval (IN clause)
  Use IDs collected from the list to fetch related data in bulk and merge it in memory.
  Goal: flexibly handle complex conditions that Preload alone cannot cover
Not Adopted (for now)

- Short-TTL cache (read side)
  Cache low-volatility data such as status and basic assignee information with a short lifetime.
  Goal: smooth out repeated lookups / reduce DB load
- Paging & column selection optimization
  Revisit page size, sort keys, and selected columns to reduce transfer volume and decode cost.
  Goal: stabilize latency / prevent over-fetching
Simple Comparison
| Option | Query Reduction | Implementation Cost | Impact Scope | Operational Load | Rollback Ease |
|---|---|---|---|---|---|
| Preload | ◎ | ◎ | Small | Small | ◎ |
| Batch retrieval (IN) | ○ | ○ | Medium | Small | ○ |
| Short-TTL cache | ○ | ○ | Small | Medium | ○ |
| Paging / column optimization | ○ | ◎ | Large | Small | ◎ |
Reasons for Adoption
- Changes are confined to the data access layer, so no API spec or frontend changes are needed.
- Direct reduction in query count is expected, leading straight to latency improvement.
- Configuration differences are small, allowing phased rollout and instant rollback.
Measurement Policy (Evaluation Axes)
- Per-request total query count / SELECT count / latency distribution.
- Trends in error rate / timeouts / slow query count.
Future Candidates (Deferred for Now)
- Handwritten SQL + JOIN optimization: very flexible but high maintenance cost. First see how far we can go with the ORM.
- Pre-aggregation / denormalization: large design impact and recomputation cost. Reevaluate when access volume grows.
Note: More heavyweight measures (handwritten SQL, denormalization) will be reconsidered once improvements with the above options hit a ceiling.
Notes on Preload
In the ORM (GORM), this is the feature that “fetches related tables in advance in bulk.” When retrieving the ticket list, it fetches user, customer, and status information all at once, which is key to fundamentally eliminating the N+1 structure.
```go
// Before: fetch related data individually per ticket (N+1)
var tickets []Ticket
db.Find(&tickets)
for i := range tickets {
	db.First(&tickets[i].User, tickets[i].UserID)
	db.First(&tickets[i].Customer, tickets[i].CustomerID)
	db.First(&tickets[i].Status, tickets[i].StatusID)
}

// After: fetch related data in bulk with Preload
// (one query per relation instead of one per row)
var tickets []Ticket
db.Preload("User").Preload("Customer").Preload("Status").Find(&tickets)
```
Cases Where Batch Retrieval (IN Clause) Is Effective
- When you want related data whose conditions differ per parent
  Example: “only the latest comment for each ticket,” where `LIMIT 1` / `ORDER BY` differs per parent.
- When complex conditions or subqueries are needed
  Example: “attach User only when the Customer has a specific tag,” involving JOINs or existence checks.
- When you want to consolidate multi-level (2+ levels) relations
  Example: fetch the second level and beyond of `Ticket -> Customer -> Company` in bulk and merge in memory.
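The collect-IDs-then-merge step can be sketched in plain Go. The `Comment` model and field names are assumptions for illustration; with GORM the fetch itself would look roughly like `db.Where("ticket_id IN ?", ids).Order("created_at DESC").Find(&comments)`.

```go
package main

import "fmt"

// Comment is a minimal stand-in for the comment model (illustrative fields).
type Comment struct {
	ID       int
	TicketID int
	Body     string
}

// latestPerTicket keeps only the newest comment per ticket, assuming the
// input slice is ordered newest-first (e.g. fetched with ORDER BY created_at
// DESC after a single IN-clause query). This is the in-memory merge step.
func latestPerTicket(comments []Comment) map[int]Comment {
	latest := make(map[int]Comment)
	for _, c := range comments {
		if _, seen := latest[c.TicketID]; !seen {
			latest[c.TicketID] = c // first hit per ticket is the newest
		}
	}
	return latest
}

func main() {
	rows := []Comment{
		{ID: 9, TicketID: 1, Body: "newest for #1"},
		{ID: 8, TicketID: 2, Body: "newest for #2"},
		{ID: 3, TicketID: 1, Body: "older for #1"},
	}
	fmt.Println(latestPerTicket(rows)[1].ID) // 9
}
```

One bulk query plus an O(n) merge replaces one `LIMIT 1` query per parent, which is exactly the case Preload alone cannot express.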
Migration Plan
Policy (Non-breaking Externally, Reversibility Emphasized)
- Do not change the API spec: keep the response structure as-is. Perform optimizations in the data access layer (DAO/Repository).
- Switch via feature flag: control `preload.enabled` with an environment variable.
- Allow immediate rollback: if an issue occurs, turn the flag OFF to revert to the old path.
Preparation
- Measurement foundation: dashboards based on `request_id`-centric structured logs (total SQL count / SELECT count / latency / slow queries).
- Guardrails: set SLOs and alert thresholds (e.g., increase in total SQL, latency spike, timeout rate increase).
Phased Release
- Scope selection: first enable it only for high-traffic list APIs, limited by target endpoint.
Monitoring / Alerts
- Real-time monitoring: total SQL count, SELECT ratio, p95 / timeouts, slow query count, DB CPU / IO.
- Automatic response to anomalies: automatically turn the flag OFF when thresholds are exceeded.
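The threshold check behind the automatic flag-off can be sketched as a pure function; the `Guardrail` type and the threshold values are assumptions for illustration, and in the real setup the alarm action would flip the environment flag (e.g. via the deployment tooling), not an in-process variable.

```go
package main

import "fmt"

// Guardrail holds the alert thresholds (illustrative names and values).
type Guardrail struct {
	MaxTotalSQL  int   // per-request total SQL count
	MaxP95Millis int64 // p95 latency in milliseconds
}

// ShouldDisable reports whether the preload flag should be turned OFF,
// mirroring the "automatic response to anomalies" rule above: any single
// threshold breach is enough to trigger the rollback.
func (g Guardrail) ShouldDisable(totalSQL int, p95Millis int64) bool {
	return totalSQL > g.MaxTotalSQL || p95Millis > g.MaxP95Millis
}

func main() {
	g := Guardrail{MaxTotalSQL: 50, MaxP95Millis: 800}
	fmt.Println(g.ShouldDisable(120, 300)) // true: query count exceeded
	fmt.Println(g.ShouldDisable(10, 300))  // false: healthy
}
```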
Results
Quantitative Outcomes
- Reduction in total query count
  The number of SQL statements per list API request decreased significantly (the N+1 structure was removed).
- Stabilized response time
  The main cause of latency, repeated SELECTs, was eliminated, reducing variance between requests.
  Even at peak times, no timeouts occurred, and the API response time distribution converged into a stable range.
- Sharp drop in slow queries
  Slow query log entries caused by repeated issuance of the same query almost disappeared.
  DB CPU load and I/O wait time also stabilized.
- Improved resource efficiency
  Fewer queries lowered connection pool utilization, reducing spikes in Lambda concurrency and DB connections.
  As a result, infrastructure costs (DB and Lambda execution time) also trended downward.
Qualitative Outcomes
- Reduced development and operations burden
  Since query optimization was confined to the data access layer, upper layers were unaffected and changes were easier to make.
  The codebase became more unified, making it easier to incorporate N+1 detection and performance regression tests into routine operations.
- Improved reusability of improvements
  We generalized the Preload / IN-clause batch pattern so it can be rolled out to other modules in the same way.
  Data retrieval logic was standardized, helping prevent recurrence of similar issues.
- Better product experience
  Perceived speed of list displays improved, boosting the work efficiency of support and sales teams.
  Waiting time for customer inquiry handling and ticket searches decreased, reducing day-to-day operational stress.
Reference
Below is an excerpt showing the mechanism for N+1 detection and log correlation.
The real code includes ORM configuration, PII masking, error handling, etc., but only the key points are shown here.
Correlation ID Propagation (Lambda Handler Excerpt)
```go
func withRequestID(ctx context.Context, req events.APIGatewayProxyRequest) context.Context {
	rid := req.Headers["x-request-id"]
	if rid == "" {
		if awsID := ctx.Value("aws_request_id"); awsID != nil {
			if s, ok := awsID.(string); ok {
				rid = s
			}
		}
	}
	return context.WithValue(ctx, "request_id", rid)
}
```
Structured Log Output (Main Part)
```go
func Log(ctx context.Context, level, msg string, f map[string]any) {
	f["ts"] = time.Now().UTC().Format(time.RFC3339Nano)
	f["level"] = level
	f["msg"] = msg
	if rid := ctx.Value("request_id"); rid != nil {
		f["request_id"] = rid
	}
	b, _ := json.Marshal(f)
	fmt.Println(string(b))
}
```
GORM Logger Extension (Trace Excerpt)
```go
func (l *GormStructuredLogger) Trace(ctx context.Context, begin time.Time, fc func() (string, int64), err error) {
	sql, rows := fc()
	fields := map[string]any{
		"op":         detectOp(sql),
		"rows":       rows,
		"elapsed_ms": time.Since(begin).Milliseconds(),
	}
	if err != nil {
		fields["error"] = err.Error()
	}
	logx.Log(ctx, "debug", "sql_event", fields)
}
```
Embedding request_id in SQL Comments
```go
func WithRequestIDTag(ctx context.Context, db *gorm.DB) *gorm.DB {
	if rid, ok := ctx.Value("request_id").(string); ok {
		return db.Set("gorm:query_option", fmt.Sprintf("/* request_id=%s */", rid))
	}
	return db
}
```
In the actual implementation, this is combined with log aggregation, SQL masking, and Prometheus metric emission.
| Layer | Main Observation Points | What Information You Get | Main Purpose |
|---|---|---|---|
| API layer (Lambda handler) | `request_id` (correlation ID); processing start/end time; total query count / processing time | Which requests are slow; which endpoints are experiencing latency; how many SQLs per request | Identify which API is the bottleneck |
| ORM layer (GORM logger) | Execution time of each SQL (`elapsed_ms`); query type (`SELECT`/`INSERT`/`UPDATE`); row count (`rows`) | What SQLs are running and how often; signs of N+1 (SELECT bursts); bias toward UPDATE/INSERT | Detect structural anomalies such as N+1 within the app |
| DB layer (slow query log) | Actual SQL statements; execution time / lock time; comments like `/* request_id=... */` | Which SQLs are slow; execution plans and index effectiveness; multiple queries issued within the same request | Cross-check app-side latency with DB load |
Summary of Log Destinations (by Layer)
API Layer (Lambda Handler)
- Lambda stdout → CloudWatch Logs (Lambda/function name)
ORM Layer (GORM Logger)
- Go app stdout → CloudWatch Logs (same log group)
Since API and ORM logs go to the same CloudWatch log group, you can trace them together by `request_id`.
DB Layer (RDS/Aurora Slow Query Log)
- RDS/Aurora Slow Query Log (on the DB side)
- If needed, turning on “log export” will also send it to CloudWatch Logs.
- SQL comments like `/* request_id=... */` appear here as well.
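For illustration, a MySQL-style slow query log entry carrying the embedded comment might look like the following (all values are made up):

```sql
# Time: 2024-05-01T03:12:45.123456Z
# User@Host: app[app] @  [10.0.1.23]
# Query_time: 1.532  Lock_time: 0.001  Rows_sent: 50  Rows_examined: 120000
SELECT * FROM tickets WHERE status_id = 3 ORDER BY updated_at DESC LIMIT 50 /* request_id=abc-123 */;
```

Grepping the slow query log for a `request_id` seen in the app-side structured logs is what closes the loop between the two layers.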