Speeding Up Search Responses (Eliminating N+1)
Summary
| Perspective | Details |
|---|---|
| Issue | The list API suffered from the N+1 query problem: a huge number of SELECTs was issued on every page fetch. |
| Action | Redesigned the data access layer and fetched related data in bulk using Preload. Optimized complex-condition lookups with batched retrieval using IN clauses. |
| Observation | Introduced structured query logs keyed by request correlation ID, visualizing query count, type, and latency in real time. |
| Result / Outcome | Reduced query count, stabilizing DB load and connection pool utilization. Additionally, established a state where performance regressions can be continuously detected. |
Overview of the Issue
In the ticket management SaaS, we provided a list display API that joined multiple tables such as tickets, assignees (user), status, comments, and customer information.
The operations team reported that “the list display is slow” and “the API response feels heavy,” and monitoring metrics also showed persistent response latency.
In the initial implementation, after fetching the ticket body, related data was loaded individually in a loop.
```go
var tickets []Ticket
db.Find(&tickets)
for i := range tickets {
	db.First(&tickets[i].User, tickets[i].UserID)
	db.First(&tickets[i].Status, tickets[i].StatusID)
	db.First(&tickets[i].Customer, tickets[i].CustomerID)
}
```
This code is a simplified sample illustrating a typical N+1 problem pattern.
In the actual environment, the structure included transaction management, but that is omitted here to clarify the essence of the problem (a loop structure that fetches related data one record at a time).
With this implementation, a large number of SELECT statements were issued per request, causing the so‑called N+1 query problem.
Investigation / Measurement Phase
To identify the causes of API response latency, we observed both the application and database sides.
First, we made Lambda function execution logs more detailed and changed the configuration so that processing time, query count, and SQL type were recorded for each request.
We propagated Lambda’s awsRequestID as a correlation ID across all layers, making it possible to trace logs per request.
We extended the ORM logger to output structured data for each SQL execution, including query type, execution time, and affected rows.
At the end of each request, we always output aggregated summaries such as total_sql, total_select, and total_insert, enabling immediate detection of which requests were issuing an abnormal number of queries.
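The per-request aggregation can be sketched as a small counter carried through the request; the `SQLStats` type and method names below are illustrative assumptions, not the production implementation.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// SQLStats accumulates per-request query counts (hypothetical name),
// sketching the total_sql / total_select summary described above.
type SQLStats struct {
	mu     sync.Mutex
	counts map[string]int
}

func NewSQLStats() *SQLStats { return &SQLStats{counts: map[string]int{}} }

// Record classifies a statement by its leading keyword and counts it.
func (s *SQLStats) Record(sql string) {
	fields := strings.Fields(sql)
	if len(fields) == 0 {
		return
	}
	op := strings.ToLower(fields[0]) // "select", "insert", ...
	s.mu.Lock()
	defer s.mu.Unlock()
	s.counts["total_sql"]++
	s.counts["total_"+op]++
}

// Summary returns a copy of the aggregated fields to emit at request end.
func (s *SQLStats) Summary() map[string]int {
	s.mu.Lock()
	defer s.mu.Unlock()
	out := make(map[string]int, len(s.counts))
	for k, v := range s.counts {
		out[k] = v
	}
	return out
}

func main() {
	st := NewSQLStats()
	st.Record("SELECT * FROM tickets")
	st.Record("SELECT * FROM users WHERE id = 1")
	st.Record("INSERT INTO audit_log VALUES (1)")
	fmt.Println(st.Summary()) // e.g. map[total_insert:1 total_select:2 total_sql:3]
}
```

In the real setup the summary map would be passed to the structured logger at the end of the handler rather than printed.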
On the database side, we enabled the slow query log and attached the application log’s request_id as an SQL comment.
This allowed us to cross-check the structured logs at the app layer with the execution status on the DB side and clearly confirm the N+1 pattern where many SELECTs were chained during list retrieval.
For implementation samples (correlation ID propagation, structured logger, GORM logger extension, SQL comment embedding), see “Reference.”
Comparison / Evaluation Phase
To eliminate N+1 and shorten response time, we compared multiple options and evaluated them from the perspectives of “query reduction / implementation cost / impact scope / operational load / rollback ease.”
Adopted

- Preload (Eager Loading)
  Use the ORM to fetch related data in advance in a single batch, consolidating the list and its required relations into the minimum number of SQL statements.
  Goal: fundamentally eliminate N+1 / limit changes to the data access layer
- Batch retrieval (IN clause)
  Use IDs collected from the list to fetch related data in bulk and merge it in memory.
  Goal: flexibly handle complex conditions that Preload alone cannot cover
Not Adopted (for now)

- Short-TTL cache (read side)
  Cache low-volatility data such as status and basic assignee information with a short lifetime.
  Goal: smooth out repeated lookups / reduce DB load
- Paging & column selection optimization
  Revisit page size, sort keys, and selected columns to reduce transfer volume and decode cost.
  Goal: stabilize latency / prevent over-fetching
Simple Comparison
| Option | Query Reduction | Implementation Cost | Impact Scope | Operational Load | Rollback Ease |
|---|---|---|---|---|---|
| Preload | ◎ | ◎ | Small | Small | ◎ |
| Batch retrieval (IN) | ○ | ○ | Medium | Small | ○ |
| Short-TTL cache | ○ | ○ | Small | Medium | ○ |
| Paging / column optimization | ○ | ◎ | Large | Small | ◎ |
Reasons for Adoption
- Changes are confined to the data access layer, so no API spec or frontend changes are needed.
- Direct reduction in query count is expected, leading straight to latency improvement.
- Configuration differences are small, allowing phased rollout and instant rollback.
Measurement Policy (Evaluation Axes)
- Per-request total query count / SELECT count / latency distribution.
- Trends in error rate / timeouts / slow query count.
Future Candidates (Deferred for Now)
- Handwritten SQL + JOIN optimization: very flexible but high maintenance cost. First see how far we can go with the ORM.
- Pre-aggregation / denormalization: large design impact and recomputation cost. Reevaluate when access volume grows.
Note: More heavyweight measures (handwritten SQL, denormalization) will be reconsidered once improvements with the above options hit a ceiling.
Notes on Preload
In the ORM (GORM), this is the feature that “fetches related tables in advance in bulk.” When retrieving the ticket list, it fetches user, customer, and status information all at once, which is key to fundamentally eliminating the N+1 structure.
```go
// Before: fetch related data individually per ticket (N+1)
var tickets []Ticket
db.Find(&tickets)
for i := range tickets {
	db.First(&tickets[i].User, tickets[i].UserID)
	db.First(&tickets[i].Customer, tickets[i].CustomerID)
	db.First(&tickets[i].Status, tickets[i].StatusID)
}

// After: fetch related data in bulk with Preload
// (one query per relation instead of one per row)
var tickets []Ticket
db.Preload("User").Preload("Customer").Preload("Status").Find(&tickets)
```
Cases Where Batch Retrieval (IN Clause) Is Effective
- When you want related data whose conditions differ per parent
  Example: “only the latest comment for each ticket,” where `LIMIT 1` / `ORDER BY` differs per parent.
- When complex conditions or subqueries are needed
  Example: “attach User only when the Customer has a specific tag,” involving JOINs or existence checks.
- When you want to consolidate multi-level (2+ levels) relations
  Example: fetch the second level and beyond of `Ticket -> Customer -> Company` in bulk and merge in memory.
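The collect-IDs-then-merge step can be sketched in plain Go. The `Comment` model and field names are assumptions for illustration; with GORM the fetch itself would look roughly like `db.Where("ticket_id IN ?", ids).Order("created_at DESC").Find(&comments)`.

```go
package main

import "fmt"

// Comment is a minimal stand-in for the comment model (illustrative fields).
type Comment struct {
	ID       int
	TicketID int
	Body     string
}

// latestPerTicket keeps only the newest comment per ticket, assuming the
// input slice is ordered newest-first (e.g. fetched with ORDER BY created_at
// DESC after a single IN-clause query). This is the in-memory merge step.
func latestPerTicket(comments []Comment) map[int]Comment {
	latest := make(map[int]Comment)
	for _, c := range comments {
		if _, seen := latest[c.TicketID]; !seen {
			latest[c.TicketID] = c // first hit per ticket is the newest
		}
	}
	return latest
}

func main() {
	rows := []Comment{
		{ID: 9, TicketID: 1, Body: "newest for #1"},
		{ID: 8, TicketID: 2, Body: "newest for #2"},
		{ID: 3, TicketID: 1, Body: "older for #1"},
	}
	fmt.Println(latestPerTicket(rows)[1].ID) // 9
}
```

One bulk query plus an O(n) merge replaces one `LIMIT 1` query per parent, which is exactly the case Preload alone cannot express.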
Migration Plan
Policy (Non-breaking Externally, Reversibility Emphasized)
- Do not change the API spec: keep the response structure as-is. Perform optimizations in the data access layer (DAO/Repository).
- Switch via feature flag: control `preload.enabled` with an environment variable.
- Allow immediate rollback: if an issue occurs, turn the flag OFF to revert to the old path.
Preparation
- Measurement foundation: dashboards based on `request_id`-centric structured logs (total SQL count / SELECT count / latency / slow queries).
- Guardrails: set SLOs and alert thresholds (e.g., increase in total SQL, latency spike, timeout rate increase).
Phased Release
- Scope selection: first enable it only for high-traffic list APIs, limited by target endpoint.
Monitoring / Alerts
- Real-time monitoring: total SQL count, SELECT ratio, p95 / timeouts, slow query count, DB CPU / IO.
- Automatic response to anomalies: automatically turn the flag OFF when thresholds are exceeded.
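The threshold check behind the automatic flag-off can be sketched as a pure function; the `Guardrail` type and the threshold values are assumptions for illustration, and in the real setup the alarm action would flip the environment flag (e.g. via the deployment tooling), not an in-process variable.

```go
package main

import "fmt"

// Guardrail holds the alert thresholds (illustrative names and values).
type Guardrail struct {
	MaxTotalSQL  int   // per-request total SQL count
	MaxP95Millis int64 // p95 latency in milliseconds
}

// ShouldDisable reports whether the preload flag should be turned OFF,
// mirroring the "automatic response to anomalies" rule above: any single
// threshold breach is enough to trigger the rollback.
func (g Guardrail) ShouldDisable(totalSQL int, p95Millis int64) bool {
	return totalSQL > g.MaxTotalSQL || p95Millis > g.MaxP95Millis
}

func main() {
	g := Guardrail{MaxTotalSQL: 50, MaxP95Millis: 800}
	fmt.Println(g.ShouldDisable(120, 300)) // true: query count exceeded
	fmt.Println(g.ShouldDisable(10, 300))  // false: healthy
}
```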
Results
Quantitative Outcomes
- Reduction in total query count
  The number of SQL statements per list API request decreased significantly (the N+1 structure was removed).
- Stabilized response time
  The main cause of latency, repeated SELECTs, was eliminated, reducing variance between requests.
  Even at peak times, no timeouts occurred, and the API response time distribution converged into a stable range.
- Sharp drop in slow queries
  Slow query log entries caused by repeated issuance of the same query almost disappeared.
  DB CPU load and I/O wait time also stabilized.
- Improved resource efficiency
  Fewer queries lowered connection pool utilization, reducing spikes in Lambda concurrency and DB connections.
  As a result, infrastructure costs (DB and Lambda execution time) also trended downward.
Qualitative Outcomes
- Reduced development and operations burden
  Since query optimization was confined to the data access layer, upper layers were unaffected and changes were easier to make.
  The codebase became more unified, making it easier to incorporate N+1 detection and performance regression tests into routine operations.
- Improved reusability of improvements
  We generalized the Preload / IN-clause batch pattern so it can be rolled out to other modules in the same way.
  Data retrieval logic was standardized, helping prevent recurrence of similar issues.
- Better product experience
  Perceived speed of list displays improved, boosting the work efficiency of support and sales teams.
  Waiting time for customer inquiry handling and ticket searches decreased, reducing day-to-day operational stress.
Reference
Below is an excerpt showing the mechanism for N+1 detection and log correlation.
The real code includes ORM configuration, PII masking, error handling, etc., but only the key points are shown here.
Correlation ID Propagation (Lambda Handler Excerpt)
```go
func withRequestID(ctx context.Context, req events.APIGatewayProxyRequest) context.Context {
	rid := req.Headers["x-request-id"]
	if rid == "" {
		if awsID := ctx.Value("aws_request_id"); awsID != nil {
			if s, ok := awsID.(string); ok {
				rid = s
			}
		}
	}
	return context.WithValue(ctx, "request_id", rid)
}
```
Structured Log Output (Main Part)
```go
func Log(ctx context.Context, level, msg string, f map[string]any) {
	f["ts"] = time.Now().UTC().Format(time.RFC3339Nano)
	f["level"] = level
	f["msg"] = msg
	if rid := ctx.Value("request_id"); rid != nil {
		f["request_id"] = rid
	}
	b, _ := json.Marshal(f)
	fmt.Println(string(b))
}
```
GORM Logger Extension (Trace Excerpt)
```go
func (l *GormStructuredLogger) Trace(ctx context.Context, begin time.Time, fc func() (string, int64), err error) {
	sql, rows := fc()
	fields := map[string]any{
		"op":         detectOp(sql),
		"rows":       rows,
		"elapsed_ms": time.Since(begin).Milliseconds(),
	}
	if err != nil {
		fields["error"] = err.Error()
	}
	logx.Log(ctx, "debug", "sql_event", fields)
}
```
Embedding request_id in SQL Comments
```go
func WithRequestIDTag(ctx context.Context, db *gorm.DB) *gorm.DB {
	if rid, ok := ctx.Value("request_id").(string); ok {
		return db.Set("gorm:query_option", fmt.Sprintf("/* request_id=%s */", rid))
	}
	return db
}
```
In the actual implementation, this is combined with log aggregation, SQL masking, and Prometheus metric emission.
| Layer | Main Observation Points | What Information You Get | Main Purpose |
|---|---|---|---|
| API layer (Lambda handler) | `request_id` (correlation ID); processing start/end time; total query count / processing time | Which requests are slow; which endpoints are experiencing latency; how many SQLs per request | Identify which API is the bottleneck |
| ORM layer (GORM logger) | Execution time of each SQL (`elapsed_ms`); query type (`SELECT`/`INSERT`/`UPDATE`); row count (`rows`) | What SQLs are running and how often; signs of N+1 (SELECT bursts); bias toward UPDATE/INSERT | Detect structural anomalies such as N+1 within the app |
| DB layer (slow query log) | Actual SQL statements; execution time / lock time; comments like `/* request_id=... */` | Which SQLs are slow; execution plans and index effectiveness; multiple queries issued within the same request | Cross-check app-side latency with DB load |
Summary of Log Destinations (by Layer)
API Layer (Lambda Handler)
- Lambda stdout → CloudWatch Logs (Lambda/function name)
ORM Layer (GORM Logger)
- Go app stdout → CloudWatch Logs (same log group)
Since API and ORM logs go to the same CloudWatch log group, you can trace them together by `request_id`.
DB Layer (RDS/Aurora Slow Query Log)
- RDS/Aurora Slow Query Log (on the DB side)
- If needed, turning on “log export” will also send it to CloudWatch Logs.
- SQL comments like `/* request_id=... */` appear here as well.
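For illustration, a MySQL-style slow query log entry carrying the embedded comment might look like the following (all values are made up):

```sql
# Time: 2024-05-01T03:12:45.123456Z
# User@Host: app[app] @  [10.0.1.23]
# Query_time: 1.532  Lock_time: 0.001  Rows_sent: 50  Rows_examined: 120000
SELECT * FROM tickets WHERE status_id = 3 ORDER BY updated_at DESC LIMIT 50 /* request_id=abc-123 */;
```

Grepping the slow query log for a `request_id` seen in the app-side structured logs is what closes the loop between the two layers.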