Building a Three-Layer Test Foundation that Supports Continuous Improvement

  • jest
  • playwright
  • testinglibrary
  • react
  • expressjs
Published on 2024/08/10

Summary

  • Issue: Because we depended on manual verification, regressions occurred, verification costs increased, and procedures became person-dependent, which stalled the pace of improvements and large-scale upgrades.
  • Response: Clearly defined three layers of Unit / Integration / E2E tests, and prepared an execution environment (CI, DB, mocks, data seeding) that makes tests easy to implement.
  • Operation: Reduced writing and maintenance costs through test templating and shared fixtures / builders, and established naming conventions that let you understand the cause of a failure at a glance.
  • Results / Outcomes: By achieving a state where we can have “confidence that nothing is broken,” we can safely perform refactoring and dependency upgrades. Manual checks were also greatly reduced.

Background and Issues

Because all pre-release quality checks were done manually, the following issues had become apparent:

  • Verification costs increased with every feature addition or refactor
  • We could not fully prevent regressions (breaking existing features) caused by subtle spec changes
  • “Verification procedures” lived in individual heads, were hard to share, and did not add up to a reproducible quality assurance process

Also, although introducing TypeScript ensured a certain level of type safety, it did not cover verification of actual behavior, so there were still areas where we could not notice that things were broken.

As a result, developers could not refactor with confidence, and the team’s improvement speed hit a ceiling.

We also could not proceed with upgrades of high-impact libraries (React, webpack, express, etc.).

Research and Measurement Phase

Before introducing automated tests, we visualized the existing quality assurance process and risk structure to clarify “what must be protected.”
The goal was not simply to increase the number of tests, but to guarantee system behavior at minimal cost.

Inventory of the Quality Assurance Process

Starting from the manual checklists, we mapped items along three axes: change frequency, incident rate, and user impact.
This allowed us to define “features that change frequently and have high impact when they fail” as high priority.
We ranked each area as “priority for behavior assurance” and clarified where to invest in test development.
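The mapping along the three axes can be sketched as a small scoring helper. The scales and the multiplicative weighting below are illustrative assumptions, not the team's actual formula:

```typescript
// Illustrative sketch: rank features for test investment along the three
// axes described above. The 1..5 scales and multiplicative weighting are
// assumptions for illustration, not the team's actual formula.
interface FeatureRisk {
  name: string
  changeFrequency: number // 1 (rarely changes) .. 5 (changes every sprint)
  incidentRate: number    // 1 (never failed) .. 5 (fails often)
  userImpact: number      // 1 (cosmetic) .. 5 (blocks core flows)
}

function testPriority(f: FeatureRisk): number {
  // Multiplying keeps "changes frequently AND high impact" firmly at the top
  return f.changeFrequency * f.incidentRate * f.userImpact
}

function rankByPriority(features: FeatureRisk[]): FeatureRisk[] {
  return [...features].sort((a, b) => testPriority(b) - testPriority(a))
}

const ranked = rankByPriority([
  { name: "checkout", changeFrequency: 4, incidentRate: 3, userImpact: 5 },
  { name: "profile page", changeFrequency: 2, incidentRate: 1, userImpact: 2 },
])
```

Even a rough score like this is enough to decide where to write the first tests.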

Evaluation of Testability

For the main modules of React and Express, we extracted functions with many side effects and areas with strong state dependence.
We planned improvements such as function separation and dependency injection (DI) for structures that hinder testability, and built a foundation for continuous test development.

Introduction and Design (System Setup)

Before “writing” tests, we prioritized preparing the environment and design so that tests run correctly.

Design of the Test Foundation and Layer Structure

We defined a three-layer structure of unit tests, integration tests, and E2E tests, and clarified the role of each.

  • Unit
    Purpose: validity of logic in pure functions and methods
    Scope: a single module with no external dependencies
    Granularity: function / class level
    Viewpoints: branch conditions, input/output consistency, exception handling
  • Integration
    Purpose: inter-module collaboration and data flow
    Scope: including DB, API, and external services
    Granularity: component level / API endpoint
    Viewpoints: request/response consistency
  • E2E (Scenario)
    Purpose: actual user operations and consistency of the entire system
    Scope: browser + server
    Granularity: screen operations / scenario level
    Viewpoints: UI flow, state transitions, UX reproducibility

We documented the responsibilities of each layer and unified the granularity and execution scope of test code.
We also redesigned the structure so that side effects can be injected from the outside, ensuring testability from the design stage.

Example:

Implementation side

// app/http/HttpClient.ts
export interface HttpClient {
  get<T>(url: string): Promise<T>
  post<T>(url: string, body: unknown): Promise<T>
}

export const fetchClient: HttpClient = {
  async get(url) {
    const r = await fetch(url)
    return r.json()
  },
  async post(url, body) {
    const r = await fetch(url, { method: "POST", body: JSON.stringify(body) })
    return r.json()
  },
}

export class ListingQuery {
  constructor(private http: HttpClient) {}
  async byId(id: string) {
    return this.http.get(`/api/listings/${id}`)
  }
}

Test code

// ListingQuery.test.ts
it("returns the listing fetched via the injected client", async () => {
  const mockClient: HttpClient = {
    get: jest.fn().mockResolvedValue({ id: 1, title: "mock" }),
    post: jest.fn().mockResolvedValue({ ok: true }),
  }

  const query = new ListingQuery(mockClient)
  expect(await query.byId("1")).toEqual({ id: 1, title: "mock" })
})

Preparation of Execution Environment and Operational Foundation

We adopted Jest (unit and integration) and Playwright (E2E) as the test foundation.
We designed the test environment and CI foundation with top priority on “being able to reliably reproduce failures.”

Measures to Maintain Reproducibility

  • Fixing dependency versions (eliminating environment differences via package-lock)
  • Initializing test data and fixing seeds to maintain consistent state
  • Mocking external APIs (msw / nock) to remove network dependencies
  • Fixing time and random numbers to suppress non-deterministic behavior

This enabled a test environment where “failures can be reproduced under the same conditions” both locally and in CI.
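The "fix time and random numbers" measure can be sketched with a seeded PRNG and an injected clock. The article does not say which generator the team used; mulberry32 is a well-known tiny PRNG, used here purely for illustration:

```typescript
// Sketch of deterministic test inputs: a seeded PRNG plus a fixed clock,
// so the same seed reproduces the same run locally and on CI.
// mulberry32 is a common tiny PRNG, chosen here only for illustration.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0
  return () => {
    a = (a + 0x6d2b79f5) >>> 0
    let t = a
    t = Math.imul(t ^ (t >>> 15), t | 1)
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61)
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296
  }
}

// Fixed clock injected instead of calling Date.now() directly in tests
const fixedNow = () => new Date("2024-08-10T00:00:00Z").getTime()

// Same seed => identical sequence on every run
const randA = mulberry32(42)
const randB = mulberry32(42)
```

Any data generated through such a seeded source can be regenerated exactly when a failure needs to be reproduced.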

Mechanisms to Ensure Reliable Re-runs

  • Job design that considers parallel execution performance and cache characteristics (improved stability on CI)
  • Flexible adjustment of timeouts and retries to control execution independent of environment load
  • Automatic saving of logs, screenshots, and traces on failure to make debugging on re-run easier

In the CI environment, we built automatic execution per PR on GitHub Actions.
We realized an operation that emphasizes reproducibility, where tests can be re-run and analyzed under the same conditions even if they fail.

Implementation (Writing and Operating Tests)

Based on the foundation prepared in the design phase, we moved to the stage of “building up” tests.
The goal was not simply to increase coverage, but to build a mechanism that reliably detects when something breaks.

Establishing Unit Tests

We rigorously verified the correspondence between input and output, focusing on functions.

  • Unified naming conventions and purposes of test cases (happy path / error path / boundary values)
  • Aligned test file structure one-to-one with implementation files to ensure ease of reference

This made it possible to immediately identify the smallest broken unit when code changes.

Example:

describe('addUser', () => {
  describe('happy path', () => {
    it('registers a new user', () => { ... })
  })

  describe('error cases', () => {
    it('returns an error when name is empty', () => { ... })
  })

  describe('boundary values', () => {
    it('registers a user whose name is 1 character long', () => { ... })
  })
})

Integration Tests (API, DB, Inter-Module Collaboration)

As a middle layer between unit tests and E2E tests, we designed tests to narrowly verify “dependencies that cannot be covered by a single module alone.”

The target was not “an entire feature,” but limited to the scope of one module plus its direct dependencies.

  • Integration tests for the API layer
    Using supertest, we sent real requests at the Express handler level. We verified connections with the business logic layer and authentication middleware, and checked consistency of status codes, response structures, and validation errors.

  • Integration tests for the DB access layer
    We executed CRUD operations against a real MongoDB (local / container environment). We checked the impact of schema changes and index settings, and ensured consistency of type definitions, persistence, and restoration.

  • Integration tests for external integration modules
    For Webhooks and external API calls, we used msw/node to stub them. We hooked actual HTTP requests and verified retry control, error handling, and request structure consistency. By not fully mocking the communication layer and leaving HTTP-level interactions, we achieved integration assurance in a form close to the real environment.

To prevent data races during parallel execution, we generated independent schema names and temporary data per test, thoroughly designing tests to be re-runnable.

This layer enabled us to detect “integration inconsistencies that previously could only be noticed by E2E” in advance.
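The per-test isolation described above (independent schema names and temporary data) can be sketched as a small helper; the "test_" prefix and name format are illustrative assumptions:

```typescript
import { randomUUID } from "node:crypto"

// Sketch: generate an isolated collection/schema name per test so that
// parallel workers never touch the same data. The "test_" prefix and the
// exact name format are illustrative assumptions.
function isolatedName(base: string): string {
  // UUID suffix keeps names unique across parallel jobs and retries
  return `test_${base}_${randomUUID().replace(/-/g, "").slice(0, 12)}`
}

// Two tests asking for the same logical collection get distinct names
const listingsA = isolatedName("listings")
const listingsB = isolatedName("listings")
```

Dropping everything matching the prefix after the run keeps cleanup trivial.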

E2E Tests (Self-Contained and Reproducibility-Oriented)

We designed E2E tests based on the principles of “self-contained” and “full reproducibility.” We built a configuration that completes within a single CI job, without depending on external environments or manual operations.

Execution Pipeline (Single-Job Completion on GitHub Actions)

  1. Install dependencies & build
  2. Start the app (in the background)
  3. Initialize data
  4. Run Playwright
    • Trace collection: on-first-retry
    • Report output: html
  5. Artifact collection & cleanup
    • Save screenshots / reports / traces
    • Ensure processes are terminated in the final step (equivalent to finally)

Parallelization Strategy (Avoiding Instability)

  • Prioritize sharding: split the entire test suite into multiple jobs (shards) to shorten time.
    → Safely scale via CI matrix. Less likely to create differences from local runs.
  • Be cautious with workers: use workers=1 as the default.
    → Avoid test flakiness caused by port conflicts, shared state, and I/O load.
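Put together, the Playwright settings described in this section (trace collection on first retry, HTML report, conservative workers, sharding from CI) might look like the following playwright.config.ts. The exact values are assumptions based on the measures above, not the team's actual config:

```typescript
// playwright.config.ts: illustrative sketch of the settings described above.
// Exact values are assumptions, not the team's actual configuration.
import { defineConfig } from "@playwright/test"

export default defineConfig({
  retries: 2,                    // re-run failures before reporting them
  workers: 1,                    // conservative default: avoid port/state races
  reporter: [["html", { open: "never" }]],
  use: {
    trace: "on-first-retry",     // collect a trace only when a retry happens
    screenshot: "only-on-failure",
    video: "retain-on-failure",
  },
})

// Sharding is driven from the CI matrix rather than this file, e.g.:
//   npx playwright test --shard=1/4
```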

Operation and Improvement Phase

After introduction, we focused operations on continuously detecting and preventing regressions.

Points of Attention During and After Setup

  • Avoid “tests for the sake of tests”
    Add tests only in the necessary scope, starting from actual bugs or requirement changes. Position tests as a means for quality assurance, not as an end in themselves.

  • Write tests that reveal the cause when they fail
    Clarify test names and output messages. For example, enforce naming that conveys intent and preconditions in one line, such as should return 400 when missing header.

  • Reduce test maintenance costs
    Introduce shared fixtures / builders and centrally manage test data.
    Concentrate follow-up changes in one place and increase refactor tolerance.

  • Use “reliability” rather than “coverage” as the metric
    Instead of chasing coverage numbers, adopt “whether we can reliably notice when something breaks” as the primary metric.
    In test reviews, we also discussed whether there is a “guarantee of noticing.”

  • Balance with CI execution time
    Optimize parallel execution and cache strategies, and maintain a test foundation that completes within 10 minutes.
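The shared fixtures / builders mentioned above can be sketched as a small builder with overridable defaults. The User shape here is a hypothetical example, not the project's real model:

```typescript
// Sketch of a shared test-data builder: sensible defaults plus per-test
// overrides, so follow-up schema changes are absorbed in one place.
// The User shape is a hypothetical example, not the project's real model.
interface User {
  id: string
  name: string
  email: string
  role: "member" | "admin"
}

let seq = 0

function buildUser(overrides: Partial<User> = {}): User {
  seq += 1
  return {
    id: `user-${seq}`,
    name: `User ${seq}`,
    email: `user${seq}@example.com`,
    role: "member",
    ...overrides, // each test states only the fields it actually cares about
  }
}

const admin = buildUser({ role: "admin" })
const member = buildUser()
```

When a field is added to the model, only the builder changes; the tests that do not care about the new field keep passing untouched.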

Detection and Isolation of Flaky Tests

  • Failure re-run policy: retry: 2 / on-first-retry: trace.
  • Flake rate threshold: if it exceeds X%, attach a quarantine label, exclude it from the E2E suite, and improve it separately.
  • Standardization of failure logs: always save screenshots, videos, and traces as artifacts, and automatically attach reproduction steps.

Reducing Slow Tests

  • Categorization of bottlenecks: network waits, DB initialization, excessive rendering, excessive dependence on E2E.
  • Countermeasure catalog:
    • Convert API/DB checks into integration tests (reduce dependence on E2E)
    • Differential initialization of fixtures
      Instead of resetting all data every time, initialize only the range needed by the test.
      We designed it so that running the same process multiple times does not break the state, and greatly shortened execution time while maintaining reproducibility.
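The idempotent, differential initialization described above can be sketched with upsert-style seeding. A real version would target MongoDB; the in-memory Map here stands in for the collection so the idea stays dependency-free:

```typescript
// Sketch of idempotent, differential seeding: each test declares only the
// records it needs, and re-running the seed leaves the store unchanged.
// The in-memory Map stands in for the real MongoDB collection.
type Doc = { id: string; [key: string]: unknown }

function seed(store: Map<string, Doc>, docs: Doc[]): void {
  for (const doc of docs) {
    store.set(doc.id, doc) // upsert: insert or overwrite, never duplicate
  }
}

const listings = new Map<string, Doc>()
const needed: Doc[] = [{ id: "l1", title: "Test listing" }]

seed(listings, needed)
seed(listings, needed) // running the same seed twice does not break state
```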

Outcomes

  • We gained “confidence that nothing is broken,” which sped up decision-making for large refactors and dependency upgrades.
  • We formalized knowledge and understanding of specifications gained through incident response not as documents but as test code.

Next Steps

Introduction of Differential Test Execution (Test Selection)

Instead of running all tests every time, we will introduce a mechanism that re-runs only the affected scope based on change diffs (paths, commit history, dependency graph).

  • Automatically analyze test files corresponding to code changes
  • Cache the dependency graph of tests and run selected tests
  • Accumulate results as metadata to improve the accuracy of impact range estimation

This aims to shorten CI time while maintaining regression detection accuracy.
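A minimal version of the diff-to-tests mapping could lean on the one-to-one file layout established earlier. The ".test.ts" suffix convention is an assumption, and a real implementation would also consult a dependency graph:

```typescript
// Sketch of change-based test selection: map changed source files to their
// test files via the one-to-one layout convention. The ".test.ts" suffix is
// an assumed convention; a real version would also walk a dependency graph.
function selectTests(changedFiles: string[]): string[] {
  const selected = new Set<string>()
  for (const file of changedFiles) {
    if (file.endsWith(".test.ts")) {
      selected.add(file) // a changed test file always re-runs
    } else if (file.endsWith(".ts")) {
      selected.add(file.replace(/\.ts$/, ".test.ts"))
    }
  }
  return [...selected].sort()
}

const toRun = selectTests(["src/user/addUser.ts", "src/user/addUser.test.ts"])
```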

Dynamic Optimization of CI Parallelism (History-Based)

We will dynamically optimize CI parallelism based on test execution history.

  • Collect average and P95 execution times for each test suite
  • Analyze execution history and automatically adjust --shard count and workers count in the next job
  • Periodically rebalance and visualize resource utilization

This will allow us to control CI load in a data-driven way rather than with fixed values, optimizing the balance of time, resources, and reliability.

https://shinagawa-web.com/en/blogs/test-automation-enhancement

https://shinagawa-web.com/en/blogs/nextjs-app-router-testing-setup

References

Identifying and Improving Slow Tests

1. Network / External Dependencies (DB, API, S3, etc.)

Symptom: Multi-second blocking due to HTTP waits, DNS delays, and external rate limits.

Countermeasures (TypeScript)

  • Mock HTTP with nock / msw
  • Block external APIs with Playwright’s page.route()
  • Speed up DB with mongodb-memory-server / SQLite (in-memory)
  • Run migrations only once before the suite

Countermeasures (Go)

  • Use httptest.Server to localize external APIs
  • Make container reuse the default for testcontainers / dockertest (reduce startup overhead)

2. sleep / Timeout Waits / Polling

Symptom: Accumulation of sleep(1000) makes the entire test suite take minutes.

Countermeasures (TypeScript)

  • Use fake timers (Jest/Vitest)
  • Explicitly specify the minimum timeout for waitFor

Countermeasures (Go)

  • Abstract time dependence into a Clock interface and inject it
  • Eliminate direct use of time.After and use a fast clock in tests
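The Clock-injection idea can be sketched as follows (the article describes it for Go; shown in TypeScript here for consistency with the other examples):

```typescript
// Sketch of time injection: production code depends on a Clock interface,
// and tests advance a fake clock instantly instead of sleeping.
// (The article describes this for Go; shown in TypeScript for consistency.)
interface Clock {
  now(): number // milliseconds
}

class FakeClock implements Clock {
  private t = 0
  now() { return this.t }
  advance(ms: number) { this.t += ms }
}

// Example consumer: a session that expires after a TTL
function isExpired(startedAt: number, ttlMs: number, clock: Clock): boolean {
  return clock.now() - startedAt >= ttlMs
}

const clock = new FakeClock()
const startedAt = clock.now()
const before = isExpired(startedAt, 1000, clock) // no time has passed yet
clock.advance(1000)                              // jump forward, no sleep(1000)
const after = isExpired(startedAt, 1000, clock)
```

The test runs in microseconds because time is advanced synchronously rather than waited for.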

3. Heavy Crypto / Hash / Password Processing

Symptom: Cost of bcrypt / argon2 makes a single case take hundreds of ms to seconds.

Countermeasures

  • Lower the cost factor during tests
  • Swap the hash function for a faster implementation (switch via DI)
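The "switch via DI" countermeasure can be sketched with Node's built-in scrypt; the interface and cost values below are illustrative assumptions (the article mentions bcrypt / argon2, but scrypt keeps the sketch dependency-free):

```typescript
import { scryptSync, randomBytes, timingSafeEqual } from "node:crypto"

// Sketch of "swap the hash for a faster implementation via DI": production
// injects a high-cost hasher, tests inject a minimal-cost one. The interface
// and cost values are illustrative assumptions; the article mentions
// bcrypt / argon2, but Node's built-in scrypt keeps this self-contained.
interface PasswordHasher {
  hash(password: string): string
  verify(password: string, stored: string): boolean
}

function scryptHasher(cost: number): PasswordHasher {
  return {
    hash(password) {
      const salt = randomBytes(16).toString("hex")
      const key = scryptSync(password, salt, 32, { N: cost }).toString("hex")
      return `${cost}:${salt}:${key}` // cost is stored alongside the hash
    },
    verify(password, stored) {
      const [n, salt, key] = stored.split(":")
      const candidate = scryptSync(password, salt, 32, { N: Number(n) })
      return timingSafeEqual(candidate, Buffer.from(key, "hex"))
    },
  }
}

const prodHasher = scryptHasher(16384) // realistic cost in production
const testHasher = scryptHasher(2)     // minimal cost keeps tests fast

const stored = testHasher.hash("s3cret")
```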

4. Overuse of E2E

Symptom: E2E becomes the main tool and execution time becomes minutes.

Countermeasures

  • Optimize the test pyramid: downgrade E2E to Integration where possible
  • Limit E2E to critical paths
  • Eliminate unnecessary waitForTimeout

Measurement and Visualization (TypeScript)

You can extract particularly slow tests with jest-slow-test-reporter.

https://github.com/jodonnell/jest-slow-test-reporter

Jest also provides options for monitoring resource consumption and detecting handle leaks.

  • --logHeapUsage
    Outputs heap usage at the end of each test file.
    Enables early detection of memory leaks and cache bloat, and identification of heavy tests.

  • --detectOpenHandles
    Detects handles that remain open after execution (unclosed sockets, timers, etc.).
    Helps find missing awaits in asynchronous processing and contributes to stabilizing E2E and integration tests.
    Use only for debugging, not as a default.
