A Journey of Trial and Error to Eliminate 'Occasionally Failing Tests' in Go Concurrency

  • golang
Published on 2026/01/15


Expectations for Performance Improvement and the Technical Hurdles Faced in a CI Environment

When managing documentation in Markdown, checking for broken links is an essential process.

I am currently developing a Go-based Markdown linter called gomarklint. This tool validates the structure and style of Markdown, and one of its most important features is "checking the validity of external links."

How to Handle 180 Files and 100,000 Lines of Markdown

The target for this tool is the documentation of large-scale projects. For example, I'm envisioning cases where it needs to handle over 180 Markdown files totaling more than 100,000 lines.
(This is the volume of my own tech blog.)
Checking these files sequentially with a single thread would take an enormous amount of time, significantly harming the development experience.

Go's greatest strength lies in its powerful concurrency through Goroutines. "With proper parallelization, I should be able to verify a vast number of external links in an instant," I thought, and so I began optimizing the implementation.

The Emergence of the 'Unstable Test' Problem

The initial implementation was quite simple: extract the links and launch a goroutine with a go statement for each one. Benchmarks on my local environment showed the expected speed, and at first glance, it seemed like a success.

However, the real challenge emerged when I moved to a CI environment like GitHub Actions.

I encountered "Flaky Tests"—tests that produce different results with each run. Nine out of ten times they would pass, but the remaining one time, they would fail with an inexplicable error. When I tried to debug, I couldn't reproduce the issue locally.

What was the source of this eerie instability?

Through this problem, I learned that achieving both "speed" and "stability" in parallel HTTP requests requires more than just simple parallelization; it requires a certain "etiquette." This article traces the three technical hurdles I faced during the development of gomarklint and how I overcame them.

Step 1: Stopping the 'Barrage' of Requests to Single URLs

As the first step in parallelization, I implemented a system that assigned a Goroutine to each extracted link sequentially. However, I immediately ran into the first problem.

Redundant Requests to the Same URL

Within Markdown documents, the same URL (e.g., the project's homepage, a common document, or a link to a GitHub repository) often appears repeatedly.

With simple parallelization, an HTTP request is sent for every occurrence of that URL, almost simultaneously. When dealing with hundreds of files, this results in a massive concentration of requests to the same host in a short period.

This is not only a waste of resources but can also be perceived by the server as a "DoS attack" or "suspicious access," leading to rejected requests (429 Too Many Requests) or IP-based rate limiting.

Implementing a URL Cache with sync.Map

To solve this problem, I introduced a URL cache that could be used across files. This mechanism saves the result of a URL check once it's completed and reuses that result for subsequent checks without making a network request.

When performing concurrent operations in Go, a standard map is not safe for concurrent use: reading and writing it from multiple Goroutines at the same time triggers a fatal "concurrent map read and map write" error. Therefore, I used the standard library's sync.Map.

urlCache := &sync.Map{}

// Reuse the result if this URL has already been checked.
if cachedStatus, ok := urlCache.Load(url); ok {
    status = cachedStatus.(int)
} else {
    // First time seeing this URL: hit the network, then cache the result.
    status, err = checkURL(client, url)
    urlCache.Store(url, status)
}

Observations After Implementation

Introducing this cache dramatically reduced network traffic. Especially for documentation with many links to the same domain, the execution speed improved further, and the load on the server was kept within a "polite" range.

However, this alone was not enough to eliminate the "occasionally failing tests."

Step 2: Controlling Physical 'Resource Limits' (Semaphores)

By introducing a URL cache, I eliminated duplicate requests to the same URL. However, when the number of unique URLs to be checked is vast, a new problem arises.

Network Saturation Caused by Thousands of Goroutines

For example, when checking a document containing 1,000 different URLs, 1,000 Goroutines still attempt to make HTTP requests almost simultaneously.

This "instantaneous burst of connections" creates the following risks:

  • Local Resource Exhaustion: The OS may hit its limit on the number of file descriptors (sockets) it can open at once, causing connection errors.
  • Network Instability: A large volume of packets flowing in a short time can cause some requests to time out.
  • CI Environment Constraints: Shared environments like GitHub Actions often have limits on network bandwidth and concurrent connections, making errors more likely than on a local machine.

This was one of the reasons why tests were "occasionally" failing only in the CI environment.

Implementing a 'Semaphore' with Channels

In Go, you can easily implement the "Semaphore" pattern to limit the number of concurrently executing Goroutines using a buffered channel.

maxConcurrency := 10
sem := make(chan struct{}, maxConcurrency)
var wg sync.WaitGroup

for url, lines := range urlToLines {
    wg.Add(1)

    sem <- struct{}{} // acquire a slot; blocks while maxConcurrency checks are in flight

    go func(u string, lns []int) {
        defer wg.Done()
        defer func() { <-sem }() // release the slot when this check finishes

        // ... check the URL and record results for the given lines ...
    }(url, lines)
}
wg.Wait()

Observations After Implementation

By introducing a semaphore, requests began to flow "in sequence" rather than "all at once." While this might seem to slow down processing, it actually drastically reduced timeout errors, and as a result, the overall execution time became more stable.

I learned the importance of implementing client-side flow control (Throttling), based on the premise that "physical resources have limits."

However, even after taking these measures, the tests still sometimes turned red.

Step 3: Tolerating Momentary 'Whims' (Retries)

I had reduced waste with caching and controlled the flow with a semaphore. In theory, this should have made things stable, but the world of networking has unavoidable "Transient Failures."

The Uncertainty of Networks Where '100%' Doesn't Exist

The remote server might be momentarily overloaded, a packet might be lost along the network path, or the unstable Wi-Fi (or virtual network) of the CI environment might briefly disconnect... These cases of "it just happened to fail at that exact moment" are impossible to prevent, no matter how perfectly you control your side.

This was the final culprit behind the Flaky Tests occurring in the CI environment.

'Smart Retries' and Exponential Backoff

Simply retrying immediately after an error can just add to the load on a server that is already down. What's important here is a retry strategy that incorporates Exponential Backoff.

It's also crucial to distinguish between "errors that are futile to retry" (like a 404 Not Found) and "errors worth retrying" (like 5xx server errors or connection timeouts).

func checkURLWithRetry(client *http.Client, url string) (int, error) {
    const maxRetries = 2
    const retryDelay = 2 * time.Second

    var status int
    var err error

    for i := 0; i <= maxRetries; i++ {
        if i > 0 {
            // Back off longer before each retry (2s, then 4s).
            time.Sleep(retryDelay * time.Duration(i))
        }

        status, err = performCheck(client, url)

        // Stop early on success, or on definitive client errors
        // (404 / 401) that a retry cannot fix.
        if err == nil && (status < 400 || status == 404 || status == 401) {
            return status, nil
        }
    }
    return status, err
}

Observations After Implementation

By incorporating this "persistence" into the implementation, the test results on CI finally began to stabilize. I learned that instead of giving up after a single failure, providing a grace period to "ask again after a short wait" is proper etiquette when interacting with a distributed system (the Web).

The Trap of 'Half-Baked Caching' That Undermines All Improvements

Now, there's one more improvement that was necessary to eliminate the "occasionally failing tests" mentioned in this article's title.

In fact, even after implementing Steps 1 through 3, my tests were still turning red from time to time. After digging into the cause, I arrived at a blind spot: "what was being cached."

Initially, I was only storing the HTTP status code (an int) in the sync.Map.

  • First Access: A network error (e.g., timeout) occurs. The status code is 0, and the error is timeout.
  • Cache Storage: Only 0 is stored in the map.
  • Second Access (from another file, etc.): The value 0 is retrieved from the cache. However, since the cache doesn't retain the information that "an error occurred," the caller receives a contradictory state: "no error, but the status is 0 (neither success nor failure)."

This inconsistency was corrupting the link-checking logic and making the tests unstable.

Solution: Cache the Entire Result in a Struct

Both successes and failures must be treated equally as "the result of checking that URL." To do this, I needed to store a struct containing both the status code and the error object in the cache.

// checkResult bundles everything learned from one check, including
// any network error, so that failures are cached as well.
type checkResult struct {
    status int
    err    error
}

urlCache.Store(url, &checkResult{status: status, err: err})

By correctly implementing this Negative Caching (caching failure results), the consistency during parallel execution was finally fully maintained.

Conclusion: The Never-Ending Road to Stable Concurrency

By assembling the three sacred treasures of "caching," "semaphores," and "retries," and further implementing Negative Caching to include failure results, my external link checker became dramatically more stable.

However, even with all this, I can't say that I've managed to eliminate "Flaky Tests" 100% in my environment.

Something Is Still Missing

No matter how many countermeasures you stack up, arriving at the "correct" solution for network-related concurrency is not easy. For example, the following cases remain as concerns:

  • Cache Stampede: The problem where a second and third Goroutine simultaneously initiate requests for the same URL in the "brief moment" before the first cache entry is written.
  • Fine-tuning http.Client: How do low-level settings like MaxIdleConnsPerHost affect performance during parallel execution?
  • CI Environment-Specific Fluctuations: Unpredictable packet loss and latency in a virtualized network environment.

The world of concurrent programming is deep, and every time you overcome one wall, a new one appears. That is both the difficulty and the fun of this development process.

Please Lend Me Your Expertise (A Request to Readers)

The code incorporating all of these measures is currently available as gomarklint.

If you've read this article and have any insights, such as "this part looks suspicious," "for Go's http.Client, you should use this setting," or "for Cache Stampede, you should use singleflight," I would be very grateful if you could share them.

  • GitHub Issues / PRs: Specific improvement suggestions and bug reports are highly welcome.
  • Comments / Social Media: Stories about "how I solved it" are also very encouraging.

The journey to achieve both "blazing speed" and "absolute stability" continues. I would be delighted if you could help me grow this linter into a more robust tool with your expertise.

https://github.com/shinagawa-web/gomarklint

