A Journey of Trial and Error to Eliminate 'Occasionally Failing Tests' in Go Concurrency
Expectations for Performance Improvement and the Technical Hurdles Faced in a CI Environment
When managing documentation in Markdown, checking for broken links is an essential process.
I am currently developing a Go-based Markdown linter called gomarklint. This tool validates the structure and style of Markdown, and one of its most important features is "checking the validity of external links."
How to Handle 180 Files and 100,000 Lines of Markdown
The target for this tool is the documentation of large-scale projects. For example, I'm envisioning cases where it needs to handle over 180 Markdown files totaling more than 100,000 lines.
(This is the volume of my own tech blog.)
Checking these files sequentially with a single thread would take an enormous amount of time, significantly harming the development experience.
Go's greatest strength lies in its powerful concurrency through Goroutines. "With proper parallelization, I should be able to verify a vast number of external links in an instant," I thought, and so I began optimizing the implementation.
The Emergence of the 'Unstable Test' Problem
The initial implementation was quite simple: extract the links and launch a goroutine (a go statement) for each one. Benchmarks in my local environment showed the expected speedup, and at first glance, it seemed like a success.
However, the real challenge emerged when I moved to a CI environment like GitHub Actions.
I encountered "Flaky Tests"—tests that produce different results with each run. Nine out of ten times they would pass, but the remaining one time, they would fail with an inexplicable error. When I tried to debug, I couldn't reproduce the issue locally.
What was the source of this eerie instability?
Through this problem, I learned that achieving both "speed" and "stability" in parallel HTTP requests requires more than just simple parallelization; it requires a certain "etiquette." This article traces the three technical hurdles I faced during the development of gomarklint and how I overcame them.
Step 1: Stopping the 'Barrage' of Requests to Single URLs
As the first step in parallelization, I implemented a system that assigned a Goroutine to each extracted link sequentially. However, I immediately ran into the first problem.
Redundant Requests to the Same URL
Within Markdown documents, the same URL (e.g., the project's homepage, a common document, or a link to a GitHub repository) often appears repeatedly.
With simple parallelization, an HTTP request is sent for every occurrence of that URL, almost simultaneously. When dealing with hundreds of files, this results in a massive concentration of requests to the same host in a short period.
This is not only a waste of resources but can also be perceived by the server as a "DoS attack" or "suspicious access," leading to the server rejecting requests (429 Too Many Requests) or imposing IP-based rate limits.
Implementing a URL Cache with sync.Map
To solve this problem, I introduced a URL cache that could be used across files. This mechanism saves the result of a URL check once it's completed and reuses that result for subsequent checks without making a network request.
When performing concurrent operations in Go, a standard map is not safe for concurrent use: writing to it from multiple Goroutines at the same time (or reading while another Goroutine writes) triggers a fatal runtime error. Therefore, I used the standard library's sync.Map.
urlCache := &sync.Map{}

// Reuse a previous result if this URL has already been checked.
if cachedStatus, ok := urlCache.Load(url); ok {
    status = cachedStatus.(int)
} else {
    // First time seeing this URL: perform the actual HTTP check and cache the result.
    status, err = checkURL(client, url)
    urlCache.Store(url, status)
}
Observations After Implementation
Introducing this cache dramatically reduced network traffic. Especially for documentation with many links to the same domain, the execution speed improved further, and the load on the server was kept within a "polite" range.
However, this alone was not enough to eliminate the "occasionally failing tests."
Step 2: Controlling Physical 'Resource Limits' (Semaphores)
By introducing a URL cache, I eliminated duplicate requests to the same URL. However, when the number of unique URLs to be checked is vast, a new problem arises.
Network Saturation Caused by Thousands of Goroutines
For example, when checking a document containing 1,000 different URLs, 1,000 Goroutines still attempt to make HTTP requests almost simultaneously.
This "instantaneous burst of connections" creates the following risks:
- Local Resource Exhaustion: The OS may hit its limit on the number of file descriptors (sockets) it can open at once, causing connection errors.
- Network Instability: A large volume of packets flowing in a short time can cause some requests to time out.
- CI Environment Constraints: Shared environments like GitHub Actions often have limits on network bandwidth and concurrent connections, making errors more likely than on a local machine.
This was one of the reasons why tests were "occasionally" failing only in the CI environment.
Implementing a 'Semaphore' with Channels
In Go, you can easily implement the "Semaphore" pattern to limit the number of concurrently executing Goroutines using a buffered channel.
maxConcurrency := 10
sem := make(chan struct{}, maxConcurrency) // buffered channel used as a counting semaphore

var wg sync.WaitGroup
for url, lines := range urlToLines {
    wg.Add(1)
    sem <- struct{}{} // blocks while maxConcurrency checks are already in flight
    go func(u string, lns []int) {
        defer wg.Done()
        defer func() { <-sem }() // release the slot when this check finishes
        // ... perform the link check for u here ...
    }(url, lines)
}
wg.Wait()
Observations After Implementation
By introducing a semaphore, requests began to flow "in sequence" rather than "all at once." While this might seem to slow down processing, it actually drastically reduced timeout errors, and as a result, the overall execution time became more stable.
I learned the importance of implementing client-side flow control (Throttling), based on the premise that "physical resources have limits."
However, even after taking these measures, the tests still sometimes turned red.
Step 3: Tolerating Momentary 'Whims' (Retries)
I had reduced waste with caching and controlled the flow with a semaphore. In theory, this should have made things stable, but the world of networking has unavoidable "Transient Failures."
The Uncertainty of Networks Where '100%' Doesn't Exist
The remote server might be momentarily overloaded, a packet might be lost along the network path, or the unstable Wi-Fi (or virtual network) of the CI environment might briefly disconnect... These cases of "it just happened to fail at that exact moment" are impossible to prevent, no matter how perfectly you control your side.
This was the final culprit behind the Flaky Tests occurring in the CI environment.
'Smart Retries' and Exponential Backoff
Simply retrying immediately after an error can just add to the load on a server that is already down. What's important here is a retry strategy that incorporates Exponential Backoff.
It's also crucial to distinguish between "errors that are futile to retry" (like a 404 Not Found) and "errors worth retrying" (like 5xx server errors or connection timeouts).
func checkURLWithRetry(client *http.Client, url string) (int, error) {
    const maxRetries = 2 // up to two additional attempts after the first
    const retryDelay = 2 * time.Second

    var status int
    var err error
    for i := 0; i <= maxRetries; i++ {
        if i > 0 {
            // Back off before retrying; the delay grows with each attempt (2s, then 4s).
            time.Sleep(retryDelay * time.Duration(i))
        }
        status, err = performCheck(client, url)
        // Return immediately on success, or on definitive client errors (404, 401)
        // that retrying won't change.
        if err == nil && (status < 400 || status == 404 || status == 401) {
            return status, nil
        }
    }
    return status, err
}
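As a possible refinement not shown in the snippet above, the retry policy from the previous paragraph could be factored into a small predicate; shouldRetry here is a hypothetical name, not something that exists in gomarklint today:

// shouldRetry (hypothetical helper): treat network-level failures and 5xx
// responses as transient, and everything else as a definitive result.
func shouldRetry(status int, err error) bool {
    if err != nil {
        return true // timeouts, connection resets, DNS hiccups: worth another attempt
    }
    return status >= 500 // server errors are often temporary; 4xx results are not
}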
Observations After Implementation
By incorporating this "persistence" into the implementation, the test results on CI finally began to stabilize. I learned that instead of giving up after a single failure, providing a grace period to "ask again after a short wait" is proper etiquette when interacting with a distributed system (the Web).
The Trap of 'Half-Baked Caching' That Undermines All Improvements
Now, there's one more improvement that was necessary to eliminate the "occasionally failing tests" mentioned in this article's title.
In fact, even after implementing Steps 1 through 3, my tests were still turning red from time to time. After digging into the cause, I arrived at a blind spot: what was being cached.
Initially, I was only storing the HTTP status code (an int) in the sync.Map.
- First Access: A network error (e.g., a timeout) occurs. The status code is 0, and the error is a timeout.
- Cache Storage: Only the status code 0 is stored in the map.
- Second Access (from another file, etc.): The value 0 is retrieved from the cache. However, since the cache doesn't retain the information that "an error occurred," the caller receives a contradictory state: "no error, but the status is 0 (neither success nor failure)."
This inconsistency was corrupting the link-checking logic and making the tests unstable.
Solution: Cache the Entire Result in a Struct
Both successes and failures must be treated equally as "the result of checking that URL." To do this, I needed to store a struct containing both the status code and the error object in the cache.
// checkResult bundles the outcome of a single URL check, success or failure.
type checkResult struct {
    status int
    err    error
}

// Store the full result, including the error, so failures are cached too.
urlCache.Store(url, &checkResult{status: status, err: err})
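The lookup side changes accordingly. Here is a minimal sketch of the cache-hit path with this struct, reusing the variable names from the Step 1 snippet (the actual code in gomarklint may differ slightly):

if v, ok := urlCache.Load(url); ok {
    cached := v.(*checkResult)
    status, err = cached.status, cached.err // replay the earlier outcome, error included
} else {
    status, err = checkURL(client, url)
    urlCache.Store(url, &checkResult{status: status, err: err})
}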
By correctly implementing this Negative Caching (caching failure results), the consistency during parallel execution was finally fully maintained.
Conclusion: The Never-Ending Road to Stable Concurrency
By assembling the three sacred treasures of "caching," "semaphores," and "retries," and further implementing Negative Caching to include failure results, my external link checker became dramatically more stable.
However, even with all this, I can't say that I've managed to eliminate "Flaky Tests" 100% in my environment.
Something Is Still Missing
No matter how many countermeasures you stack up, arriving at the "correct" solution for network-related concurrency is not easy. For example, the following cases remain as concerns:
- Cache Stampede: The problem where a second and third Goroutine initiate requests for the same URL in the brief moment before the first result has been written to the cache (see the sketch after this list).
- Fine-tuning http.Client: How do low-level settings like MaxIdleConnsPerHost affect performance during parallel execution?
- CI Environment-Specific Fluctuations: Unpredictable packet loss and latency in a virtualized network environment.
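For the Cache Stampede concern, one direction that keeps coming up is golang.org/x/sync/singleflight, which ensures that concurrent callers asking for the same key share a single in-flight call. The following is only a sketch of how it might wrap the check, not what gomarklint currently does; checkURLShared and group are hypothetical names:

import "golang.org/x/sync/singleflight"

var group singleflight.Group

// checkURLShared lets concurrent Goroutines asking about the same URL share one
// HTTP request; only the first caller actually hits the network.
func checkURLShared(client *http.Client, url string) (int, error) {
    v, err, _ := group.Do(url, func() (interface{}, error) {
        status, checkErr := checkURL(client, url)
        // Return the status even on failure so every waiter sees the same result.
        return status, checkErr
    })
    return v.(int), err
}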
The world of concurrent programming is deep, and every time you overcome one wall, a new one appears. That is both the difficulty and the fun of this development process.
Please Lend Me Your Expertise (A Request to Readers)
The code incorporating these processes is currently available as gomarklint.
If you've read this article and have any insights, such as "this part looks suspicious," "for Go's http.Client, you should use this setting," or "for Cache Stampede, you should use singleflight," I would be very grateful if you could share them.
- GitHub Issues / PRs: Specific improvement suggestions and bug reports are highly welcome.
- Comments / Social Media: Stories about "how I solved it" are also very encouraging.
The journey to achieve both "blazing speed" and "absolute stability" continues. I would be delighted if you could help me grow this linter into a more robust tool with your expertise.