The Flaky Tests Truce

There are two camps in the flaky test discourse, and they could not be further apart.

The first camp refuses to retry a failed test under any circumstances. To them, a retry is a cover-up — it hides real signal, papers over instability, and lets bad tests stay in the suite indefinitely. The pitch is clean: if a test isn’t reliable enough to pass the first time, it isn’t reliable enough to be in the suite. Failed runs stay failed. Engineers fix the test or delete it. The bar is high on purpose, because the bar is what keeps the suite trustworthy. If you’ve ever inherited a suite with a hundred “known flakes” and tried to dig out, you understand exactly where this camp is coming from.

The second camp has gone the other direction and made peace with the chaos. Flaky tests scattered through the suite, half the team’s Slack channel devoted to “is this one real?”, and a collective shrug whenever someone asks. Oh, that test? Yeah, it’s just flaky. Rerun it. The pitch here is also clean: tests are infrastructure, the product is what matters, and we have releases to ship. Spending an afternoon chasing a 1% flake on a Tuesday is a worse use of engineering time than just clicking “rerun” and getting on with it. So that’s what they do. Every time.

I’d like to propose a truce.

Tests should serve the product

This is the load-bearing idea, so I’m putting it up front.

The point of an automated test suite is not to be beautiful. It is not to be deterministic for its own sake. It is to give you confidence that your product hasn’t regressed, fast enough that you can actually ship. Everything else — the architecture, the helpers, the page objects, the dashboards — is in service of that.

Once you internalize that, a lot of flaky-test arguments dissolve. A test that flakes 1% of the time but passes on retry is still telling you the truth about the product. The product didn’t regress. You found that out. You shipped. The system worked.

A test you delete because it occasionally flakes tells you nothing about the product, ever again. That is a much worse outcome than a retry.

Retry your failed tests. With receipts.

I am 100% in favor of retrying failed tests. The math is simple: a flaky test that passes on retry costs you a few minutes of CI time. A flaky test that fails the build costs you an engineer’s afternoon, a Slack thread, three people pulled in to “take a look,” and — worst of all — slowly trains your team to assume red builds are noise. That last cost is the one that kills you.

The catch, and this is non-negotiable: you have to record every retry. A retry that nobody sees is just a slow way to hide a problem. A retry that lands in a dashboard is data.

Most modern test runners save every attempt by default — Playwright certainly does — so you already have the raw material. Build a simple dashboard on top of it. Test name, pass rate over the last N runs, retry rate, last failure timestamp. That’s it. You don’t need Datadog. At my current job, we run a lightweight REST service that logs every test attempt and its result, with a small dashboard sitting on top of that data. Nothing fancy. But once you can see the pattern, you can act on it.
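
If you happen to be on Playwright, the whole loop (retry, record, report) fits in one config line plus a tiny custom reporter. Here is a minimal sketch; the endpoint URL and payload shape are stand-ins for whatever your own logging service expects, not a real API:

    // attempt-logger.ts: a minimal sketch of a custom Playwright reporter.
    // The endpoint URL and payload fields are stand-ins for your own logging service.
    import type { Reporter, TestCase, TestResult } from '@playwright/test/reporter';

    type Attempt = {
      test: string;       // full title path, e.g. "chromium > checkout.spec.ts > pays with saved card"
      status: string;     // 'passed' | 'failed' | 'timedOut' | 'skipped' | 'interrupted'
      retry: number;      // 0 for the first attempt, 1+ for retries
      durationMs: number;
      startedAt: string;
    };

    class AttemptLogger implements Reporter {
      private attempts: Attempt[] = [];

      onTestEnd(test: TestCase, result: TestResult) {
        // Playwright calls this once per attempt, so retries land as separate rows.
        this.attempts.push({
          test: test.titlePath().join(' > '),
          status: result.status,
          retry: result.retry,
          durationMs: result.duration,
          startedAt: result.startTime.toISOString(),
        });
      }

      async onEnd() {
        // Hypothetical collection endpoint; Node 18+ has fetch built in.
        await fetch('https://test-results.internal/api/attempts', {
          method: 'POST',
          headers: { 'Content-Type': 'application/json' },
          body: JSON.stringify(this.attempts),
        });
      }
    }

    export default AttemptLogger;

In playwright.config.ts that reporter just sits alongside whatever you already use, something like retries: 2 and reporter: [['list'], ['./attempt-logger.ts']], so retries still happen and every attempt still gets logged.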

If your runner doesn’t save every attempt, find a way to make it save every attempt. Wrap the runner. Pipe the JUnit XML somewhere. Add a JUnit 5 extension. Whatever it takes — this is worth a day of plumbing because without it you are flying blind, and flying blind is how flaky tests turn into ignored tests.
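
As one example of the JUnit-XML route, here is a rough sketch that scrapes a junit.xml after each CI run and posts every test case to the same hypothetical endpoint as above. It is deliberately crude (regexes rather than a real XML parser), and note that many runners only write retried attempts into the XML if you explicitly configure them to:

    // pipe-junit.ts: a rough sketch of the "pipe the JUnit XML somewhere" option.
    // Assumes one junit.xml per run and the same hypothetical endpoint as above.
    // Crude on purpose; reach for a real XML parser if this becomes permanent.
    import { readFileSync } from 'node:fs';

    const xml = readFileSync(process.argv[2] ?? 'junit.xml', 'utf8');

    const attempts = [...xml.matchAll(/<testcase\b[^>]*(?:\/>|>[\s\S]*?<\/testcase>)/g)].map(([tag]) => ({
      test: `${tag.match(/\bclassname="([^"]*)"/)?.[1] ?? ''} ${tag.match(/\sname="([^"]*)"/)?.[1] ?? ''}`.trim(),
      status: /<(failure|error)\b/.test(tag) ? 'failed'
            : /<skipped\b/.test(tag) ? 'skipped'
            : 'passed',
      durationMs: Math.round(parseFloat(tag.match(/\stime="([^"]*)"/)?.[1] ?? '0') * 1000),
      startedAt: new Date().toISOString(), // JUnit XML has no per-case start time
    }));

    await fetch('https://test-results.internal/api/attempts', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(attempts),
    });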

Your flaky test might be a flaky bug

Here’s the part that should give the rerun-and-forget crowd pause, and the reason the dashboard matters so much.

A test that flakes is, by definition, telling you that the system under test is non-deterministic from the test’s point of view. Sometimes that’s the test’s fault — a bad locator, a missing wait, a shared piece of state. Often enough, though, it isn’t. The test is correct. The product is the thing that’s misbehaving 2% of the time. And because it’s only 2%, no real user has filed a ticket about it yet.

Congratulations. You have not found a flaky test. You have found a flaky bug.

A few patterns I look for, ordered loosely from “this is on you” to “this is on the product”:

  • Same step, same data, different runs. Usually test infra — network blips, parallelism, shared state between tests. Annoying, but yours to fix.
  • Failures track with specific data. If the same fixture or user account fails far more often than the others, the test is fine. You’ve found an edge case in the code path the test happens to run through.
  • Only fails under load, or only in CI. A real race condition in the application, surfacing when something else is competing for resources. Locally, with one test running, the bug hides.
  • Flake rate creeps up over time, with no test changes. The application drifted underneath the test. New async path, performance regression, a service got slower — the test is the messenger.
  • Fails on one browser, passes on the others. A real cross-browser bug. Especially Safari. Especially iOS Safari. You knew I was going to say that.

The dashboard is what makes these patterns visible. Without it, every flake looks like every other flake, and the rational move becomes “rerun and forget.” With it, you can see that the checkout test only flakes on Wednesday afternoons, which is when the analytics team kicks off their batch job, which is hammering a service the checkout depends on, which is the actual story.
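
The rollup behind that dashboard does not have to be clever. Here is a small sketch, assuming the attempt records logged earlier; the field names are illustrative:

    // flake-report.ts: a small sketch of the rollup a flake dashboard needs.
    // Assumes the attempt records logged above; field names are illustrative.
    type Attempt = { test: string; status: string; retry: number; startedAt: string };

    export function flakeReport(attempts: Attempt[]) {
      // Group every attempt by test name.
      const byTest = new Map<string, Attempt[]>();
      for (const a of attempts) {
        const runs = byTest.get(a.test) ?? [];
        runs.push(a);
        byTest.set(a.test, runs);
      }

      return [...byTest.entries()]
        .map(([test, runs]) => {
          const failures = runs.filter((r) => r.status !== 'passed' && r.status !== 'skipped');
          return {
            test,
            attempts: runs.length,
            failRate: failures.length / runs.length,
            retryRate: runs.filter((r) => r.retry > 0).length / runs.length,
            lastFailure: failures.map((f) => f.startedAt).sort().at(-1) ?? null,
          };
        })
        .sort((a, b) => b.failRate - a.failRate); // flakiest first: the "top of the dashboard"
    }

Sort descending by failure rate and you have the top of the dashboard; slicing the same records by fixture, browser, or day of the week is what turns a raw flake count into a story like the Wednesday-afternoon one.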

Don’t be precious. Don’t be defeated.

The truce, then, looks like this:

  • Flaky tests are a fact of life in any test suite that touches a real browser or a real device. Pretending otherwise costs you more than it saves.
  • Retry on failure. Be unembarrassed about it. The product is what matters.
  • Track every retry. A flake you can’t see is a flake you can’t fix.
  • Use slack time — the half-hour between meetings, the slow Friday afternoon — to work through the top of the dashboard. Not all of it. The top of it.
  • When you investigate, hold open the possibility that the test is right and the product is wrong. That’s where the highest-value bugs hide.

What you’re aiming for isn’t zero flakes. It’s a flake rate low enough that a red build means something, and a system honest enough about its own flakes that you can trust the green ones.

That’s the deal. Take pride in your tests — but also, don’t be defeated by them. And for the love of all things good in the world, stop deleting the ones that pass on retry — they’re trying to tell you something.

