Most Test Frameworks Are Over-Engineered

I want to tell you about a test framework.

It was beautiful. It had a custom DSL that read like English. Every Selenium action was wrapped in a helper that handled retries, logging, and screenshots automatically. We had a single custom RSpec matcher that every line of every test ran through — uniform syntax everywhere, predictable structure, very tidy on the page. We didn’t trust Selenium’s built-in waits, so we rolled our own polling logic. Tests were short. Test code looked clean. Onboarding new SDETs was a one-day affair, on the surface.

It was a Ruby framework — Selenium plus RSpec, written back when the org was a Ruby on Rails shop and a Ruby test stack was the obvious call. By the time I inherited it, the Rails shop had moved on, but the framework hadn’t. And that’s where the bill started arriving.

I’m going to tell on this framework for a while, because the lessons it taught me apply to almost every “ergonomic” test framework I’ve seen in the years since. And I’m going to tell on myself, too, because I inherited this thing and didn’t push to rip it out for a long time. It worked. That phrase has covered a multitude of sins in our industry, and it covered mine.

Toolboxes, not opinions

Here’s the thesis, before the war story:

A test framework should be a toolbox, not an opinion.

Its job is to handle the things the person writing the test shouldn’t have to think about — environment setup, secrets, logging, screenshots on failure, parallel coordination, that kind of plumbing. If a test writer needs to reach out to Twilio to verify a text message arrived, the framework should give them Twilio.last_message_to(phone) and disappear. That’s what abstraction is for.
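A sketch of what that kind of helper might look like in Ruby. Everything here is hypothetical — the module name, the client interface, the message shape — and the client is injectable so the sketch runs without real credentials; a production version would wrap Twilio's REST API rather than a fake.

```ruby
# Hypothetical sketch of the kind of helper a framework should provide:
# one call that hides all the SMS plumbing. The client is injectable so
# this sketch is self-contained; a production version would wrap
# Twilio's REST API instead of FakeSmsClient.
module SmsHelper
  class << self
    attr_writer :client

    # The entire API surface the test writer sees.
    def last_message_to(phone)
      latest = @client.messages_to(phone).max_by { |m| m[:sent_at] }
      latest && latest[:body]
    end
  end
end

# A self-contained stand-in for the real SMS provider client.
class FakeSmsClient
  def initialize(messages)
    @messages = messages
  end

  def messages_to(phone)
    @messages.select { |m| m[:to] == phone }
  end
end

SmsHelper.client = FakeSmsClient.new([
  { to: "+15551234567", body: "Your code is 111111", sent_at: 1 },
  { to: "+15551234567", body: "Your code is 424242", sent_at: 2 },
])

puts SmsHelper.last_message_to("+15551234567") # => "Your code is 424242"
```

The test writer gets one line; the retries, auth, and pagination live behind it, where they belong.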

The actual middle of the test — the locators, the clicks, the assertions, the waits, the network interceptions — should look as close as possible to whatever the underlying tool already gives you. If Selenium has a WebDriverWait that does the right thing, the test writer should learn WebDriverWait. If Playwright has page.getByRole, that’s what shows up in the test. The framework’s job is not to replace the underlying tool. It’s to quietly handle the things around the underlying tool.

This sounds obvious when written down. In practice, almost nobody does it.

What we built

The Ruby framework, in shape:

  • Every action was wrapped. click_submit_button looked clean, but it was a half-page method that did its own existence check, its own visibility check, its own retry logic, its own logging, and finally got around to clicking. Multiply this by every action — and every page object had a method per element — and you have a small private programming language inside RSpec.
  • One custom RSpec matcher to rule them all. Every wrapped action returned a string — "PASS" or "FAIL" — and a single matcher checked the string. Every line of every test looked like this:

        expect(@login_page.click_submit_button).to pass
        expect(@login_page.enter_username("kevin")).to pass
        expect(@dashboard_page.welcome_banner_visible?).to pass

    Every. Single. Line. The same matcher. On paper, this was elegant — uniform syntax, predictable structure. In practice, it meant that by the time the matcher saw the result, every piece of useful diagnostic information about why something failed had already been thrown away inside the wrapper. RSpec’s whole strength is rich failure output; we’d built a system that handed it nothing.
  • A custom reporter to compensate. Because RSpec’s default output was now useless to us — every failure just said "expected to pass, got FAIL" — we wrote our own reporter to surface the diagnostic detail we’d swallowed three layers earlier. Layer N+1 to fix a problem layer N created. This is how frameworks die.
  • Custom waits. We didn’t trust Selenium::WebDriver::Wait, so we rolled our own polling helpers with our own timeouts and our own sleep logic. Every test that needed to wait for something used our wait, not Selenium’s.
  • A deep page object hierarchy. Base page inherited helpers that called other helpers that called the wrappers. Tracing a single action through the stack required holding three or four files in your head at once.
  • And all of it lived in a single Ruby gem, shared across every business unit that did automation. One gem to serve every team, every product, every dependency, every edge case. Which meant the gem couldn’t say no to anyone — every special case anyone needed had to be supported, every dependency anyone pulled in had to be reconciled, every breaking change had to be coordinated across teams that didn’t share a release schedule. The framework wasn’t just over-engineered. It was structurally required to be over-engineered, because we’d put the entire org’s testing in one shared library and given it no way to refuse a request.

Each of these decisions had a defensible reason at the time. Each of them, in isolation, was a small ergonomic win. Stacked together, over years, they were a trap.
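A minimal, framework-free Ruby reconstruction of that wrapper-plus-matcher pattern (every name here is hypothetical, and a fake driver stands in for Selenium) shows exactly where the diagnostics go to die:

```ruby
# Hypothetical reconstruction of the pattern: every action wrapper does
# its own checks, rescues everything, and reduces the outcome to a bare
# string, so by the time any matcher sees the result, the real reason
# for a failure is already gone.
class LoginPage
  def initialize(driver)
    @driver = driver
  end

  def click_submit_button
    element = @driver.find(:submit)        # existence check
    return "FAIL" unless element.visible?  # visibility check
    element.click
    "PASS"
  rescue StandardError => e
    # The diagnostic detail dies here. Callers only ever see "FAIL".
    puts "click_submit_button failed: #{e.class}: #{e.message}"
    "FAIL"
  end
end

# A stand-in driver whose submit button raises, as a flaky page might.
class FakeDriver
  class Element
    def visible?
      true
    end

    def click
      raise "stale element reference"
    end
  end

  def find(_locator)
    Element.new
  end
end

result = LoginPage.new(FakeDriver.new).click_submit_button
puts result # => "FAIL" -- the stale-element message never reaches the test
```

The stale-element error — the one thing an investigator actually needs — is flattened into a string before any assertion runs, which is why we then needed a custom reporter to dig it back out.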

The bill

The first cost showed up in onboarding. Or, more precisely, in re-onboarding — bringing in a new SDET who already knew Ruby, already knew Selenium, already knew RSpec, and still couldn’t write a test on day one because nothing they knew applied. Our framework had reinvented every piece of public knowledge they were carrying with them. To learn our framework, you had to learn it from scratch, from us. There was nowhere else to look.

You couldn’t Google a problem. Stack Overflow had no idea what our matchers did. The Selenium docs were no help, because we’d hidden Selenium behind two layers of wrappers. The RSpec docs explained how matchers worked in general, but said nothing about our matchers. Every question routed back to one of two or three people on the team who knew the framework deeply.

One of those people was me. I was the bottleneck. I didn’t realize it for an embarrassingly long time, because being the bottleneck feels like being valuable.

The second cost showed up when we tried to move from running tests locally to running them in CI. This should have been easy. Instead it turned into a bizarre file-reading, string-parsing, pipeline-splitting Frankenstein. Our custom wait logic had timing assumptions baked in that worked on a developer’s MacBook and fell apart on a slow build agent. Our action wrappers logged in a way that was readable in a terminal and useless in a CI artifact. Our config layer assumed a specific local file structure. None of this was Selenium’s fault. None of it was Ruby’s fault. It was all things we had built ourselves, on top of things that already worked, and we were now paying to undo our own cleverness.
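The shape of that wait problem, sketched in Ruby (the module and its constants are hypothetical): a homegrown polling helper whose timing budget is a constant tuned on a laptop. Nothing stops a caller from overriding it, but nothing invites them to either — and on a slow build agent the default quietly stops being enough.

```ruby
# Hypothetical sketch of a homegrown wait: the polling budget is a
# constant tuned on a fast developer machine. On a slow CI agent the
# hard-coded default silently becomes the wrong answer.
module CustomWait
  TIMEOUT  = 5     # seconds -- "worked on my machine"
  INTERVAL = 0.1   # seconds between polls

  def self.until(timeout: TIMEOUT, interval: INTERVAL)
    deadline = Time.now + timeout
    loop do
      value = yield
      return value if value
      raise "timed out after #{timeout}s" if Time.now > deadline
      sleep interval
    end
  end
end

# Usage in a test body: poll until a condition flips true.
polls = 0
CustomWait.until { (polls += 1) >= 3 }
puts polls # => 3
```

Selenium's own Wait has the same polling shape — which is exactly the point: we rebuilt something that existed, and then owned its bugs.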

We got it working eventually. The solution we built to bridge the gap was, of course, more wrapper code. The framework grew another layer.

What good looks like

The Java Playwright framework I’m building now is, deliberately, the inverse of all of that.

The base test class handles tracing, video recording, screenshots on failure, parallel coordination, environment switching, and authentication setup. None of that touches the test body. None of it has its own custom DSL. Once a test is running, it has a Page object — Playwright’s Page object, not mine — and the test writer uses Playwright’s API directly: page.getByRole(AriaRole.BUTTON, ...), page.click(...), page.waitForResponse(...). Whatever Playwright supports, the test writer can reach for. Whatever Playwright documents, the test writer can read.

When something complicated needs handling — SMS verification, distributed tracing headers for downstream services, time zone manipulation for user accounts — that gets a small helper or fixture with a one-line API and disappears behind it. The test writer calls Twilio.lastMessageTo(phone) and never thinks about the rest. That’s where wrapping is earning its keep: hiding genuine complexity, not hiding things that aren’t actually complex.

The result is a test that any Playwright user, anywhere on the internet, could read. Including the AI tools they’ll inevitably ask for help. Including the new hire on day one. Including me three years from now, when I’ve forgotten what I was thinking when I wrote it.

How to know if you’ve gone too far

A few smell tests, in roughly increasing severity:

  • Can a new hire read a test on day one? Not write one — read one, and roughly understand what’s happening. If not, you’ve abstracted past the line.
  • Can they Google their way out of a problem? If the answer to most questions is “ask the framework owner,” your framework is a single point of failure with extra steps.
  • If you renamed your framework’s wrappers back to the underlying tool’s API, would anything break that wasn’t already broken? If the answer is “no, that would be a net improvement,” you’ve built abstractions that aren’t earning their keep.
  • When something fails in CI but passes locally, does the failure point at your framework or at the application? If your framework is showing up in failure investigations more often than it’s showing up in successes, that’s a tell.

The point of a test framework is not to be admired. It’s to disappear. The best ones do their job and get out of the way of the test writer, leaving them with the underlying tool, the underlying language, and the underlying documentation — all of which were written by people smarter than us, with more reviewers than us, and which will outlive whatever clever DSL we were going to bolt on top.

The rule, one more time

Handle what the test writer shouldn’t care about. Leave alone what they should.

That’s the whole framework philosophy. Everything else is optimization, and most of the time, it’s optimization for the wrong thing — for the framework writer’s sense of craft instead of for the test writer’s ability to ship.

If you’re staring at a homegrown matcher library, a custom wait helper, or a five-method-deep wrapper around click, ask the question I should have asked years earlier: what underlying capability did we hide, and why did we think we needed to hide it?

Sometimes the answer is good. Most of the time, it isn’t.

