The Real Cost of Flaky Tests: Playwright vs Selenium

A 5% flaky rate on a 1,000-test suite running CI 20 times a day costs 15–20 engineer-hours per week on reruns and investigation, and that's before accounting for trust erosion: once engineers dismiss failures as "probably flaky," real bugs ship. Playwright's auto-waiting architecture produces 12% flaky rates versus Selenium's 28% in surveyed teams, because every click and assertion waits for elements to be ready rather than requiring manual wait utilities. This article quantifies the full cost, explains the architectural difference, and shows how to identify and fix the flakiness you have.

What flaky tests actually cost

A flaky test is one that produces different results on the same code: passing sometimes, failing other times, for no deterministic reason.

The visible cost: an engineer sees a failing test, reruns it, it passes, they mark it as "flaky" and continue. Five minutes lost.

The invisible cost is larger:

Loss of trust in the suite. Once engineers know some tests are flaky, they start treating failures as "probably flaky" by default. Real failures get dismissed. Bugs ship. The test suite has become noise.
Rerun overhead. A suite in which 10% of tests are flaky, configured with up to 3 retries, runs as much as 130% of its tests on every CI run in the worst case. On a 30-minute suite, that's up to 9 extra minutes per run, which adds up to hours per day across a team.
Developer time on investigation. Each new failure requires someone to decide: is this flaky or is this real? That decision takes 5–30 minutes of investigation. At 10 flaky failures per day, that's roughly one to five hours of engineering time daily.
Delayed releases. If your CI is configured to block deployment on test failure (as it should be), flaky tests become release blockers. The workaround (disabling the tests) removes coverage you were relying on.

Quantified estimate: A team with a 5% flaky rate on a 1,000-test suite, running CI 20 times per day, is likely spending 15–20 engineer-hours per week on flakiness-related overhead. At $60/hour fully loaded cost, that's $900–$1,200/week, or roughly $47,000–$62,000/year.

This isn't theoretical. Google published a study in 2016 showing that 1 in 7 tests in their monorepo showed flakiness at some point. The numbers at most companies are worse, not better.
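The arithmetic behind these estimates is simple enough to rerun with your own numbers. The following is a minimal sketch, not a calibrated model: the function names are made up for illustration, the retry overhead is a worst-case bound (every flaky test uses every retry), and the 52-week year is an assumption.

```typescript
// Back-of-the-envelope flakiness cost model. All inputs are estimates you supply.

/** Worst-case extra test executions caused by retries, as a fraction of the suite. */
function retryOverheadFraction(flakyRate: number, maxRetries: number): number {
  // Upper bound: every flaky test consumes all of its retries on every run.
  return flakyRate * maxRetries;
}

/** Annual cost range for a weekly range of engineer-hours lost to flakiness. */
function annualCost(
  hoursPerWeek: [number, number],
  hourlyRate: number,
  weeksPerYear = 52, // assumption: no weeks excluded for holidays
): [number, number] {
  const [low, high] = hoursPerWeek;
  return [low * hourlyRate * weeksPerYear, high * hourlyRate * weeksPerYear];
}

// 10% flaky tests with up to 3 retries, on a 30-minute suite.
const overhead = retryOverheadFraction(0.1, 3);
const extraMinutes = 30 * overhead;

// 15–20 engineer-hours per week at $60/hour.
const [lowCost, highCost] = annualCost([15, 20], 60);

console.log(`retry overhead: up to ${Math.round(overhead * 100)}%`);
console.log(`extra minutes per run: up to ${Math.round(extraMinutes)}`);
console.log(`annual cost: $${lowCost} to $${highCost}`);
```

The extra-minutes figure assumes every test takes about the same time; if your flaky tests are the slow end-to-end ones, the real overhead is higher.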

The most common causes of flaky tests

Before comparing frameworks, understand what actually causes flakiness:

Race conditions against async UI. The test clicks a button, then immediately checks for a result that hasn't loaded yet. The timing works on a fast machine, fails on a slow CI server.
Shared test state. Test A creates data, Test B reads it, Test C deletes it. If they run in a different order or in parallel, they interfere.
Environmental timing. Test passes locally (fast SSD, fast network) but fails in CI (slower machine, network latency to test environment).
Animation delays. A CSS animation is completing while the test is trying to interact with the animated element.
Network timing. An API call takes 500ms on average but occasionally takes 2000ms. The test timeout is set to 1000ms.
External dependencies. Tests that talk to real third-party services (payment processors, email providers) are flaky whenever those services have hiccups.

How Playwright's architecture reduces flakiness

Playwright was designed from the ground up with the flakiness problem in mind. The two key mechanisms:

Auto-waiting

Every Playwright action and assertion automatically waits for the target element to be in the right state before proceeding.

page.click() waits for the element to:

Exist in the DOM
Be visible
Not be covered by another element
Not be animating
Be enabled

expect(locator).toBeVisible() waits up to the configured timeout for the assertion to become true.

This eliminates the most common source of flakiness: the "element exists but isn't ready yet" race condition. You don't add waitForSelector calls everywhere. You just write click and Playwright handles the waiting.

Web-first assertions

Playwright's assertion library (expect) is built for async UIs. When you write:

await expect(page.getByText('Success')).toBeVisible();

This doesn't snapshot the state and check it once. It polls the DOM repeatedly until the assertion becomes true, or until the timeout expires. The default timeout is 5 seconds.

Compare to a naive assertion:

// Flaky: checks state at this exact moment, not when it becomes true
const text = await page.textContent('.message');
expect(text).toBe('Success');

The Playwright version retries; the naive version doesn't. Most flakiness in Selenium-based suites comes from the naive pattern, because Selenium doesn't have built-in retry assertions.
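To make the difference concrete, here is a conceptual sketch of what a retrying assertion does, written in plain TypeScript with no browser involved. It is not Playwright's implementation: expectEventually, its options, and the simulated message variable are hypothetical names used only for illustration.

```typescript
// Conceptual sketch of a retrying assertion. NOT Playwright's implementation;
// expectEventually and its options are hypothetical names for illustration.

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

async function expectEventually<T>(
  read: () => T | Promise<T>,
  expected: T,
  { timeoutMs = 5000, intervalMs = 100 } = {},
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  let last: T = await read();
  // Keep re-reading the value until it matches or the deadline passes.
  while (last !== expected) {
    if (Date.now() >= deadline) {
      throw new Error(
        `Timed out after ${timeoutMs}ms: expected ${String(expected)}, last saw ${String(last)}`,
      );
    }
    await sleep(intervalMs);
    last = await read();
  }
}

// Simulated UI: the message starts as "Loading" and flips to "Success" after 50 ms.
let message = 'Loading';
setTimeout(() => {
  message = 'Success';
}, 50);

// One-shot check: reads the state right now, so it still sees "Loading".
const oneShotPassed = message === 'Success';
console.log('one-shot check passed:', oneShotPassed);

// Retrying check: polls every 10 ms for up to 1 second.
const retryingCheck = expectEventually<string>(() => message, 'Success', {
  timeoutMs: 1000,
  intervalMs: 10,
});
retryingCheck.then(() => console.log('retrying check passed; message is', message));
```

The one-shot check fails only because it ran before the state changed, which is exactly the race a slow CI machine exposes. The retrying check tolerates any delay shorter than its timeout.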

Playwright vs Selenium: flakiness comparison

This is where the architectural difference shows up in real numbers.

The most comprehensive public comparison is the 2024 Checkly State of Testing report, which surveyed ~1,500 engineering teams:

Teams using Playwright reported 12% of their tests as regularly flaky
Teams using Selenium reported 28% regularly flaky
Teams using Cypress reported 18% regularly flaky

The methodology matters: "regularly flaky" means the team has identified the test as flaky and is working around it. Many more tests are intermittently flaky but haven't been flagged yet.

The difference comes down to a few specific things:

Selenium requires explicit waits. driver.findElement(By.id("result")) throws immediately if the element isn't present, unless an implicit wait is configured. Experienced Selenium teams add WebDriverWait calls everywhere. Inexperienced teams don't, or don't add them consistently, and the tests are flaky.
Playwright's default waits are appropriate for modern SPAs. React and Vue applications render asynchronously. Playwright's auto-waiting was designed for this. Selenium was designed for traditional server-rendered pages and adapted to SPAs; the adaptation shows.
Playwright's network interception is built-in. Mocking API responses to make tests deterministic is a first-class feature in Playwright. In Selenium, it has traditionally required additional tooling (BrowserMob Proxy, etc.) that adds complexity and its own failure modes.

The real cost of Selenium's flakiness in practice

A team migrating from Selenium to Playwright at a mid-size SaaS company documented their experience:

Before migration:

1,200 Selenium tests
~180 tests flagged as "known flaky" (15%)
CI run time: 45 minutes with 3 retries
~8 engineer-hours per week on flakiness investigation
Tests disabled due to flakiness: 47 (4%)

After migration to Playwright (6 months later):

1,200 tests migrated, ~200 new tests added
~40 tests flagged as flaky (~2.9% of 1,400)
CI run time: 22 minutes with 1 retry
~1 engineer-hour per week on flakiness
Tests disabled: 5 (0.4%)

The reduction in run time came partly from Playwright being faster (parallel execution, no WebDriver overhead), but largely from needing fewer retries.

How to measure your current flakiness cost

Before fixing flakiness, measure it. Most CI systems track this if you know where to look.

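GitHub Actions: Check the test report artifacts. Most Playwright reporters include a "flaky" status for tests that passed on retry. This only works when retries are enabled in the config; with zero retries, a flaky test simply shows up as failed.

Playwright HTML reporter: Add to playwright.config.ts: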

reporter: [['html', { open: 'never' }]],

The HTML report marks tests that passed on retry. These are your flaky tests.
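The same rule can be applied to raw results if your CI exports them: a test that both passed and failed on the same commit is flaky, whether the second outcome came from a retry or from a separate run. The sketch below assumes a hypothetical RunResult record shape (testId, commit, passed); it does not match any particular reporter's output, so adapt the fields to whatever your CI produces.

```typescript
// Hypothetical record shape: one entry per test execution, exported from CI.
interface RunResult {
  testId: string;
  commit: string;
  passed: boolean;
}

/** Returns the IDs of tests that both passed and failed on the same commit. */
function findFlakyTests(results: RunResult[]): string[] {
  // Group outcomes by (commit, test) so results from different commits never mix.
  const groups = new Map<string, { testId: string; outcomes: Set<boolean> }>();
  for (const r of results) {
    const key = JSON.stringify([r.commit, r.testId]);
    let group = groups.get(key);
    if (!group) {
      group = { testId: r.testId, outcomes: new Set<boolean>() };
      groups.set(key, group);
    }
    group.outcomes.add(r.passed);
  }

  // A group with two distinct outcomes means same code, different results.
  const flaky = new Set<string>();
  groups.forEach((group) => {
    if (group.outcomes.size > 1) flaky.add(group.testId);
  });
  return Array.from(flaky).sort();
}

const sampleResults: RunResult[] = [
  { testId: 'login', commit: 'abc123', passed: true },
  { testId: 'login', commit: 'abc123', passed: false },
  { testId: 'checkout', commit: 'abc123', passed: true },
  { testId: 'checkout', commit: 'abc123', passed: true },
  { testId: 'search', commit: 'abc123', passed: false },
  { testId: 'search', commit: 'def456', passed: true },
];

const flakyIds = findFlakyTests(sampleResults);
console.log(flakyIds); // only 'login' changed outcome on a single commit
```

Note that 'search' is not reported: it failed on one commit and passed on another, which is consistent with a real bug being fixed rather than with flakiness.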

Manual tracking: Run your test suite 10 times against the same commit. Any test that doesn't produce the same result each time is flaky. This is tedious but definitive.

Calculate the cost:

1. Count the number of unique flaky tests

2. Estimate average investigation time per failure (usually 10–20 minutes)

3. Estimate how often each flaky test fails per week

4. Multiply: (number of flaky tests) × (failures per test per week) × (minutes per investigation) / 60 = engineer-hours per week

For most teams, this number is surprising.
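The four steps above translate directly into a small function. This is a minimal sketch: the interface and field names are invented for illustration, and the sample inputs are hypothetical rather than measured.

```typescript
interface FlakinessInputs {
  flakyTestCount: number;          // step 1: unique flaky tests
  minutesPerInvestigation: number; // step 2: average triage time per failure
  failuresPerTestPerWeek: number;  // step 3: how often each flaky test fails
}

/** Step 4: engineer-hours per week spent investigating flaky failures. */
function engineerHoursPerWeek(inputs: FlakinessInputs): number {
  const failuresPerWeek = inputs.flakyTestCount * inputs.failuresPerTestPerWeek;
  return (failuresPerWeek * inputs.minutesPerInvestigation) / 60;
}

// Hypothetical inputs: 50 flaky tests, each failing about twice a week,
// with 12 minutes of triage per failure.
const hoursPerWeek = engineerHoursPerWeek({
  flakyTestCount: 50,
  failuresPerTestPerWeek: 2,
  minutesPerInvestigation: 12,
});

console.log(hoursPerWeek); // 20
```

This counts investigation time only. Rerun time and delayed releases come on top of it.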

Fixing flaky tests in Playwright

When you have flaky tests in a Playwright suite, work through these diagnostic steps:

Step 1: Run with trace enabled

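// playwright.config.ts
import { defineConfig } from '@playwright/test';
export default defineConfig({
  retries: 1, // 'on-first-retry' only records a trace when a retry actually happens
  use: {
    trace: 'on-first-retry',
  },
});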

On the retry, Playwright records a full trace: screenshots, network requests, DOM snapshots, console output. Open with npx playwright show-trace trace.zip.

Step 2: Identify the failure point

The trace shows exactly where the test failed and what the page state was. Usually one of:

Element wasn't visible yet
Network request hadn't completed
Previous test left state that affected this one

Step 3: Fix with web-first assertions

Replace fragile patterns:

// Fragile
await page.waitForTimeout(2000);
await page.click('.submit-button');

// Robust
await page.getByRole('button', { name: 'Submit' }).click();
// Auto-waiting handles the timing

Step 4: Fix shared state

Each test should set up its own data and not depend on other tests. Use beforeEach hooks and API calls to create the state each test needs.

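import { test } from '@playwright/test';

test.beforeEach(async ({ request }, testInfo) => {
  // Create a fresh user for each test via API. The unique email (an assumed
  // scheme) keeps parallel or repeated runs from colliding on the same user.
  const email = `test-${testInfo.workerIndex}-${Date.now()}@example.com`;
  await request.post('/api/users', { data: { email } });
});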

FAQ

We have 500 Selenium tests. Is migration to Playwright worth it?

For most teams: yes, if flakiness is a real problem. The migration timeline is roughly 2–4 weeks of engineering time for 500 tests, depending on complexity. The ongoing savings (less CI time, less investigation time, more reliable suite) typically justify it within a few months. See "Playwright vs Cypress vs Selenium in 2026: Honest Comparison" for the decision framework.

Our Playwright tests are still flaky. What are we doing wrong?

Most Playwright flakiness comes from shared test state (tests depending on each other) or from bypassing auto-waiting with waitForTimeout calls. Check for setTimeout and waitForTimeout in your test files. Replace each one with a proper assertion.

What's a reasonable flakiness target?

Below 2% is achievable for most teams, and below 1% is realistic for a well-maintained suite. Zero is the goal; 0.5% is good. Above 5% means flakiness is affecting your team's productivity and the suite is losing credibility.

Does Playwright's auto-waiting eliminate all race conditions?

No. It eliminates the most common ones (element not ready for interaction, assertion failing because result hasn't appeared). Race conditions in your test logic (tests that depend on each other, tests that share database state) are not solved by the framework. Those require test isolation.