Flaky Tests: Why They Happen and How to Eliminate Them

The visible cost of a flaky test is the time spent re-running CI. The real cost is the moment your team stops treating red builds as worth investigating, because once a few failures are 'probably flaky,' real bugs get the same treatment. This guide covers the five root causes of intermittent failures in Playwright: async timing, shared state between tests, environment differences, selector instability, and network dependencies, with the diagnostic steps and concrete fixes for each.

The real cost of flakiness

The obvious cost is time: developers re-running pipelines, testers investigating failures that turn out to be nothing, engineers spending a Friday afternoon bisecting a test that "just started acting up." That time adds up fast. A conservative estimate for one persistently flaky test in a busy team is 30–60 minutes of investigation per week.

The hidden cost is worse. When failures are unreliable, every failure becomes suspect. Real bugs get dismissed. The instinct to act on a red build (which is exactly what CI is designed to build) erodes. Eventually your test suite is green on merge, red on main, and nobody bats an eye.

There's also a psychological cost. Flaky tests make test automation feel untrustworthy and fragile. Junior engineers on the team start believing that automation is just inherently unreliable, which shapes how they write tests going forward.

The fix starts with a clear-eyed look at root causes rather than reaching for --retries.

Race conditions and async timing: the number one cause

The overwhelming majority of flaky tests in Playwright come from timing problems. The test tries to click a button before the button is ready, or asserts on text before the network request that populates it has finished. On a fast machine it works. On a slow CI runner it doesn't.

The instinct is to add a sleep:

// The wrong fix — still flaky, just slower
await page.waitForTimeout(3000);
await page.getByRole('button', { name: 'Save' }).click();

This makes the test three seconds slower and still fails on a bad CI day. You've traded one problem for two.

Playwright auto-waits before most actions. When you call locator.click(), Playwright waits for the element to be visible, stable, enabled, and not obscured by another element before it acts. The test only becomes flaky when you short-circuit that behavior or when you're waiting on something Playwright doesn't know about, like an animation finishing or a spinner disappearing.

The right fix: wait for the specific condition that must be true before you act.

// Wait for a loading spinner to disappear before interacting
await page.getByTestId('loading-spinner').waitFor({ state: 'hidden' });
await page.getByRole('button', { name: 'Save' }).click();

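// Wait for a network response that populates the page before asserting.
// Start waiting before the action that triggers the request, or a fast response can be missed.
const productsResponse = page.waitForResponse(
  (resp) => resp.url().includes('/api/products') && resp.status() === 200
);
await page.goto('/products'); // hypothetical page that triggers the request
await productsResponse;
await expect(page.getByRole('table')).toBeVisible();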

// Wait for a button to become enabled after form validation
const saveButton = page.getByRole('button', { name: 'Save' });
await expect(saveButton).toBeEnabled();
await saveButton.click();

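Each of these waits for a real condition rather than guessing a duration. Playwright retries web-first assertions such as toBeEnabled until they pass or the expect timeout elapses (5 seconds by default), and waitForResponse gives up after its own timeout (30 seconds by default). Both are configurable, so the test is both reliable and as fast as the app allows.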
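
To make the polling model concrete, here is a minimal sketch of a condition-based wait in plain TypeScript. It is not Playwright's implementation: waitForCondition, its parameters, and its default values are hypothetical names chosen for illustration. It only shows why a polled condition returns as soon as the condition holds, while a fixed sleep always costs its full duration.

```typescript
// Hypothetical helper: polls `check` until it returns true or the timeout elapses.
async function waitForCondition(
  check: () => boolean | Promise<boolean>,
  timeoutMs: number = 5_000,
  intervalMs: number = 50,
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    if (await check()) return; // condition met: return immediately
    if (Date.now() >= deadline) {
      throw new Error(`Condition not met within ${timeoutMs}ms`);
    }
    // Sleep briefly before the next poll instead of guessing one long duration.
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

Inside a Playwright test you rarely need a helper like this, because web-first assertions and expect.poll already retry for you.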

If you find yourself writing waitForTimeout more than once a week, treat it as a code smell. Each instance is a test that will be flaky under load. Replace each one with a condition-based wait.

Test ordering and shared state

Tests that pass alone but fail when the full suite runs are almost always leaving state behind. One test creates a record, the next test trips over it. One test sets a cookie, the next test behaves differently because of it. One test changes a user's settings, and every subsequent test for that user is now in an unexpected state.

// This test leaves a "Test Item" in the database every time it runs
test('add item to inventory', async ({ page }) => {
  await page.goto('/inventory');
  await page.getByRole('button', { name: 'Add Item' }).click();
  await page.getByLabel('Name').fill('Test Item');
  await page.getByRole('button', { name: 'Save' }).click();
  await expect(page.getByText('Test Item')).toBeVisible();
  // Nothing gets cleaned up
});

// This test now fails if the previous test ran first — it finds 2 items when expecting 1
test('inventory shows one item', async ({ page }) => {
  await page.goto('/inventory');
  await expect(page.getByRole('row')).toHaveCount(2); // 1 data row + 1 header
});

The fix is isolation. Every test should set up its own state and clean it up afterward. Playwright fixtures are the right tool for this: they run setup before each test and teardown after, even if the test fails.

import { test as base } from '@playwright/test';

type TestFixtures = {
  testItem: { id: string; name: string };
};

const test = base.extend<TestFixtures>({
  testItem: async ({ request }, use) => {
    // Create the item before the test
    const response = await request.post('/api/inventory', {
      data: { name: `Test Item ${Date.now()}` },
    });
    const item = await response.json();

    await use(item); // run the test

    // Clean up after, even if the test failed
    await request.delete(`/api/inventory/${item.id}`);
  },
});

test('inventory item shows detail page', async ({ page, testItem }) => {
  await page.goto(`/inventory/${testItem.id}`);
  await expect(page.getByRole('heading', { name: testItem.name })).toBeVisible();
});

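Fixture teardown runs even when the test body fails, and it sits right next to the setup it undoes. afterEach hooks also run after a failed test, but they're separated from their setup and usually depend on shared variables, so it's easier to miss a cleanup path. One caveat: the code after use() only runs if the fixture's own setup succeeded, so keep that setup small. That locality is the difference between isolation that mostly works and isolation you can rely on.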

Environment-dependent flakiness

Some tests work perfectly on your MacBook and fail every other run in GitHub Actions. The environment difference is doing the work. Common culprits:

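Timezone. new Date() represents the same instant everywhere, but formatting it produces different strings depending on the machine's time zone. A test that asserts on a formatted date string will fail in CI if the runner is in UTC and your local machine is in UTC+3, whenever that instant falls on different calendar days in the two zones.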

// Flaky — depends on the machine's local timezone
const today = new Date().toLocaleDateString('en-GB');
await expect(page.getByTestId('report-date')).toHaveText(today);

// Stable: pin the time zone explicitly (the app must render dates in the same zone)
const today = new Date().toLocaleDateString('en-GB', { timeZone: 'UTC' });
await expect(page.getByTestId('report-date')).toHaveText(today);
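
The effect is easy to reproduce outside a browser. The sketch below formats one fixed instant in two time zones; formatReportDate is a hypothetical helper name, and Europe/Moscow stands in for a UTC+3 machine.

```typescript
// A fixed instant: 23:30 UTC on 1 January 2024.
const instant = new Date(Date.UTC(2024, 0, 1, 23, 30));

// Hypothetical helper: formats `date` as an en-GB date string in the given IANA time zone.
function formatReportDate(date: Date, timeZone: string): string {
  return date.toLocaleDateString('en-GB', { timeZone });
}

const inUtc = formatReportDate(instant, 'UTC');              // "01/01/2024"
const inMoscow = formatReportDate(instant, 'Europe/Moscow'); // "02/01/2024" (UTC+3)
```

Pinning the zone in the test only helps if the app renders in the same zone. On the browser side, Playwright's timezoneId and locale options (under use in playwright.config.ts) pin what the page sees.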

Random test data. If you generate IDs or names without a seed, parallel test runs can collide on the same value or produce values that happen to match existing records.

// Risky in parallel — two workers might generate the same ID in the same millisecond
const id = Date.now();

// Better — combine timestamp with worker index
const id = `${Date.now()}-${workerInfo.workerIndex}`;
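
A note on where the index comes from: workerInfo is only available inside worker-scoped fixtures, while inside a test the same value is exposed as testInfo.workerIndex. A minimal sketch of an ID helper that also covers two IDs created in the same millisecond by the same worker might look like this; uniqueId and the sequence counter are hypothetical names, not part of Playwright.

```typescript
// Per-process counter. Each Playwright worker is its own process, so this is per worker.
let sequence = 0;

// Hypothetical helper: builds an ID that is unique across workers and within one worker.
function uniqueId(workerIndex: number, now: number = Date.now()): string {
  sequence += 1; // distinguishes IDs created in the same millisecond by the same worker
  return `${now}-${workerIndex}-${sequence}`;
}
```

In a test you would call it as uniqueId(testInfo.workerIndex), where testInfo is the second argument of the test function.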

Viewport and resolution. Elements that are visible at 1920px might be hidden behind a hamburger menu at the default CI viewport. Set a consistent viewport in playwright.config.ts and don't rely on responsive breakpoints unless you're explicitly testing them.

// playwright.config.ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  use: {
    viewport: { width: 1280, height: 720 },
  },
});

Selector instability

Dynamic class names and generated IDs are a trap. CSS Modules and CSS-in-JS libraries generate class names that include content hashes, and utility-class frameworks like Tailwind produce class lists that change whenever the styling does. Compiled apps sometimes generate element IDs based on build order. A selector that worked yesterday breaks after a dependency update.

// Fragile — this class name is generated and will change
await page.locator('.tw-btn-primary-3af82').click();

// Fragile — generated ID, meaningless and unstable
await page.locator('#ember-423').click();

// Also fragile — nth-child is order-dependent
await page.locator('ul > li:nth-child(3)').click();

Playwright's semantic locators tie the selector to what the element does rather than how it's styled or structured. They're stable across refactors because they reflect user-facing semantics, not implementation details.

// Stable — role + accessible name
await page.getByRole('button', { name: 'Submit' }).click();

// Stable — label text
await page.getByLabel('Email address').fill('user@example.com');

// Stable — test ID (add data-testid to the element if needed)
await page.getByTestId('submit-button').click();

// Stable — visible text
await page.getByText('Order confirmed').waitFor();

data-testid attributes are a deliberate contract between the test and the application. They survive CSS refactors, layout changes, and framework upgrades. If your app doesn't have them yet, start adding them to high-value interactive elements.

When you have to fall back to a CSS or XPath selector, or when several elements share the same role and name, scope the locator tightly and anchor it to a stable parent:

// Scoped to a named section, not the whole page
const orderSummary = page.getByTestId('order-summary');
await expect(orderSummary.getByRole('cell', { name: 'Total' })).toBeVisible();

Network-dependent tests

Tests that hit real external services are inherently flaky. A third-party API can be slow, rate-limited, or temporarily unavailable. A test that calls the live Stripe API in CI is not testing your code. It's testing whether Stripe is up.

The pattern to recognize: any test that makes a real HTTP call to something outside your control is a flaky test waiting to happen.

For external APIs, mock at the network level:

test('checkout completes with payment confirmation', async ({ page }) => {
  // Intercept the Stripe API call and return a controlled response
  await page.route('**/api/stripe/charge', async (route) => {
    await route.fulfill({
      status: 200,
      contentType: 'application/json',
      body: JSON.stringify({
        id: 'ch_test_123',
        status: 'succeeded',
        amount: 4999,
      }),
    });
  });

  await page.goto('/checkout');
  await page.getByLabel('Card number').fill('4242 4242 4242 4242');
  await page.getByRole('button', { name: 'Pay now' }).click();
  await expect(page.getByText('Payment confirmed')).toBeVisible();
});

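For slow internal endpoints, use page.waitForResponse with a generous timeout rather than hoping the UI updates within the default 5-second assertion timeout. Start waiting before the action that triggers the request, and raise the test timeout (30 seconds by default) so it doesn't cut the wait short:

test('large report generates successfully', async ({ page }) => {
  // The default 30-second test timeout would end the test before the 60-second wait below
  test.setTimeout(90_000);

  await page.goto('/reports');

  // Start waiting before the click so the response can't be missed; allow up to 60 seconds
  const reportResponse = page.waitForResponse(
    (resp) => resp.url().includes('/api/reports/generate') && resp.status() === 200,
    { timeout: 60_000 }
  );
  await page.getByRole('button', { name: 'Generate Report' }).click();
  await reportResponse;

  await expect(page.getByRole('link', { name: 'Download Report' })).toBeVisible();
});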

If you're testing against a real API that you own, consider using a dedicated test environment that you can reset between runs. A test database seeded to a known state before each run eliminates an entire class of flakiness.

The retry trap

Playwright supports automatic retries and they're genuinely useful, but they're also the most commonly misused tool in the flaky-test toolkit.

// playwright.config.ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  retries: process.env.CI ? 2 : 0,
});

This configuration is reasonable as a last line of defense against genuine infrastructure flakiness: a momentary network hiccup in CI, a CI runner that occasionally fails to start a browser. It's not a fix for tests that have real problems.

Retries also cost time: a test that passes on the third attempt still consumes the time of the first two failures. With retries: 2, a test that fails every attempt runs three times, so a suite that takes 10 minutes on clean runs can approach 25–30 minutes when many of its slow tests are flaky. You've hidden the failures while making the pipeline slower.
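
A rough way to put numbers on that cost is sketched below. The model is an assumption, not something Playwright reports: each attempt fails independently with the same probability, and the runner stops at the first pass. The function names are hypothetical.

```typescript
// Expected number of attempts for one test, assuming each attempt fails
// independently with probability `failureRate` and the runner stops at the
// first pass or after `retries` extra attempts.
function expectedAttempts(failureRate: number, retries: number): number {
  let attempts = 0;
  let reachProbability = 1; // probability that this attempt is reached at all
  for (let k = 0; k <= retries; k++) {
    attempts += reachProbability;
    reachProbability *= failureRate; // the next attempt only runs if this one failed
  }
  return attempts;
}

// Worst case for a whole suite: every test fails every attempt.
function worstCaseMinutes(cleanRunMinutes: number, retries: number): number {
  return cleanRunMinutes * (1 + retries);
}
```

With retries: 2, a test that fails half the time averages 1.75 attempts per run under this model, and worstCaseMinutes(10, 2) gives the 30-minute ceiling for a 10-minute suite.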

Retries are acceptable when:

The flakiness is demonstrably infrastructure-related (CI runners, browser launch failures, network timeouts to your own services in the same datacenter)
You've investigated and confirmed there's no test code problem
You're treating retries as temporary while a deeper fix is in progress

Retries are harmful when:

They're the first response to a new flaky test
They're covering up timing issues or isolation problems in the test code
The retry count is being increased over time as the suite gets worse

// Wrong: masking a real timing problem
export default defineConfig({
  retries: 5, // This test kept failing, so we just added more retries
});

// Right: retries as a safety net with a fixed, low limit
export default defineConfig({
  retries: process.env.CI ? 1 : 0,
  // Actual timing and isolation problems are fixed in the test code
});

If you're increasing retries over time rather than decreasing them, your flaky test problem is getting worse, not better. The number of tests that only pass on a retry is a health metric. It should trend toward zero.
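
Playwright's reports label a test that fails and then passes on a retry as flaky, and that count is the number to track. The sketch below shows the classification on a hypothetical input shape (an ordered list of attempt results per test); it is not the format of any Playwright reporter.

```typescript
type AttemptResult = 'passed' | 'failed';
type Outcome = 'passed' | 'flaky' | 'failed';

// Hypothetical input: the ordered attempt results for one test in one run.
function classify(attempts: AttemptResult[]): Outcome {
  const last = attempts[attempts.length - 1];
  if (last !== 'passed') return 'failed';          // never passed, even after retries
  return attempts.length > 1 ? 'flaky' : 'passed'; // passed, but only after a retry
}

// Counts the tests that needed a retry to pass: the number that should trend toward zero.
function countFlaky(run: AttemptResult[][]): number {
  return run.filter((attempts) => classify(attempts) === 'flaky').length;
}
```

Recording countFlaky for each CI run gives you a trend line rather than a feeling.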

Systematic investigation: how to diagnose a flaky test

When a test starts failing intermittently, work through this sequence rather than guessing and tweaking.

Step 1: Reproduce it deterministically. Run the test 20 times in a row:

npx playwright test tests/checkout.spec.ts --repeat-each=20

Count the failures. A test that fails 1 in 20 runs is mildly flaky. A test that fails 15 in 20 runs has a real problem. This also tells you how much effort the fix is worth: a 5% failure rate in a suite that runs 50 times a day still hits you 2–3 times per day.
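
The arithmetic behind those numbers is sketched below, assuming runs fail independently of each other (a simplification; real failures often cluster on busy runners). The function names are hypothetical.

```typescript
// Expected number of failing runs per day for one flaky test.
function expectedFailuresPerDay(failureRate: number, runsPerDay: number): number {
  return failureRate * runsPerDay;
}

// Probability that at least one of `runs` independent runs fails.
function chanceOfAtLeastOneFailure(failureRate: number, runs: number): number {
  return 1 - Math.pow(1 - failureRate, runs);
}
```

expectedFailuresPerDay(0.05, 50) gives the 2.5 failures per day mentioned above. chanceOfAtLeastOneFailure(0.05, 20) is roughly 0.64, which means a test with a 5% failure rate still passes all 20 repeats about a third of the time. A clean run of 20 is evidence of stability, not proof.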

Step 2: Isolate it. Run only that test file. If it passes reliably in isolation but fails in the full suite, the problem is test pollution from another test running before it.

# Run in isolation
npx playwright test tests/checkout.spec.ts

# Run in suite order to reproduce pollution
npx playwright test --workers=1

Step 3: Capture the trace. Enable tracing for failures and look at the exact moment the test broke:

// playwright.config.ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  use: {
    trace: 'on-first-retry',
  },
  retries: 1, // retry once to trigger trace capture
});

npx playwright test tests/checkout.spec.ts
npx playwright show-report

The trace viewer shows a timeline with before/after screenshots for every action, all network requests, and console logs. In the majority of cases, the failure point in the trace immediately reveals whether the problem is timing, a missing element, or an unexpected network response.

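Step 4: Run headed with slow motion. If the trace isn't conclusive, watch the test run. The test runner has no slow-motion command-line flag; slow motion is the slowMo launch option, so set launchOptions: { slowMo: 500 } under use in playwright.config.ts and then run headed:

npx playwright test tests/checkout.spec.ts --headed

With slowMo set to 500, Playwright slows each operation down by 500ms. What looks instantaneous in a normal run becomes visible, and you'll often see the exact moment the UI isn't ready for the next interaction.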

Step 5: Check what runs before it. If isolation testing revealed pollution, find the preceding test:

# Run with a single worker to get a deterministic order, then check which test ran before the failing one
npx playwright test --workers=1 --reporter=list

Look for tests in the preceding file that create records, set cookies, or modify application state without cleanup.

Step 6: Apply the right fix. Based on what you found:

Timing problem: replace waitForTimeout with a condition-based wait
Test pollution: add cleanup in afterEach or convert setup/teardown to fixtures
Selector instability: switch to getByRole, getByLabel, or getByTestId
Network dependency: mock the external call with page.route
Environment difference: pin timezone, viewport, and any values that vary by machine

Most flaky tests are diagnosed by step 2 (isolation) or step 3 (trace viewer). The investigation rarely needs steps 4 and 5 before you can apply the fix in step 6.

FAQ

How do I know if a test is flaky or if it actually caught a bug?

Run it 10 times on the same commit without code changes. If it fails 1–3 times out of 10, it's flaky. If it fails consistently (7 or more out of 10), it's likely caught a real regression. The distinction matters because flaky tests need investigation while consistent failures need a bug fix.

My test only fails in CI, not locally. What's different?

CI runners are typically slower, headless, and in a different timezone. The most common CI-specific causes are timing issues that local hardware masks (the page loads fast enough locally that the race condition never triggers), headless rendering differences for animations, and timezone mismatches in date assertions. To reproduce locally, run the test with --repeat-each and a high --workers count so the machine is under load, and double-check any date formatting for timezone assumptions.

Should I use test.skip or test.fixme for a known flaky test?

Neither one runs the test. test.skip says the test doesn't apply; test.fixme says the test is broken and someone intends to fix it, which makes it the more honest annotation for a flaky test. The annotation that still runs the test and expects it to fail is test.fail, and it's a poor fit for flakiness because every intermittent pass is reported as a failure. Whichever you choose, add a comment explaining why and link to the tracking issue. An unexplained test.fixme is just confusion waiting to happen.

I added a data-testid but the test is still flaky. What else can I check?

A stable selector doesn't guarantee a stable test. After fixing the selector, check whether the element is being acted on before it's ready (timing) and whether another test is leaving state behind, which shows up as a test that passes in isolation but fails in the full suite (pollution). Selector stability and test isolation are separate problems.

We have 40 flaky tests. Where do we start?

Sort by failure rate, not by how annoying they are. Fix the tests that fail most often first: they're degrading CI reliability the most. As you fix them, patterns will emerge: if 15 of them share the same root cause (say, a spinner that needs to disappear before interactions), a single fix pattern applies across all of them.
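
If you keep even a rough record of each flaky test, the triage can be sketched in a few lines. The FlakyTestStats shape and both function names are hypothetical; the data would come from your own CI history.

```typescript
// Hypothetical record of one flaky test's recent history.
interface FlakyTestStats {
  title: string;
  runs: number;
  failures: number;
  rootCause: string; // e.g. 'timing', 'isolation', 'selector'
}

// Sorts tests by failure rate, highest first, without mutating the input.
function rankByFailureRate(tests: FlakyTestStats[]): FlakyTestStats[] {
  const rate = (t: FlakyTestStats) => (t.runs === 0 ? 0 : t.failures / t.runs);
  return [...tests].sort((a, b) => rate(b) - rate(a));
}

// Counts how many tests share each root cause, to surface fixes that apply broadly.
function countByRootCause(tests: FlakyTestStats[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const t of tests) {
    counts.set(t.rootCause, (counts.get(t.rootCause) ?? 0) + 1);
  }
  return counts;
}
```

The first list tells you which test to fix today; the second tells you which fix pattern is worth writing once and reusing.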
