An AI model generates the most probable continuation of your input: a vague prompt like "write a Playwright test for login" produces a skeleton with YOUR_URL placeholders and CSS selectors because that's what most training data looks like. The same request with a role, specific context, and explicit constraints on locator types produces code you can paste directly. This article covers the four-part prompt structure, the specific instructions that prevent common AI test generation mistakes, prompts for code review and debugging, and what to do when output drifts back toward generic patterns after several iterations.
Why AI output quality is mostly your problem
Claude, ChatGPT, and GitHub Copilot are not search engines. They generate the most probable continuation of your input. If your input is vague, the continuation is generic. If your input is specific and well-structured, the output is useful.
The model doesn't know:
- What framework you're using
- What your codebase looks like
- What level of detail you want
- What you've already tried
- What constraints you're working under
When you get bad output, the first question isn't "is this model bad?" It's "did my prompt give the model what it needed to do this well?"
The four-part prompt structure
A prompt that works for QA tasks typically has four components. You don't always need all four, but when output is poor, one of these is usually missing.
Role: tell the model what expert it should be. "You are a senior QA automation engineer" or "You are a TypeScript expert who specializes in Playwright test architecture" changes the register and depth of the output.
Context: describe the situation. What framework, what language, what kind of app, what constraint you're working under. The more specific, the better.
Task: the actual request, stated precisely. Not "write a test" but "write a Playwright test in TypeScript that logs in using storageState and verifies that the dashboard header shows the username."
Format: how you want the output. Raw code only? Code with comments? With a brief explanation first? A table? Specify this or you'll get whatever the model defaults to.
Example of a bad prompt:
"Write a Playwright test for login"
Example of the same prompt with all four components:
"You are a senior QA automation engineer. I'm working on a Playwright TypeScript project using the Page Object Model pattern. The login page is at https://lab.becomeqa.com, uses an email/password form, and redirects to /dashboard on success. Write a test that: logs in with test credentials from env variables, asserts the URL changes to /dashboard, and asserts a heading with text 'My Travel Items' is visible. Use getByRole and getByLabel locators only. No comments in the code."
The second prompt gets you code you can paste directly. The first gets you a skeleton with placeholders.
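One way to keep the four parts from getting lost is to assemble prompts from named fields. The sketch below is a minimal illustration, not a required tool: `PromptParts` and `buildPrompt` are hypothetical names, and the example values are simply the login prompt above split into its four parts.

```typescript
// The four parts of the prompt structure. Only the task is mandatory.
interface PromptParts {
  role?: string;
  context?: string;
  task: string;
  format?: string;
}

// Joins whichever parts are present, in Role -> Context -> Task -> Format order.
// Empty or whitespace-only parts are skipped.
function buildPrompt(parts: PromptParts): string {
  const ordered = [parts.role, parts.context, parts.task, parts.format];
  return ordered
    .filter((part): part is string => typeof part === "string" && part.trim().length > 0)
    .map((part) => part.trim())
    .join("\n\n");
}

const loginPrompt = buildPrompt({
  role: "You are a senior QA automation engineer.",
  context:
    "I'm working on a Playwright TypeScript project using the Page Object Model pattern. " +
    "The login page is at https://lab.becomeqa.com, uses an email/password form, " +
    "and redirects to /dashboard on success.",
  task:
    "Write a test that: logs in with test credentials from env variables, " +
    "asserts the URL changes to /dashboard, and asserts a heading with text " +
    "'My Travel Items' is visible.",
  format: "Use getByRole and getByLabel locators only. No comments in the code.",
});

console.log(loginPrompt);
```

When output is poor, a structure like this makes it easy to see which of the four fields was left empty.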
Prompting for test generation
When generating Playwright tests, the most common problems are:
Generic locators. AI defaults to what it's seen most in training data, which includes a lot of CSS selectors. Force it away from them explicitly.
"Use only getByRole, getByLabel, getByPlaceholder, or getByTestId locators. Do not use CSS selectors or XPath."
Missing assertions. Generated tests often have one assertion at the end. Real tests need more.
"Include assertions at each meaningful step, not just the final state. Verify intermediate states where they matter."
No error handling context. If you're testing an error scenario, tell it explicitly.
"Write a test for the case where the user submits the form with an empty email field. Assert that the error message 'Email is required' appears below the email input."
Hardcoded test data. Prevent this upfront.
"Use process.env for credentials. Do not hardcode any usernames, passwords, or URLs. Use placeholder env variable references instead."
Generating tests from a user story
A user story makes a good starting point. Paste it directly and name the scenarios you want covered:
"You are a QA automation engineer. Here is a user story:
'As a logged-in user, I want to add a new item to my travel list so that I can track what to pack.'
The app is at https://lab.becomeqa.com. Write Playwright TypeScript tests that cover: the happy path (item is added and appears in the list), adding a duplicate item, and adding an item with a name that exceeds the character limit. Use getByRole locators. Include at least one assertion per test that verifies the specific outcome. Return only the test code."
Generating test data
For test data generation, Faker.js syntax changes between versions and AI models sometimes hallucinate method names. Anchor it:
"Generate a TypeScript factory function that creates a random user object with fields: email (valid format), firstName, lastName, password (at least 8 characters, one uppercase, one number). Use @faker-js/faker v8 syntax only. Return just the function."
Prompting for code review
AI code review is underused by QA engineers. It's particularly good at catching things that are easy to miss on first read.
Review for test quality:
"Review this Playwright test for: hardcoded waits (page.waitForTimeout), CSS selectors, missing assertions, tests that depend on execution order, and any practices that reduce test stability. List each issue with the line it appears on and why it's a problem."
Review for Page Object Model consistency:
"I'm building a Page Object Model in TypeScript. Here is my base page class and one page object. Review whether: all selectors are in the page class (not in tests), action methods return promises correctly, the constructor follows the pattern, and the class is missing any methods that would be useful based on what's imported. Suggest specific additions."
Spot-check locator quality:
"Look at these locators from my Playwright tests. For each one, tell me: is it fragile (likely to break on a UI redesign)? If yes, suggest a more stable alternative using semantic locators."
Prompting for debugging
When a test fails and you're not sure why, AI can help, but you need to give it the failure output, not just the code.
Template for debugging help:
"This Playwright test is failing. Here is:
1. The test code
2. The full error message and stack trace
3. What the app is supposed to do
[paste each]
What are the most likely causes of this failure? List them in order of likelihood, and for each, explain what I should check."
The error message is critical. Without it, the model is guessing. With it, it can often pinpoint the exact issue: a missing await, a locator that doesn't match, a timing problem.
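The debugging template can be turned into a small helper that refuses to build a prompt without the error output. This is a sketch under assumptions: `FailureReport` and `buildDebugPrompt` are hypothetical names, and the bracketed strings in the usage example are placeholders for your own material.

```typescript
interface FailureReport {
  testCode: string;
  errorOutput: string; // full error message and stack trace
  expectedBehavior: string; // what the app is supposed to do
}

// Assembles the debugging prompt from its three inputs.
// Throws if the error output is missing, since that is the part the model needs most.
function buildDebugPrompt(report: FailureReport): string {
  if (report.errorOutput.trim() === "") {
    throw new Error("Include the full error message and stack trace.");
  }
  return [
    "This Playwright test is failing. Here is:",
    "1. The test code",
    report.testCode.trim(),
    "2. The full error message and stack trace",
    report.errorOutput.trim(),
    "3. What the app is supposed to do",
    report.expectedBehavior.trim(),
    "What are the most likely causes of this failure? List them in order of " +
      "likelihood, and for each, explain what I should check.",
  ].join("\n\n");
}

const debugPrompt = buildDebugPrompt({
  testCode: "[paste the test code]",
  errorOutput: "[paste the full error message and stack trace]",
  expectedBehavior: "[describe what the app is supposed to do]",
});
console.log(debugPrompt);
```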
Flaky test diagnosis:
"This test passes about 70% of the time and fails 30% of the time with this error: [error]. The test does: [describe the flow]. What are the likely causes of this intermittent failure in Playwright? What would you check first?"
Prompting for documentation
Generating documentation for existing test code is one of the highest-value uses of AI in QA. Nobody likes writing it, it takes time, and AI does it reasonably well.
Generate a test suite README:
"Write a README section for this Playwright test suite directory. Include: what the suite covers, prerequisites (env variables required, setup steps), how to run all tests, how to run a specific file, and how to run in debug mode. Base it on this playwright.config.ts and this example test file: [paste both]"
Summarize test coverage:
"Here are my Playwright test files. For each file, write one sentence describing what it tests. Then write a summary paragraph of what's covered and what's missing based on the typical flows for a web app with login, data management, and export features."
Generate a bug report from a failing test:
"This Playwright test is failing and I need to write a bug report. Here's the test, the error, and screenshots. Write a bug report in this format: Title, Environment, Steps to Reproduce, Expected Result, Actual Result, Severity."
Prompting for learning
When you're learning a new Playwright feature, don't ask for generic examples. Ask for examples specific to your project type:
"Explain how Playwright's storageState works, using a concrete example where a test suite has admin and regular user roles. Show the setup file that generates the states and a test that uses each role. Use TypeScript."
"I understand how page.route() works for simple mocking. Explain how to use route.fallback() to partially mock an API, allowing some requests through to the real backend while intercepting only specific ones. Show a concrete example."
The concrete constraint forces the model to go past the introductory explanation into the part that's actually useful.
When to iterate and when to restart
After getting output from AI, you have two options: iterate on it or restart with a better prompt.
Iterate when the structure is right but something specific is wrong. Ask targeted follow-up questions:
- "Change the locator on line 8 to use getByLabel instead of getByPlaceholder"
- "Add an assertion after the form submission that checks the success toast is visible"
- "Refactor the login helper into a separate function"
Restart when the structure itself is wrong and follow-ups keep patching around it. A good heuristic: if you've iterated three times and are still fixing structural problems, restart with a more detailed prompt.
Context windows and long conversations
AI tools lose context across long conversations. After 20+ back-and-forth exchanges, the model may start "forgetting" things you established early on, like the locator convention you specified or the class structure you asked it to follow.
Three practices help:
Keep a prompt template document. For common tasks (generate a test, review a page object, write a bug report), save your best prompts. Don't retype them. Copy and adjust.
Start a new conversation for each task. Context left over from one task creates confusion in the next. Short, focused conversations produce better output than sprawling sessions.
Re-state key constraints if the output drifts. If the model starts using CSS selectors after you told it not to, remind it explicitly: "Remember: only getByRole, getByLabel, getByPlaceholder, getByTestId. No CSS selectors."
What AI tools cannot do for QA
Understanding the limits prevents frustration and helps you know when to stop trying.
They cannot run your tests. The model doesn't have access to your browser, your app, or your test runner. It can write code it believes will work, but it can't verify.
They cannot inspect your live app. Without Playwright MCP set up, the model has no idea what your UI looks like. It's writing tests against an imaginary version of your app based on what you describe. If your description is incomplete, the generated locators will be wrong.
They hallucinate API details. AI tools sometimes generate Playwright method calls that don't exist, or use real methods with wrong parameter signatures. Always check generated code against the Playwright docs before committing it.
They don't know your team's conventions. Unless you paste your actual code for context, the model doesn't know whether your team uses factories or fixtures for test data, how your base page class works, or what your import structure looks like. Give it examples.
They can't make judgment calls about coverage. "What should I test?" requires understanding the risk of the feature, the team's velocity, what's been failing in production, and what the acceptable risk level is. AI can suggest things to test, but coverage decisions are yours.
Building a personal prompt library
Over time, collect the prompts that work. Keep them in a markdown file, a Notion page, or a .claude/commands folder in your project.
Useful categories to build out:
- Generate a new test from a description
- Review a test file for quality issues
- Generate test data with Faker
- Write a Page Object class from a page description
- Debug a Playwright error
- Generate a bug report from a failing test
- Summarize test coverage gaps
- Generate CI workflow YAML for Playwright
Each prompt is a small investment in consistent quality. You don't have to reinvent the approach every time you sit down with an AI tool.
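If you keep the library in code instead of a document, a few lines are enough. The sketch below is one possible shape, not a recommended standard: `promptLibrary`, `fillTemplate`, and the `{{name}}` placeholder syntax are all assumptions made for this example, and the template text is taken from the review and bug report prompts earlier in this article.

```typescript
// Saved prompts with {{placeholders}} for the parts that change per use.
const promptLibrary = {
  reviewTest:
    "Review this Playwright test for: hardcoded waits (page.waitForTimeout), " +
    "CSS selectors, missing assertions, tests that depend on execution order, " +
    "and any practices that reduce test stability. List each issue with the " +
    "line it appears on and why it's a problem.\n\n{{code}}",
  bugReport:
    "This Playwright test is failing and I need to write a bug report. " +
    "Write a bug report in this format: Title, Environment, Steps to Reproduce, " +
    "Expected Result, Actual Result, Severity.\n\nTest:\n{{test}}\n\nError:\n{{error}}",
} as const;

// Replaces every {{name}} in the template with values[name].
// Throws on a missing value so a half-filled prompt is never sent.
function fillTemplate(template: string, values: Record<string, string>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (_match: string, name: string) => {
    const value = values[name];
    if (value === undefined) {
      throw new Error(`Missing value for placeholder "${name}"`);
    }
    return value;
  });
}

const reviewPrompt = fillTemplate(promptLibrary.reviewTest, {
  code: "[paste the test file]",
});
console.log(reviewPrompt);
```

Failing loudly on a missing placeholder is a deliberate choice here: a prompt with a literal {{error}} left in it is exactly the kind of vague input the rest of this article warns against.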