Real-Life Bugs
Bugs that made history: rockets exploded, hospitals harmed patients, billions were lost. Every story has a lesson.
1996 | Space | Integer Overflow | Ariane 5 Rocket Explosion
A rocket that took a decade to develop self-destructed 37 seconds after launch because of a software bug.
Cost: $370 million
Lesson for QA
Reused code from a different system is not automatically safe in a new context. Test under the actual conditions of the new system, not the old one.
Think About It
If you were the QA on the Ariane 5 project, how would you have approached testing the reused Ariane 4 code? What questions would you have asked before signing off on the integration?
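One way to approach the reused-code question is to make every narrowing conversion fail loudly, then exercise it with the value ranges of the new system rather than the old one. The sketch below is a minimal Python illustration, not the flight software: the function name `to_int16_checked` and the sample peak values are hypothetical.

```python
import math

INT16_MIN, INT16_MAX = -32768, 32767


def to_int16_checked(value: float) -> int:
    """Truncate a float to a 16-bit signed integer, rejecting anything that does not fit."""
    if not math.isfinite(value):
        raise OverflowError(f"{value!r} is not a finite number")
    truncated = int(value)  # truncates toward zero
    if not INT16_MIN <= truncated <= INT16_MAX:
        raise OverflowError(f"{value!r} does not fit in a 16-bit signed integer")
    return truncated


# Hypothetical ranges: the old system never went past 20,000, the new one can.
old_profile_peak = 20_000.0
new_profile_peak = 40_000.0

print(to_int16_checked(old_profile_peak))  # fits in 16 bits
try:
    to_int16_checked(new_profile_peak)
except OverflowError as exc:
    print("rejected:", exc)
```

The point of the test is the second call: a value that was impossible on the old system must be part of the test data for the new one.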
1999 | Space | Wrong Units | Mars Climate Orbiter
NASA lost a $327 million spacecraft because one team used metric units and another used imperial. No one noticed for 9 months.
Cost: $327 million
Lesson for QA
Interface contracts matter. When two systems communicate, the format, units, and data types must be explicitly agreed on and tested at the boundary, not assumed.
Think About It
The two teams worked for 9 months without catching the mismatch. What kind of integration test could have caught this on day one? Think about what a "unit boundary test" looks like when two separate companies are involved.
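A "unit boundary test" could look something like the sketch below: every value crossing the interface must carry an explicit unit tag, and the receiving side converts or rejects it. This is an assumption-laden illustration, not the actual mission software. The record format, the `Impulse` type, and `parse_impulse` are hypothetical; only the pound-force to newton conversion factor is a known constant.

```python
from dataclasses import dataclass

# 1 pound-force second expressed in newton-seconds.
LBF_S_TO_N_S = 4.4482216152605


@dataclass(frozen=True)
class Impulse:
    """An impulse value that is always stored in newton-seconds."""
    newton_seconds: float


def parse_impulse(record: dict) -> Impulse:
    """Read a record at the interface boundary; the unit tag is mandatory."""
    unit = record.get("unit")
    value = float(record["value"])
    if unit == "N*s":
        return Impulse(value)
    if unit == "lbf*s":
        return Impulse(value * LBF_S_TO_N_S)
    raise ValueError(f"missing or unknown unit: {unit!r}")


def test_boundary_rejects_untagged_values():
    try:
        parse_impulse({"value": 1.0})
    except ValueError:
        return
    raise AssertionError("untagged value was accepted")


def test_boundary_converts_imperial_values():
    result = parse_impulse({"value": 1.0, "unit": "lbf*s"})
    assert abs(result.newton_seconds - LBF_S_TO_N_S) < 1e-9


test_boundary_rejects_untagged_values()
test_boundary_converts_imperial_values()
```

When two companies are involved, the useful part is that both sides can run the same two tests against the shared record format on day one.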
1985-1987 | Medical | Race Condition | Therac-25 Radiation Machine
A radiation therapy machine killed patients by delivering massive overdoses. The bug: a race condition that only triggered when operators typed fast.
Cost: At least 3 deaths
Lesson for QA
Race conditions depend on timing and are notoriously hard to catch in normal testing. Safety-critical systems need independent hardware interlocks; software alone is not enough.
Think About It
The bug only appeared when operators typed quickly. How do you test for timing-dependent bugs? What testing techniques (stress testing, concurrency testing, or user simulation) could have surfaced this before it reached patients?
2012 | Finance | Deployment Error | Knight Capital: $440M in 45 Minutes
A trading firm lost $440 million in 45 minutes because old code was accidentally re-enabled during a deployment. The firm needed an emergency rescue to survive and was acquired the following year.
Cost: $440 million
Lesson for QA
Deployments must be automated, verified, and consistent across all instances. A single server running old code can be catastrophic. Also: delete dead code, don't just disable it with a flag.
Think About It
Knight Capital had no automated check that all 8 servers were identical after deployment. What would a post-deployment smoke test have looked like here? What signals could a monitoring system have caught within the first 60 seconds of trading?
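A post-deployment check of this kind can be very small. The sketch below assumes each server can report the build it is running; the host names, build strings, and the `find_version_drift` helper are all hypothetical.

```python
def find_version_drift(reported: dict, expected: str) -> dict:
    """Return every host whose reported build differs from the expected one."""
    return {host: build for host, build in reported.items() if build != expected}


# Hypothetical post-deployment report: eight servers, one missed by the rollout.
reported_builds = {f"server-{n}": "2.0.1" for n in range(1, 8)}
reported_builds["server-8"] = "1.9.7"

drift = find_version_drift(reported_builds, expected="2.0.1")
if drift:
    print("deployment check failed:", drift)
```

In a real pipeline the non-empty result would block the release from taking traffic instead of just printing a message.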
2003 | Infrastructure | Silent Exception | Northeast Blackout: 55 Million Without Power
A software bug silently swallowed an alarm. Power grid operators had no idea their system was failing until 55 million people lost electricity.
Cost: $6 billion
Lesson for QA
Silent failures are the most dangerous kind. A system that crashes loudly is better than one that silently ignores errors. Error handling must be tested explicitly, not just the happy path.
Think About It
The alarm system was silently broken for over an hour before operators realized it. What does your test suite need to look like to catch "the alarm doesn't fire when it should"? How do you write a test for the absence of something?
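Testing for the absence of an alarm usually means two things: injecting a fault and asserting that an alarm was actually delivered, and separately detecting when the alarm pipeline itself has gone quiet. The sketch below shows both under simplifying assumptions; `AlarmMonitor`, `is_stale`, and the thresholds are hypothetical and unrelated to the real grid software.

```python
class AlarmMonitor:
    """Raises an alarm through an injected callable when a reading exceeds the limit."""

    def __init__(self, limit: float, notify):
        self.limit = limit
        self.notify = notify  # callable that delivers the alarm

    def check(self, reading: float) -> None:
        if reading > self.limit:
            self.notify(f"reading {reading} exceeds limit {self.limit}")


def is_stale(last_heartbeat: float, now: float, max_age: float) -> bool:
    """True when the alarm pipeline has not reported in for longer than max_age seconds."""
    return now - last_heartbeat > max_age


def test_alarm_fires_on_fault():
    delivered = []
    monitor = AlarmMonitor(limit=100.0, notify=delivered.append)
    monitor.check(150.0)  # inject the fault
    assert len(delivered) == 1, "fault injected but no alarm was delivered"


def test_alarm_stays_quiet_when_healthy():
    delivered = []
    monitor = AlarmMonitor(limit=100.0, notify=delivered.append)
    monitor.check(50.0)
    assert delivered == []


def test_silence_is_detected():
    assert is_stale(last_heartbeat=0.0, now=120.0, max_age=60.0)
    assert not is_stale(last_heartbeat=100.0, now=120.0, max_age=60.0)


test_alarm_fires_on_fault()
test_alarm_stays_quiet_when_healthy()
test_silence_is_detected()
```

The heartbeat check is what turns "nothing happened" into something a test can assert on: silence longer than `max_age` is itself treated as a failure.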
2017 | Cloud | Human Error | Amazon S3 Outage (One Typo)
Amazon's S3 storage went down for hours, taking a huge chunk of the internet with it. Cause: one engineer typed the wrong number during a debugging command.
Cost: An estimated $150 million for S&P 500 companies alone
Lesson for QA
Runbooks and operational commands need safeguards: confirmation prompts, dry-run modes, limits on blast radius. And your status page should not depend on the service it's monitoring.
Think About It
Amazon's own health dashboard showed green during the outage because it ran on S3. Think about your current or future projects: are any of your monitoring tools hosted on the same system they monitor? What does true independent monitoring look like?
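The safeguards named in the lesson (a dry-run default and a limit on blast radius) can be sketched in a few lines. This is a hypothetical operational helper, not Amazon's tooling: `plan_removal`, `remove_servers`, and the 5% cap are assumptions chosen for illustration.

```python
def plan_removal(fleet_size: int, requested: int, max_percent: int = 5) -> int:
    """Validate a capacity-removal request before anything is actually removed."""
    if requested < 0:
        raise ValueError("requested must not be negative")
    allowed = fleet_size * max_percent // 100
    if requested > allowed:
        raise ValueError(
            f"refusing to remove {requested} of {fleet_size} servers; limit is {allowed}"
        )
    return requested


def remove_servers(fleet_size: int, requested: int, dry_run: bool = True) -> str:
    """Dry run is the default, so a real removal needs an explicit dry_run=False."""
    count = plan_removal(fleet_size, requested)
    if dry_run:
        return f"dry run: would remove {count} servers"
    return f"removed {count} servers"


print(remove_servers(1000, 10))
try:
    remove_servers(1000, 500, dry_run=False)  # a mistyped, far too large number
except ValueError as exc:
    print("blocked:", exc)
```

A mistyped number still gets typed, but the command refuses instead of executing it.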
2014 | Security | Missing Bounds Check | Heartbleed: The Bug That Bled the Internet
A two-line bug in OpenSSL let anyone read private memory from servers: passwords, private keys, everything. It had been sitting in production for 2 years.
Cost: Unknown; affected ~17% of secure websites
Lesson for QA
Never trust user-supplied lengths, sizes, or indices without validation. Always check that a claimed size matches the actual data. This class of bug (buffer over-read) is as old as computing and still being made.
Think About It
Heartbleed left no logs, so attackers could extract data with zero trace. How would you test a "heartbeat" feature for this class of vulnerability? What does a security-focused test case look like compared to a functional one?
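The difference between a functional and a security-focused test shows up clearly in a toy heartbeat handler. The sketch below is not OpenSSL's code and uses a simplified, hypothetical request shape (a payload plus a separately claimed length). Python slicing would not over-read memory the way a C `memcpy` can, so the explicit check models what a C implementation has to do before copying.

```python
def build_heartbeat_response(payload: bytes, claimed_length: int) -> bytes:
    """Echo the payload back, but only as many bytes as were really received."""
    if claimed_length < 0 or claimed_length > len(payload):
        raise ValueError(
            f"claimed length {claimed_length} does not match payload of {len(payload)} bytes"
        )
    return payload[:claimed_length]


def test_functional_honest_request():
    # Functional test: a well-behaved client gets its payload back.
    assert build_heartbeat_response(b"bird", 4) == b"bird"


def test_security_lying_request():
    # Security test: the client claims far more data than it sent.
    try:
        build_heartbeat_response(b"bird", 65535)
    except ValueError:
        return
    raise AssertionError("oversized length claim was accepted")


test_functional_honest_request()
test_security_lying_request()
```

The functional test asks "does it work when used correctly?"; the security test asks "what happens when the input lies about itself?".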
2020 | Gaming | Under-tested Platform | Cyberpunk 2077 Launch
One of the most anticipated games in history launched broken on PlayStation: crashes, visual glitches, unplayable performance. CD Projekt Red offered mass refunds, and Sony removed the game from the PlayStation Store.
Cost: $51 million in refunds + reputation
Lesson for QA
Testing on minimum-spec / oldest supported hardware is not optional. If you claim to support a platform, you must test on that platform, not just the newest or fastest version.
Think About It
The PS4 base model was listed as a supported platform on the box. At what point in development should QA have flagged the performance gap? How do you write acceptance criteria for "playable performance"?
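One way to make "playable performance" testable is to turn it into a measurable threshold on frame times captured on the minimum-spec hardware. The sketch below is only an illustration: the 33.3 ms budget (roughly 30 frames per second), the 99% requirement, and the sample data are hypothetical, not figures from the game's development.

```python
def meets_frame_budget(frame_times_ms, budget_ms=33.3, required_fraction=0.99):
    """True when at least required_fraction of frames finish within budget_ms."""
    if not frame_times_ms:
        raise ValueError("no frames captured")
    within = sum(1 for t in frame_times_ms if t <= budget_ms)
    return within / len(frame_times_ms) >= required_fraction


# Hypothetical captures from the oldest supported console.
smooth_capture = [30.0] * 100
choppy_capture = [30.0] * 90 + [80.0] * 10  # one frame in ten blows the budget

print(meets_frame_budget(smooth_capture))
print(meets_frame_budget(choppy_capture))
```

An acceptance criterion written this way can fail a build automatically, long before launch, instead of relying on someone's impression that the game "feels slow".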
2003 | Automotive | Spaghetti Code | Toyota Unintended Acceleration
Toyota cars accelerated on their own, causing crashes and deaths. The root cause: a single, deeply flawed piece of software that had 10,000 global variables and no safe failure mode.
Cost: At least 89 deaths, $1.2 billion fine
Lesson for QA
Software that controls physical systems has no margin for silent failures. Code quality is not just a technical concern; it is a safety concern. Global mutable state in safety-critical systems is a red flag that demands independent review.
Think About It
Toyota engineers knew about 400 defects but the product shipped. How does QA operate when business pressure overrides engineering concerns? What does a QA engineer do when they know something is wrong but the release decision is out of their hands?
2018 | Cloud | Configuration Error | Facebook: 14 Hours of Darkness
A routine maintenance script accidentally revoked its own access credentials, then cascaded into locking out the engineers trying to fix it. Facebook went dark for 14 hours.
Cost: ~$90 million in lost ad revenue + trust
Lesson for QA
Operational tools that fix outages must not depend on the systems they fix. Access controls, audit tools, and rollback mechanisms all need out-of-band paths that survive a total network failure.
Think About It
The audit tool confirmed a bad config as valid; the tool itself had a bug. How do you test the tools that catch bugs? Who tests the testers? Think about how you would design a testing strategy for infrastructure validation scripts.
2021 | Finance | Logic Error | Robinhood Sells Users' Stocks Without Permission
Robinhood's algorithm incorrectly flagged thousands of accounts as having too much risk and automatically sold their positions without notification, during a volatile market.
Cost: Class-action lawsuit, SEC investigation
Lesson for QA
Automated systems that make irreversible decisions (especially financial ones) need explicit confirmation gates and real-time notifications. "The algorithm decided" is not acceptable when the action cannot be undone.
Think About It
The system was designed to limit buying, but it accidentally also sold existing holdings. This is a classic case of a feature doing more than it was supposed to. What test cases would you write to verify that a risk-management feature does *only* what it's supposed to and nothing else?
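A test for "does only what it is supposed to" typically snapshots everything the feature should not touch and compares it afterwards. The sketch below uses a hypothetical, heavily simplified account model; `Account`, `apply_risk_restriction`, and `place_buy` are illustrative names, not Robinhood's system.

```python
from dataclasses import dataclass, field


@dataclass
class Account:
    positions: dict = field(default_factory=dict)
    buying_restricted: bool = False


def apply_risk_restriction(account: Account) -> None:
    """Intended behavior: block new purchases and touch nothing else."""
    account.buying_restricted = True


def place_buy(account: Account, symbol: str, quantity: int) -> None:
    if account.buying_restricted:
        raise PermissionError("buying is restricted on this account")
    account.positions[symbol] = account.positions.get(symbol, 0) + quantity


def test_restriction_blocks_buys_and_leaves_holdings_alone():
    account = Account(positions={"ABC": 10, "XYZ": 3})
    before = dict(account.positions)  # snapshot of what must not change

    apply_risk_restriction(account)
    assert account.positions == before, "restriction altered existing holdings"

    try:
        place_buy(account, "ABC", 1)
    except PermissionError:
        pass
    else:
        raise AssertionError("restricted account was allowed to buy")
    assert account.positions == before, "rejected order still changed holdings"


test_restriction_blocks_buys_and_leaves_holdings_alone()
```

The two snapshot comparisons are the negative assertions: they would fail if the restriction ever sold or modified a position as a side effect.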
2013 | Infrastructure | No Load Testing | Healthcare.gov Launch Failure
The US government launched a health insurance website for 300 million Americans. On day one, it crashed under the weight of real users. Almost no realistic load testing had been done.
Cost: $840 million (original budget); over $2 billion total
Lesson for QA
A launch date is not a substitute for a go/no-go checklist. Load testing must simulate real user volumes, not just comfortable ones. And someone must own end-to-end quality across all system boundaries, not just one component in isolation.
Think About It
The engineers knew the site wasn't ready but the launch happened anyway. This is one of the most common real-world QA dilemmas. How do you communicate risk to stakeholders when the schedule is politically fixed? What does a good "go/no-go" recommendation look like?