💥

Real-Life Bugs

Bugs that made history — rockets exploded, hospitals harmed patients, billions were lost. Every story has a lesson.

🚀

1996 · Space · Integer Overflow

Ariane 5 Rocket Explosion

A rocket self-destructed about 37 seconds after launch because of a conversion bug in code reused from its predecessor, wiping out a decade of development.

Cost: $370 million


The Full Story

On June 4, 1996, the first Ariane 5 rocket veered off course and self-destructed about 37 seconds into flight. The cause? A 64-bit floating-point number, related to the rocket's horizontal velocity, was converted to a 16-bit signed integer, and it was too large to fit. The conversion overflowed. The code responsible was reused from the Ariane 4 software. It worked fine there, because Ariane 4's trajectory kept the value within the 16-bit limit. Ariane 5 built up horizontal velocity much faster. No one tested the reused code against Ariane 5's actual trajectory values. The unhandled exception shut down the backup inertial reference system first, then the identical primary one. The flight computer then read diagnostic output as if it were flight data and commanded extreme nozzle deflections. The rocket veered off course and self-destructed. $370 million and 10 years of development: gone in 37 seconds.
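
The narrowing conversion at the heart of this story can be sketched in a few lines. This is a minimal illustration, not the flight code: the function names and sample values are hypothetical, and `ctypes.c_int16` is used only because it wraps silently, which makes the danger easy to see. In the real system the conversion raised an exception that nothing handled, which was just as fatal.

```python
import ctypes

INT16_MIN, INT16_MAX = -32768, 32767


def unchecked_to_int16(value: float) -> int:
    """Convert without a range check; out-of-range values wrap silently."""
    return ctypes.c_int16(int(value)).value


def checked_to_int16(value: float) -> int:
    """Convert with an explicit range check and fail loudly if it does not fit."""
    truncated = int(value)
    if not INT16_MIN <= truncated <= INT16_MAX:
        raise OverflowError(f"{value} does not fit in a 16-bit signed integer")
    return truncated


# Hypothetical readings: one inside the old flight envelope, one beyond it.
old_envelope_value = 12_000.0
new_envelope_value = 40_000.0

print(unchecked_to_int16(old_envelope_value))  # 12000
print(unchecked_to_int16(new_envelope_value))  # -25536 (wrapped, silently wrong)
```

A test suite for reused code would feed `checked_to_int16` the largest values the new system can produce, not the largest values the old system ever saw.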

🎓 Lesson for QA

Reused code from a different system is not automatically safe in a new context. Test under the actual conditions of the new system, not the old one.

🤔 Think About It

If you were the QA on the Ariane 5 project, how would you have approached testing the reused Ariane 4 code? What questions would you have asked before signing off on the integration?

🪐

1999 · Space · Wrong Units

Mars Climate Orbiter

NASA lost a $327 million spacecraft because one team used metric units and another used imperial. No one noticed for 9 months.

Cost: $327 million


The Full Story

NASA's Mars Climate Orbiter was supposed to enter Mars orbit in September 1999. Instead, it flew too close to the planet and was lost, most likely burning up or breaking apart in the atmosphere. The cause: ground software supplied by Lockheed Martin reported thruster data in imperial units (pound-force seconds). NASA's navigation team expected metric units (newton-seconds). The two units differ by a factor of 4.45. For 9 months of flight, small trajectory corrections were computed from the wrong numbers, and the spacecraft was gradually pushed off course. The error accumulated slowly, unnoticed. By the time it reached Mars, its approach was about 170 km lower than planned. Two teams, two unit systems, one interface. No validation. No test to check that the units matched.
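
One way to make a unit mismatch impossible to miss is to refuse bare numbers at the interface. The sketch below assumes a hypothetical `Impulse` type and `accept_thruster_data` boundary function; neither comes from the actual mission software. The conversion factor is the standard value of one pound-force second in newton-seconds.

```python
from dataclasses import dataclass

LBF_S_TO_N_S = 4.4482216152605  # 1 pound-force second expressed in newton-seconds


@dataclass(frozen=True)
class Impulse:
    """A thruster impulse that always travels with its unit (hypothetical type)."""
    value: float
    unit: str  # "N*s" or "lbf*s"

    def to_newton_seconds(self) -> float:
        if self.unit == "N*s":
            return self.value
        if self.unit == "lbf*s":
            return self.value * LBF_S_TO_N_S
        raise ValueError(f"unknown impulse unit: {self.unit!r}")


def accept_thruster_data(impulse: Impulse) -> float:
    """Boundary function: reject bare numbers, always return newton-seconds."""
    if not isinstance(impulse, Impulse):
        raise TypeError("thruster data must carry an explicit unit")
    return impulse.to_newton_seconds()


print(accept_thruster_data(Impulse(10.0, "lbf*s")))
```

A boundary test then becomes simple: send the same physical impulse in both units and assert that the receiving side computes the same number.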

🎓 Lesson for QA

Interface contracts matter. When two systems communicate, the format, units, and data types must be explicitly agreed on and tested at the boundary — not assumed.

🤔 Think About It

The two teams worked for 9 months without catching the mismatch. What kind of integration test could have caught this on day one? Think about what a "unit boundary test" looks like when two separate companies are involved.

☢️

1985 · Medical · Race Condition

Therac-25 Radiation Machine

A radiation therapy machine killed patients by delivering massive overdoses. The bug: a race condition that only triggered when operators typed fast.

Cost: At least 3 deaths


The Full Story

The Therac-25 was a radiation therapy machine used to treat cancer. Between 1985 and 1987, it delivered massive radiation overdoses to at least 6 patients. Three died. The cause was a software bug. The machine had two modes: a low-power electron beam and a high-power X-ray mode. X-ray mode uses a physical filter to protect the patient. The software had a race condition: if an operator typed commands quickly (switching mode and hitting Enter fast), the machine could activate the high-power beam without engaging the protective filter. The bug only triggered under specific timing conditions. Slow typists never saw it. Experienced, fast-typing operators hit it repeatedly. Earlier hardware models had physical interlocks that prevented this. The Therac-25 removed them, relying entirely on software safety. When the software failed, there was no backup.
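
The interleaving described above can be modelled deterministically. The class below is a toy model with hypothetical names, not the Therac-25 software: setup captures the mode once, a quick edit changes only the software's idea of the mode, and an independent check of the physical state before firing is what catches the mismatch.

```python
class TherapyMachine:
    """Toy model: setup captures the mode once; later edits are not re-applied."""

    def __init__(self) -> None:
        self.mode = "electron"
        self.filter_in_place = False
        self._filter_wanted = False

    def begin_setup(self, mode: str) -> None:
        self.mode = mode
        self._filter_wanted = (mode == "xray")  # hardware plan is fixed here

    def edit_mode(self, mode: str) -> None:
        # A fast edit lands while setup is still running and only
        # changes the software's record of the mode.
        self.mode = mode

    def finish_setup(self) -> None:
        self.filter_in_place = self._filter_wanted

    def fire_unsafe(self) -> str:
        return f"beam on ({self.mode}, filter_in_place={self.filter_in_place})"

    def fire_with_interlock(self) -> str:
        # Independent check of the physical state right before firing.
        if self.mode == "xray" and not self.filter_in_place:
            raise RuntimeError("interlock: X-ray mode without filter")
        return self.fire_unsafe()


machine = TherapyMachine()
machine.begin_setup("electron")
machine.edit_mode("xray")  # the quick edit
machine.finish_setup()
print(machine.fire_unsafe())  # beam on (xray, filter_in_place=False)
```

Forcing the bad ordering explicitly, as this script does, is usually more reliable than hoping a stress test stumbles onto it.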

🎓 Lesson for QA

Race conditions depend on timing and are notoriously hard to catch in normal testing. Safety-critical systems need independent hardware interlocks — software alone is not enough.

🤔 Think About It

The bug only appeared when operators typed quickly. How do you test for timing-dependent bugs? What testing techniques (stress testing, concurrency testing, or user simulation) could have surfaced this before it reached patients?

💸

2012 · Finance · Deployment Error

Knight Capital: $440M in 45 Minutes

A trading firm lost $440 million in 45 minutes because old code was accidentally re-enabled during a deployment. The company nearly went bankrupt and was soon acquired.

Cost: $440 million


The Full Story

On August 1, 2012, Knight Capital Group went live with new trading software. The deployment process required manually copying the new code to 8 servers. Someone missed one server. That server still had old code: a long-retired order-handling feature called "Power Peg" that had been dormant for years and was triggered by a flag the new code reused. When markets opened, 7 servers ran the new code correctly. The 8th server ran the old code and flooded the market with erroneous orders, millions of them over the session, buying high and selling low, the opposite of what a trading algorithm should do. It took 45 minutes to identify and stop the issue. By then, Knight Capital had lost $440 million. The company couldn't recover on its own. It needed an emergency rescue investment and was acquired within a year. Root causes: manual deployment, no automated verification that all servers were running the same version, no circuit breaker to detect abnormal trading behaviour.
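
The missing verification step is small enough to sketch. The example below assumes each server can report a version string; the host names, version labels, and the `find_version_drift` helper are all hypothetical.

```python
def find_version_drift(deployed: dict[str, str], expected: str) -> list[str]:
    """Return the servers whose reported version differs from the expected one."""
    return sorted(host for host, version in deployed.items() if version != expected)


# Hypothetical fleet: seven servers updated, one missed.
reported = {f"server-{n}": "v2.0" for n in range(1, 8)}
reported["server-8"] = "v1.3"

drift = find_version_drift(reported, expected="v2.0")
if drift:
    print(f"BLOCK RELEASE: stale version on {', '.join(drift)}")
else:
    print("all servers match the expected version")
```

Run as a gate before markets open, a check like this turns a silent mismatch into a blocked release.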

🎓 Lesson for QA

Deployments must be automated, verified, and consistent across all instances. A single server running old code can be catastrophic. Also: delete dead code, don't just disable it with a flag.

🤔 Think About It

Knight Capital had no automated check that all 8 servers were identical after deployment. What would a post-deployment smoke test have looked like here? What signals could a monitoring system have caught within the first 60 seconds of trading?

🔌

2003 · Infrastructure · Silent Exception

Northeast Blackout: 55 Million Without Power

A software bug silently swallowed an alarm. Power grid operators had no idea their system was failing — until 55 million people lost electricity.

Cost: $6 billion


The Full Story

On August 14, 2003, the largest blackout in North American history hit the northeastern US and Canada. 55 million people lost power. The cause? A race condition in the alarm system software at FirstEnergy in Ohio. When a power line overloaded and tripped, the event was supposed to trigger an alarm in the control room. Instead, a software bug caused the alarm system to silently fail. Operators had no idea anything was wrong. Over the next 90 minutes, as more lines failed (due to cascading effects of the first failure), the system continued silently. Operators made decisions without the information they needed. By the time anyone understood what was happening, a massive cascade failure had knocked out power across eight US states and Ontario. The bug had been in production for over a year. It was never caught because the system appeared to work fine — it just didn't raise alarms when it should.

🎓 Lesson for QA

Silent failures are the most dangerous kind. A system that crashes loudly is better than one that silently ignores errors. Error handling must be tested explicitly, not just the happy path.
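
One way to test for the absence of an alarm is a canary: inject a synthetic event and assert that an alarm actually reaches the operator's display. The classes below are hypothetical stand-ins, not the control-room software involved in the blackout.

```python
class AlarmSystem:
    """Minimal stand-in: turns line-trip events into displayed alarms."""

    def __init__(self) -> None:
        self.displayed: list[str] = []

    def handle(self, event: dict) -> None:
        if event["type"] == "line_trip":
            self.displayed.append(f"ALARM: line {event['line']} tripped")


class StalledAlarmSystem(AlarmSystem):
    """Stand-in for a silently failed system: accepts events, shows nothing."""

    def handle(self, event: dict) -> None:
        pass


def canary_check(system: AlarmSystem) -> bool:
    """Inject a synthetic trip and confirm an alarm reaches the display."""
    before = len(system.displayed)
    system.handle({"type": "line_trip", "line": "CANARY"})
    return len(system.displayed) == before + 1


print(canary_check(AlarmSystem()))         # True
print(canary_check(StalledAlarmSystem()))  # False
```

Running a check like this on a schedule in production, and paging someone when it returns `False`, converts a silent failure into a loud one.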

🤔 Think About It

The alarm system had been silently broken for over a year in production. What does your test suite need to look like to catch "the alarm doesn't fire when it should"? How do you write a test for the absence of something?

☁️

2017 · Cloud · Human Error

Amazon S3 Outage (One Typo)

Amazon's S3 storage went down for hours, taking a huge chunk of the internet with it. Cause: one engineer typed the wrong number during a debugging command.

Cost: Hundreds of millions in downtime (estimated)


The Full Story

On February 28, 2017, Amazon S3, one of the most critical pieces of cloud infrastructure, went down for about 4 hours. It took a large portion of the internet with it: Slack, GitHub, Docker Hub, Medium, and thousands of other services all degraded or went offline. The cause: a typo. An engineer was debugging slowness in the S3 billing system and ran a command to take a small number of servers offline temporarily. They accidentally entered too large a number. A significant portion of the S3 infrastructure went offline. Bringing it back wasn't fast. The affected S3 subsystems hadn't been fully restarted in years, and the restart process had never been exercised at this scale. It took much longer than expected to bring everything back up. Amazon's postmortem was candid about one embarrassing detail: the AWS service health dashboard itself depended on S3, so it could not be updated and kept showing green (healthy) during the outage.

🎓 Lesson for QA

Runbooks and operational commands need safeguards — confirmation prompts, dry-run modes, limits on blast radius. And your status page should not depend on the service it's monitoring.
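
The safeguards named above fit in a few lines. This is a sketch under assumptions: the `plan_removal` function, the 5% limit, and the fleet size are hypothetical and do not describe Amazon's actual tooling.

```python
def plan_removal(requested: int, fleet_size: int,
                 max_percent: int = 5, dry_run: bool = True) -> str:
    """Refuse removals above a blast-radius limit; default to a dry run."""
    if requested < 0 or requested > fleet_size:
        raise ValueError(f"invalid server count: {requested}")
    limit = fleet_size * max_percent // 100
    if requested > limit:
        raise ValueError(
            f"refusing to remove {requested} servers; limit is {limit} "
            f"({max_percent}% of {fleet_size})"
        )
    action = "would remove" if dry_run else "removing"
    return f"{action} {requested} of {fleet_size} servers"


print(plan_removal(20, fleet_size=1000))  # would remove 20 of 1000 servers

try:
    plan_removal(200, fleet_size=1000)    # the typo case: one digit too many
except ValueError as error:
    print(error)
```

The design choice worth noting is the default: the operator has to opt in to the destructive path with `dry_run=False`, and even then the limit still applies.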

🤔 Think About It

Amazon's own health dashboard showed green during the outage because it ran on S3. Think about your current or future projects — are any of your monitoring tools hosted on the same system they monitor? What does true independent monitoring look like?

🩸

2014 · Security · Missing Bounds Check

Heartbleed: The Bug That Bled the Internet

A two-line bug in OpenSSL let anyone read private memory from servers — passwords, private keys, everything. It had been sitting in production for 2 years.

Cost: Unknown — affected ~17% of secure websites


The Full Story

In April 2014, security researchers disclosed a critical vulnerability in OpenSSL called Heartbleed. OpenSSL is the cryptographic library used by the majority of secure websites (HTTPS). Heartbleed allowed attackers to read up to 64KB of arbitrary memory from a vulnerable server — including private keys, passwords, session tokens, and anything else in memory. The bug was in the "heartbeat" feature: a keep-alive mechanism where a client sends a message and the server echoes it back. The client tells the server how long the message is. The server trusted that length without checking if it was accurate. If a client said "my message is 64,000 bytes long" but actually sent 1 byte, the server would read 64,000 bytes of its own memory and return it. No authentication required. The bug was introduced in December 2011 and sat undetected for over 2 years. During that time, anyone who knew about it could silently extract data from millions of servers — with no trace in the logs.
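
The over-read can be simulated safely in a toy model. The sketch below is not OpenSSL code: `MEMORY` is a made-up byte string standing in for process memory, with a 4-byte payload sitting right before unrelated secrets, and both functions are hypothetical.

```python
# Toy "process memory": the payload is followed by data the client must never see.
PAYLOAD = b"ping"
MEMORY = PAYLOAD + b"|private-key=hunter2|session=abc123"


def heartbeat_vulnerable(claimed_length: int) -> bytes:
    """Echo back as many bytes as the client claims it sent."""
    return MEMORY[:claimed_length]


def heartbeat_checked(claimed_length: int) -> bytes:
    """Reject requests whose claimed length exceeds the real payload."""
    if claimed_length < 0 or claimed_length > len(PAYLOAD):
        raise ValueError("claimed length does not match payload")
    return PAYLOAD[:claimed_length]


print(heartbeat_vulnerable(4))   # b'ping'
print(heartbeat_vulnerable(24))  # leaks bytes that follow the payload
```

A security-focused test differs from a functional one in what it sends: instead of a well-formed request, it lies about the length and asserts that the response never contains more than the payload.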

🎓 Lesson for QA

Never trust user-supplied lengths, sizes, or indices without validation. Always check that a claimed size matches the actual data. This class of bug (buffer over-read) is as old as computing and still being made.

🤔 Think About It

Heartbleed left no logs — attackers could extract data with zero trace. How would you test a "heartbeat" feature for this class of vulnerability? What does a security-focused test case look like compared to a functional one?

🎮

2020 · Gaming · Under-tested Platform

Cyberpunk 2077 Launch

One of the most anticipated games in history launched broken on base consoles: crashes, visual glitches, unplayable performance. CD Projekt Red offered mass refunds, and Sony pulled the game from the PlayStation Store.

Cost: $51 million in refunds + reputation


The Full Story

Cyberpunk 2077 was one of the most hyped games ever made. After years of delays, it launched in December 2020. On PC, it was rough but functional. On PlayStation 4 and Xbox One (the base consoles, not the Pro/X versions), it was a disaster. Frame rates dropped to unplayable levels. Characters' faces glitched and stretched. The game crashed constantly. Police AI spawned directly on top of the player. In some areas, the game was simply unfinishable. Sony removed Cyberpunk 2077 from the PlayStation Store — something almost unprecedented. CD Projekt Red issued refunds to all players who asked, at an estimated cost of $51 million. Why? The base PS4 and Xbox One were simply too underpowered for the game as shipped. More critically, those platforms were apparently undertested — developers had focused on the newer, more powerful hardware. The performance gap between the Pro and base console versions was not caught until millions of players experienced it on day one.

🎓 Lesson for QA

Testing on minimum-spec / oldest supported hardware is not optional. If you claim to support a platform, you must test on that platform — not just the newest or fastest version.
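
"Playable" only becomes testable once it is a number. The sketch below shows one possible shape for such a criterion, measured on the minimum-spec device; the thresholds (30 fps average, 99th-percentile frame time under 50 ms) are illustrative assumptions, not figures from the game's actual requirements.

```python
import statistics


def meets_playable_bar(frame_times_ms: list[float],
                       min_avg_fps: float = 30.0,
                       max_p99_frame_ms: float = 50.0) -> bool:
    """Check average frame rate and worst-case stutter against fixed thresholds."""
    avg_fps = 1000.0 / statistics.mean(frame_times_ms)
    ordered = sorted(frame_times_ms)
    # Index of the 99th-percentile sample, computed with integer arithmetic.
    p99_index = min(len(ordered) - 1, len(ordered) * 99 // 100)
    return avg_fps >= min_avg_fps and ordered[p99_index] <= max_p99_frame_ms


smooth_capture = [33.0] * 100                 # steady, about 30 fps
stutter_capture = [20.0] * 95 + [200.0] * 5   # fine on average, with long hitches

print(meets_playable_bar(smooth_capture))   # True
print(meets_playable_bar(stutter_capture))  # False
```

The second capture passes an average-only check, which is why the percentile term is there: long hitches are what players notice.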

🤔 Think About It

The PS4 base model was listed as a supported platform on the box. At what point in development should QA have flagged the performance gap? How do you write acceptance criteria for "playable performance"?

🚗

2003 · Automotive · Spaghetti Code

Toyota Unintended Acceleration

Toyota cars accelerated on their own, causing crashes and deaths. The root cause: a single, deeply flawed piece of software that had 10,000 global variables and no safe failure mode.

Cost: At least 89 deaths, $1.2 billion fine


The Full Story

Between 2002 and 2010, Toyota received thousands of complaints about vehicles suddenly accelerating without driver input. The company blamed floor mats and sticky pedals. An early NASA review did not pin down a software cause, but attention kept returning to the software controlling the engine throttle. Embedded-software experts who later analysed the source code for a lawsuit described it as badly structured and very hard to verify. It had over 10,000 global variables, values shared across the entire program with no isolation. Any part of the code could accidentally modify any of these values. There was no proper fault handling. There was no safe failure mode. The specific failure they demonstrated was a "task death": a software process could fail silently, leaving the throttle stuck in its last commanded position. The system had no reliable mechanism to detect this failure and no fallback to close the throttle. Toyota's own internal engineers had flagged over 400 defects in the code before the lawsuits. In 2014 the company agreed to pay a $1.2 billion penalty to settle a US criminal investigation into how it had handled the problem. Investigators found that the company had misled regulators and the public about the safety issues.
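
The missing fallback for a "task death" is commonly built as a watchdog. The class below is a simplified sketch with hypothetical names and timings, not automotive code: the control task must check in regularly, and if it stops, the output falls back to a closed throttle instead of holding the last command.

```python
class ThrottleWatchdog:
    """Falls back to a closed throttle when the control task stops checking in."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self.last_heartbeat_s = 0.0
        self.commanded = 0.0

    def heartbeat(self, now_s: float, commanded: float) -> None:
        """Called by the control task on every cycle it completes."""
        self.last_heartbeat_s = now_s
        self.commanded = commanded

    def output(self, now_s: float) -> float:
        """Throttle actually applied: the command, or 0.0 if the task is stale."""
        if now_s - self.last_heartbeat_s > self.timeout_s:
            return 0.0  # safe state: throttle closed
        return self.commanded


watchdog = ThrottleWatchdog(timeout_s=0.1)
watchdog.heartbeat(now_s=1.0, commanded=0.6)
print(watchdog.output(now_s=1.05))  # 0.6, task is alive
print(watchdog.output(now_s=1.50))  # 0.0, task went silent
```

Time is passed in as a parameter so the stale-task case can be tested deterministically, without sleeping or killing a real thread.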

🎓 Lesson for QA

Software that controls physical systems has no margin for silent failures. Code quality is not just a technical concern — it is a safety concern. Global mutable state in safety-critical systems is a red flag that demands independent review.

🤔 Think About It

Toyota engineers knew about 400 defects but the product shipped. How does QA operate when business pressure overrides engineering concerns? What does a QA engineer do when they know something is wrong but the release decision is out of their hands?

📵

2021 · Cloud · Configuration Error

Facebook: 6 Hours of Darkness

A routine maintenance command accidentally disconnected Facebook's backbone network, then cascaded into locking out the engineers trying to fix it. Facebook went dark for about 6 hours.

Cost: ~$90 million in lost ad revenue + trust


The Full Story

On October 4, 2021, Facebook experienced one of its largest outages: roughly 6 hours of downtime affecting Facebook, Instagram, and WhatsApp. For most users it looked like a server problem. The real cause was far stranger. A routine maintenance operation issued a command on Facebook's internal backbone network. The command contained an error. An audit tool was supposed to catch commands like this, but the audit tool had a bug too: it let the bad command through. The backbone connections went down, cutting Facebook's data centres off from each other, and their routes were then withdrawn from the public internet. When engineers tried to use their internal tools to roll back the change, those tools also relied on the network that had just gone down. The situation got worse: Facebook's access control systems, the ones that manage who can log into what, were also affected. Engineers needed physical access to fix the routers. Physical access required badge authentication, and badge systems were reportedly failing too. Security teams had to manually verify identities to let engineers into the buildings. The fix required engineers on site at the data centres, and services were brought back gradually to avoid overloading systems as traffic returned.

🎓 Lesson for QA

Operational tools that fix outages must not depend on the systems they fix. Access controls, audit tools, and rollback mechanisms all need out-of-band paths that survive a total network failure.
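
A validation tool can be tested like any other code: keep a set of configurations with known verdicts and check that the tool still gets each one right. The rule, the config shape, and the sample data below are hypothetical and do not describe Facebook's systems.

```python
def audit_config(config: dict) -> bool:
    """Hypothetical audit rule: a change must leave at least one backbone link up."""
    links = config.get("backbone_links", [])
    return any(link.get("enabled", False) for link in links)


KNOWN_BAD = [
    {},
    {"backbone_links": []},
    {"backbone_links": [{"enabled": False}, {"enabled": False}]},
]
KNOWN_GOOD = [
    {"backbone_links": [{"enabled": True}, {"enabled": False}]},
]


def self_test(audit) -> list[str]:
    """Run an auditor against labelled samples and report every wrong verdict."""
    failures = []
    for config in KNOWN_BAD:
        if audit(config):
            failures.append(f"accepted bad config: {config}")
    for config in KNOWN_GOOD:
        if not audit(config):
            failures.append(f"rejected good config: {config}")
    return failures


print(self_test(audit_config))               # []
print(len(self_test(lambda config: True)))   # 3, a rubber-stamp auditor is caught
```

The second call is the important one: a tool that approves everything passes every happy-path test, so the known-bad samples are what prove the auditor can say no.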

🤔 Think About It

The audit tool confirmed a bad config as valid — the tool itself had a bug. How do you test the tools that catch bugs? Who tests the testers? Think about how you would design a testing strategy for infrastructure validation scripts.

📈

2021 · Finance · Logic Error

Robinhood Sells Users' Stocks Without Permission

Robinhood's algorithm incorrectly flagged thousands of accounts as having too much risk and automatically sold their positions — without notification, during a volatile market.

Cost: Class-action lawsuit, SEC investigation


The Full Story

In January 2021, during the height of the GameStop trading frenzy, Robinhood made a series of decisions that would spark congressional hearings and class-action lawsuits. Among them: an automated risk system that silently liquidated user positions. The system was designed to reduce the platform's aggregate risk exposure during extreme market volatility. When certain thresholds were hit, it was supposed to limit new purchases. Instead, a logic error caused it to also sell existing positions — shares users already owned and had not asked to sell. Users woke up to find their stocks had been sold, often at a loss. The notifications, if they came at all, arrived after the fact. Many users were not even aware the liquidation had happened until they checked their balances. Robinhood's public explanation was vague. The company faced a flood of complaints, regulatory investigations, and lawsuits. The incident highlighted a pattern: the system had been making consequential financial decisions — selling people's assets — without any human review or user confirmation step.

🎓 Lesson for QA

Automated systems that make irreversible decisions (especially financial ones) need explicit confirmation gates and real-time notifications. "The algorithm decided" is not acceptable when the action cannot be undone.
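
Checking that a feature does nothing beyond its job means asserting on the state it should leave untouched. The order handler below is a hypothetical toy, not Robinhood's system; the point is the final comparison of holdings before and after a rejected purchase.

```python
def handle_order(holdings: dict[str, int], order: dict, restricted: set[str]) -> str:
    """Apply an order; purchases of restricted symbols are rejected."""
    symbol, quantity = order["symbol"], order["qty"]
    if order["side"] == "buy":
        if symbol in restricted:
            return "rejected"
        holdings[symbol] = holdings.get(symbol, 0) + quantity
        return "filled"
    # Sell path: only the user's own explicit order may reduce a position.
    if holdings.get(symbol, 0) < quantity:
        return "rejected"
    holdings[symbol] -= quantity
    return "filled"


holdings = {"GME": 10, "AAPL": 5}
before = dict(holdings)

result = handle_order(holdings, {"side": "buy", "symbol": "GME", "qty": 1}, {"GME"})
print(result)               # rejected
print(holdings == before)   # True: the restriction changed nothing else
```

The same pattern scales up: snapshot every account before the risk control runs, then assert that the only differences are the ones the specification allows.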

🤔 Think About It

The system was designed to limit buying — but it accidentally also sold existing holdings. This is a classic case of a feature doing more than it was supposed to. What test cases would you write to verify that a risk-management feature does *only* what it's supposed to and nothing else?

🏥

2013 · Infrastructure · No Load Testing

Healthcare.gov Launch Failure

The US government launched a health insurance website meant to serve millions of Americans. On day one, it crashed under the weight of real users. Almost no realistic load testing had been done.

Cost: About $840 million by early 2014; estimates of over $2 billion total


The Full Story

Healthcare.gov launched on October 1, 2013, as the portal for millions of Americans to sign up for health insurance under the Affordable Care Act. Within hours, it was essentially unusable. The site timed out, lost user data, showed errors, and crashed repeatedly. The failures were not a mystery. The site had been built by over 50 different contractors with poor coordination. Integration testing between the systems was minimal. Most critically, the site had never been tested under anything close to realistic load. In the days before launch, internal tests with a fraction of expected user volume had already shown the system struggling. The decision was made to launch anyway — the political deadline was immovable. On day one, roughly 250,000 users tried to sign up. The system had only been tested at a fraction of that volume. For the first two months, the site was essentially broken for most users. A dedicated "tech surge" team was brought in — many from Silicon Valley — and the site was stabilised by late November. But by then, the political and reputational damage was done. The fundamental failures: no single team owned end-to-end quality, no realistic integration tests, no load tests, and a political timeline that overrode engineering readiness signals.

🎓 Lesson for QA

A launch date is not a substitute for a go/no-go checklist. Load testing must simulate real user volumes, not just comfortable ones. And someone must own end-to-end quality across all system boundaries — not just one component in isolation.
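
A go/no-go checklist can be made mechanical, so the recommendation rests on measurements rather than on the calendar. The sketch below is one possible shape; the function, the 1% error threshold, and the sample numbers are hypothetical.

```python
def go_no_go(expected_peak_users: int, tested_users: int,
             error_rate_at_test: float,
             max_error_rate: float = 0.01) -> tuple[bool, list[str]]:
    """Return a launch recommendation plus the reasons behind any 'no-go'."""
    reasons = []
    if tested_users < expected_peak_users:
        reasons.append(
            f"load tested at {tested_users} users, expected peak is {expected_peak_users}"
        )
    if error_rate_at_test > max_error_rate:
        reasons.append(
            f"error rate {error_rate_at_test:.1%} exceeds limit {max_error_rate:.1%}"
        )
    return (not reasons, reasons)


# Hypothetical readiness review: tested far below the expected peak.
ready, reasons = go_no_go(expected_peak_users=250_000, tested_users=1_000,
                          error_rate_at_test=0.08)
print("GO" if ready else "NO-GO")
for reason in reasons:
    print(" -", reason)
```

The list of reasons matters as much as the verdict: it gives stakeholders a concrete statement of the risk they are accepting if they launch anyway.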

🤔 Think About It

The engineers knew the site wasn't ready but the launch happened anyway. This is one of the most common real-world QA dilemmas. How do you communicate risk to stakeholders when the schedule is politically fixed? What does a good "go/no-go" recommendation look like?