End-to-end tests miss the fatal errors. You need the annoying testers who ask the dumb questions to find the real risks in your code.

At Yonder, the B2B SaaS company I co-founded, we recently suffered partial data loss when a release went wrong because of a faulty data migration for a new feature.

If you’re interested in how we managed disaster recovery as a team, please head over to this article.

If you’re interested in the QA side of this incident, read on.

Incidents, Serious Incidents, and Accidents

Ironically, I was attending an event on aircraft accident investigation when disaster struck. And just as with aircraft accidents, disaster struck because one small but important thing was overlooked.

In aviation, there is empirical evidence that roughly 100 incidents lead to 1 serious incident, and roughly 100 serious incidents lead to 1 accident. In other words, about 10,000 incidents lie behind every accident.

There are therefore two ways aircraft accident investigation can work: reactively, by analyzing the incidents and serious incidents that led to an accident once it has occurred, or proactively, by collecting information on as many incidents and serious incidents as possible to prevent future accidents.

Bombers in World War II

In World War II, of every 100 Allied airmen serving on bombers, 45 were killed, 6 were seriously wounded, 8 became prisoners of war, and only 41 escaped physically unharmed. Of those who were flying at the beginning of the war, only 10 percent survived.

Despite those dire statistics, the Allied command insisted that bombing was critical to the success of the war. They wanted to increase armor on their bombers to reduce losses. But because you can’t strengthen a bomber like a tank, they wanted to find out where to put the additional armor to minimize the losses.

Look at the header image of this article for a moment.

As the planes returned from their missions, analysts counted the bullet holes on each part of the aircraft. The planes showed similar concentrations of damage in three areas: the fuselage, the outer wings, and the tail.

The obvious but wrong answer was to add armor to these heavily damaged areas.

Why was this obvious answer wrong? Because they had only looked at airplanes that had returned. Armor was needed on the sections that, on average, had few bullet holes, such as the cockpit or the engines. Planes with bullet holes in those parts never made it back.
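To make the reasoning concrete, here is a small simulation of the effect. It is only a sketch under made-up assumptions: the section names follow the story above, but the lethality values, the number of planes, and the number of hits per plane are invented for illustration and are not historical data. Every section is hit equally often, yet the planes that return show far fewer holes in the cockpit and engines.

```python
import random
from collections import Counter

SECTIONS = ["fuselage", "outer_wings", "tail", "cockpit", "engines"]

# Assumed probability that a single hit in a section downs the plane.
# These numbers are invented for illustration, not historical data.
LETHALITY = {
    "fuselage": 0.05,
    "outer_wings": 0.05,
    "tail": 0.05,
    "cockpit": 0.8,
    "engines": 0.8,
}


def simulate(n_planes=10_000, hits_per_plane=3, seed=42):
    """Return (holes on all planes, holes on returned planes) per section."""
    rng = random.Random(seed)
    all_hits = Counter()
    returned_hits = Counter()
    for _ in range(n_planes):
        # Every section is equally likely to be hit.
        hits = [rng.choice(SECTIONS) for _ in range(hits_per_plane)]
        # The plane returns only if it survives every single hit.
        survived = all(rng.random() >= LETHALITY[s] for s in hits)
        all_hits.update(hits)
        if survived:
            returned_hits.update(hits)
    return all_hits, returned_hits


if __name__ == "__main__":
    all_hits, returned_hits = simulate()
    for section in SECTIONS:
        print(f"{section:12} all: {all_hits[section]:5}  returned: {returned_hits[section]:5}")
```

Counting only the "returned" column would suggest that the cockpit and engines are rarely hit, when in this model they are hit just as often as everything else.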

And Now, How Does This Relate to Software QA?

The same bias that misled the Allied bomber command in World War II applies to software QA.

In QA, looking at the bullet holes in the fuselage, the outer wings, and the tail is what happens when you run your automated end-to-end tests. You bullet-proof your software for clearly defined routines that you think your users will perform regularly. And every time one of those routines fails, you add a bugfix before the next release.

There is nothing wrong with that. At Yonder, our bug rate has dropped dramatically since we introduced systematic end-to-end tests.

However, what happened during our recent data loss episode was not covered by end-to-end tests. For a new feature, we needed a data migration that modified the metadata of all PDF files in our system. The migration modified the metadata correctly, but it corrupted the PDF files themselves in the process.

Now you might ask, how could something as serious as that go unnoticed?

Well, it didn’t go unnoticed. When the data migration was tested, our team noticed that some PDF files in our system were corrupted after running the migration. But they incorrectly attributed the file corruption to the test files used, rather than the data migration itself.
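One way to settle the question "was it the test files or the migration?" is to check every file before and after the migration runs. The following is a minimal sketch, not what we run at Yonder: the function names `looks_like_pdf` and `broken_by_migration` are hypothetical, and the check itself is only a cheap heuristic (a `%PDF-` header at the start and an `%%EOF` marker near the end), not a full PDF validation.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Cheap structural check: PDF header at the start, %%EOF marker near the end.

    This is a heuristic, not a full validation of the PDF format.
    """
    return data.startswith(b"%PDF-") and b"%%EOF" in data[-1024:]


def broken_by_migration(before: dict[str, bytes], after: dict[str, bytes]) -> list[str]:
    """Names of files that passed the check before the migration but fail it after.

    Files that were already broken before the migration are not reported,
    so bad test data cannot be confused with damage done by the migration.
    """
    return sorted(
        name
        for name, data in before.items()
        if looks_like_pdf(data) and not looks_like_pdf(after.get(name, b""))
    )


if __name__ == "__main__":
    good = b"%PDF-1.7\n(hypothetical content)\n%%EOF\n"
    before = {"a.pdf": good, "b.pdf": b"not a pdf"}
    after = {"a.pdf": b"\x00\x00", "b.pdf": b"not a pdf"}
    print(broken_by_migration(before, after))
```

In this made-up example, `b.pdf` was already invalid before the migration and is ignored, while `a.pdf` is reported because it was fine before and broken afterwards.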

Conclusion

Software QA is not just about looking at the obvious cases. Sure, you need to look at the obvious cases, and that’s why automated end-to-end tests make sense.

But you also need the inconvenient people on your team who treat your software like the dumbest user on the planet and ask tedious questions about apparently irrelevant things. That’s inconvenient, but it helps you avoid your next data loss disaster.