When a data migration went wrong and wiped production data on a Thursday, our team had one goal: full SaaS recovery before the weekend.

At Yonder, the B2B SaaS company I co-founded, we perform software releases regularly — just like all software companies. And just like any B2B SaaS company, we perform weekly backup and disaster recovery tests.

A few weeks ago, we planned a release for Thursday at 4 pm. No big deal: we were shipping a few bug fixes and preparing a few things under the hood for upcoming features. Testing took a little while due to a data migration required for the upcoming features. Nothing looked suspicious, and QA signed off on the release.

Just two hours after the release was out, it became clear that the data migration had deleted some production data due to an erroneous operation that was not caught during testing.

Now the whole chain of disaster recovery hit our team. Here is what happened, and what we learned from it.

H+2: Houston, We Have A Problem

Just two hours after the release, I received a chat message from our lead developer saying we had a serious issue and that I was needed urgently on a call. I was in another meeting, but since I’m rarely summoned that urgently, I left the meeting and joined the call.

In that call, it was still unclear what had caused the data loss. Nevertheless, we agreed to start immediate disaster recovery from the backups taken the previous night. We also prioritized the sequence in which customer tenants would be restored from backup.

I left the call and went back to the meeting I was in. I didn’t pay as much attention as I should have because I was thinking about the root cause and possible fixes for the data loss.

H+4: Reducing the Problem

My evening meeting was over, and I was at the networking reception that followed the meeting. My phone vibrated, informing me that the first customer tenants had been restored from backup and were up and running again.

Even though I love networking, my thoughts were still with the data loss problem. Finding the root cause was one thing, but I was starting to worry that the team would pull an all-nighter to restore each customer tenant from backup and start making mistakes when they got tired. And since it was Thursday night and I didn’t know the root cause of the problem yet, I created my own worst-case scenario: late on Friday afternoon, we would still not have fixed the problem, the problem would have been amplified by mistakes made by a tired team, and our customers would be left without a solution over the weekend.

So I focused my mind on reducing the problem rather than amplifying it. Since the data loss was only affecting one particular file type and not all our customers use this particular file type, our Chief Customer Officer and I could reduce the problem from all customer tenants to roughly half the customer tenants. So I texted back to our lead developer that I had some good news in this bad situation, and that I would call him as soon as I was on the train home from my networking reception.

H+6: The Cherry-Pick Alternative

I try to avoid delicate business calls on a train, but this time, I made an exception. Time was more important than confidentiality, plus the train was empty anyway at 10 pm.

I called our lead developer and told him about the “good news” that only half of our customers were affected. And now it was his turn with the good news: First, the root cause could be clearly identified. And second, instead of restoring customer tenants from backup, the team had already prepared a solution to cherry-pick the deleted data from the backup without requiring a full restore, saving us lots of time. The only thing still open was a test of this cherry-pick solution on a large customer tenant to assess the time component for the fix.
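To make the cherry-pick idea concrete, here is a minimal Python sketch. It is an illustration only, not our actual tooling: the `FileRecord` type, the `cherry_pick_restore` function, and the idea of keying records by an ID are all assumptions made for this example. The point it shows is that only records of the affected file type that exist in the backup but are missing from the live data get copied back, while everything customers created or changed after the backup stays untouched.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileRecord:
    """Hypothetical stand-in for one stored file of a customer tenant."""
    record_id: str
    file_type: str
    payload: bytes


def cherry_pick_restore(backup, live, affected_type):
    """Copy back records of `affected_type` that are in `backup` but not in `live`.

    `backup` and `live` are dicts mapping record_id -> FileRecord.
    `live` is modified in place; records already present in `live` are never
    overwritten, so changes made after the backup was taken are preserved.
    Returns the list of restored record IDs.
    """
    restored = []
    for record_id, record in backup.items():
        if record.file_type != affected_type:
            continue  # other file types were not hit by the migration
        if record_id in live:
            continue  # still present (possibly edited since the backup)
        live[record_id] = record
        restored.append(record_id)
    return restored


# Example: "a" was deleted, "c" was edited after the backup, "d" is new.
backup = {
    "a": FileRecord("a", "report", b"v1"),
    "b": FileRecord("b", "image", b"v1"),
    "c": FileRecord("c", "report", b"v1"),
}
live = {
    "b": FileRecord("b", "image", b"v1"),
    "c": FileRecord("c", "report", b"v2"),
    "d": FileRecord("d", "report", b"v1"),
}
restored_ids = cherry_pick_restore(backup, live, "report")
print(restored_ids)
```

One caveat of this simplified version: it cannot tell a record deleted by the faulty migration from one a customer deleted on purpose after the backup, so a real procedure would need an extra check for that case.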

I shared my worry that people would pull an all-nighter, get tired, and make mistakes, so we agreed to call it a night, continue testing the cherry-pick solution the next morning, and make the necessary decisions about the next steps at 10 am.

H+14: An Early Start

When I start working around 6 am, I normally don’t see any “available” badges in Microsoft Teams from our dev team. On this Friday morning, however, I wasn’t the first person at work. The cherry-pick tests were already in full swing.

H+18: Decision Meeting

The whole dev team and I met for a decision meeting, where we discussed the status, risks, and timing of the cherry-pick solution. We all agreed that it was safe to proceed with this solution, so I informed our Chief Customer Officer that we had a way forward. Within minutes, he provided the priority list so that we could restore service for our customers in the right order. Work started immediately once we left the call.

H+24: End of Crisis

Exactly 24 hours after the release that went wrong, full service was restored to all customer tenants. And just like in a good movie on cybercrime or the news ticker of your favorite news outlet, everybody who was involved could follow the operation in real time in a Microsoft Teams chat.

Success Factors

For such an operation to succeed, many gears have to mesh together. Here are the key success factors we identified:

  1. The instant availability of the entire team to work late and start working again early in the morning. This was not just the dev team; it was also the customer team communicating with customers, setting priorities, and checking customer data after restoration. Of course, we have an on-call team on stand-by for emergencies 24/7, but this operation needed far more people than we keep on stand-by.
  2. From the start, the entire team was thinking in options rather than focusing on one single course of action. That proved decisive in speeding up the back-to-normal timeline.
  3. Whilst I was busy coordinating the technical activities with the dev team, our Chief Customer Officer communicated proactively with our customers to keep them up to date about system limitations, restoration options, and timelines. Despite the inconvenience caused, our customers appreciated the proactive communication and the frequent status updates. And because the status updates were kept away from the dev team, they could focus on the technical solution instead of being disturbed by constant requests for status updates.
  4. Last but not least, despite all the hectic activity, we managed to reduce the problem instead of amplifying it. Looking back on how hectic things were even after reducing the problem, I don’t want to imagine the havoc we would have created by amplifying it.

Conclusion

Bad things do happen, no matter how skilled your team is, or how seriously you take QA. But when the shit hits the fan, you can only succeed when everybody contributes.

Once the crisis was over, we debriefed on what had happened and why, and we identified some important lessons on how to further improve our QA.

Ironically, this all happened when I was at an event on aircraft accident investigation. The parallels between aircraft accident investigation and software accident investigation are striking — but that’s a story for another day.