An ineffective backup strategy was identified as the main culprit for yesterday's data loss at source-code hub GitLab.com. GitLab is hugely popular among developers, who appreciate that it is an all-in-one solution, providing everything a developer needs over the course of a project. At its core is a Git-based version control system, paired with helpful extras. As a result, many companies depend on it, from smaller startups and individual developers to larger enterprises like Intel and Red Hat.

On Tuesday evening, GitLab.com found that a fatigued system administrator, working late at night in the Netherlands, had accidentally deleted a directory on the wrong server during a database replication process. It appears that he deleted a folder containing 300GB of live production data that was due to be replicated. By the time he canceled the action, only 4.5GB remained, and the last potentially viable backup was taken six hours beforehand.

From the detailed information available on GitLab.com, it is clear that they knew about possible data protection techniques, ranging from volume snapshots to replication to backup and recovery. However, it is unclear whether they had the right expertise or the right data protection and recovery products in-house to use these techniques correctly. We have seen this before: enterprises with critical applications fail to leverage the right tools available for backup and recovery, deploy legacy solutions for massively distributed cloud applications, or simply rely on replication as a backup strategy. There are several reasons for failed backups:

  • Neglecting purpose-built backup and recovery tools in favor of home-grown scripts (see the sketch after this list)
  • Delaying the deployment of backup and recovery tools until a data loss has occurred
  • Failing to analyze business requirements thoroughly before choosing a solution that must meet application uptime requirements
  • Never actually performing a test recovery, and blindly believing that it will work
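
To make the first point concrete, here is a minimal sketch of why home-grown scripts are risky. Everything in it is a hypothetical illustration (the database name, backup path, and size threshold are assumptions, not details from the GitLab incident): a naive script that never checks the dump's exit code will happily keep an empty file, while a checked version fails loudly.

```python
import subprocess
import sys
from pathlib import Path

# Hypothetical values -- illustrative assumptions, not from the incident.
DB_NAME = "production"
BACKUP_PATH = Path("/var/backups/production.dump")
MIN_EXPECTED_BYTES = 1024 * 1024  # a real dump should be at least 1 MiB

def naive_backup() -> None:
    # Anti-pattern: the exit code of pg_dump is never checked, so a
    # failed dump silently leaves an empty or truncated file behind.
    with BACKUP_PATH.open("wb") as out:
        subprocess.run(["pg_dump", "--format=custom", DB_NAME], stdout=out)

def checked_backup() -> None:
    # Safer: fail loudly on a non-zero exit code and sanity-check the
    # resulting file size before declaring the backup a success.
    with BACKUP_PATH.open("wb") as out:
        subprocess.run(
            ["pg_dump", "--format=custom", DB_NAME],
            stdout=out,
            check=True,  # raises CalledProcessError on failure
        )
    size = BACKUP_PATH.stat().st_size
    if size < MIN_EXPECTED_BYTES:
        sys.exit(f"backup suspiciously small ({size} bytes) -- aborting")

if __name__ == "__main__":
    checked_backup()
```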

Other reasons exist, but what matters is how organizations fix the issue. The real question is "Who is ultimately responsible when errors like this happen?" Is it the operator who made the mistake, the database administrator who is responsible for the database, the architect who designed the end-to-end application stack, or the application owner who bears the business loss?

We all know humans aren't perfect. We make mistakes, and even with our daily IT security practices, errors happen. Organizations can take simple snapshots of every node in a cluster and transfer them to backend storage. However, given the distributed nature of scale-out databases and their frequent hardware failures, patchwork solutions such as node-by-node snapshots become operational nightmares to manage. In the best case, it takes several days to recover the data, resulting in significant application and business downtime; in the worst case, the data may never be recoverable!
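
As a rough illustration of why node-by-node snapshots are patchwork, consider the hypothetical loop below (the host names, SSH invocation, and LVM volume are all assumptions made for the example). Each node is snapshotted independently, so nothing guarantees the snapshots line up to a single consistent point in time across the cluster.

```python
import subprocess

# Hypothetical cluster nodes and volume -- assumptions for illustration.
NODES = ["db-node-1", "db-node-2", "db-node-3"]
SNAPSHOT_CMD = "lvcreate --snapshot --size 10G --name nightly /dev/vg0/data"

def snapshot_cluster() -> None:
    for node in NODES:
        # Each node is snapshotted at a slightly different moment, with no
        # cross-node coordination: writes that land between iterations leave
        # the resulting set of snapshots mutually inconsistent.
        subprocess.run(["ssh", node, SNAPSHOT_CMD], check=True)

if __name__ == "__main__":
    snapshot_cluster()
```

Restoring from such a set means reconciling several divergent points in time by hand, which is exactly where the multi-day recovery windows come from.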

That is why a more robust solution is needed to reduce data loss risk for next-generation application environments. Listed below are some steps organizations can take to develop a reliable data protection and availability strategy:

  • List all possible failure scenarios that may occur in a given environment. Don’t forget human errors!
  • Understand the failure resiliency of the data protection product – no one wants their data protection product to fail when it’s needed most
  • Know your recovery point objective (RPO) and recovery time objective (RTO) so you can choose the right data protection product for your specific requirements (a freshness check against the RPO is sketched after this list)
  • Understand each technology, such as replication, backup & recovery, and snapshots, as well as their limitations
  • Create a recovery plan and test that plan regularly (every quarter) to make sure people and products work as expected during emergency situations
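
To make the RPO point concrete, here is a minimal sketch of a backup freshness check. The backup directory, file pattern, and two-hour RPO are purely illustrative assumptions: if the newest backup is older than the RPO allows, the check fails and should page someone.

```python
import sys
import time
from pathlib import Path

# Illustrative assumptions: the backup location and a 2-hour RPO.
BACKUP_DIR = Path("/var/backups")
RPO_SECONDS = 2 * 60 * 60

def newest_backup_age() -> float:
    """Return the age in seconds of the most recent backup in BACKUP_DIR."""
    backups = list(BACKUP_DIR.glob("*.dump"))
    if not backups:
        sys.exit("no backups found at all -- worst possible RPO")
    newest = max(backups, key=lambda p: p.stat().st_mtime)
    return time.time() - newest.stat().st_mtime

if __name__ == "__main__":
    age = newest_backup_age()
    if age > RPO_SECONDS:
        # In GitLab's case the last viable backup was six hours old; a check
        # like this turns that into an alert rather than a surprise.
        sys.exit(f"RPO violated: newest backup is {age / 3600:.1f} hours old")
    print(f"OK: newest backup is {age / 3600:.1f} hours old")
```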

Whatever the cause of failure, the best way to keep failures from harming your organization is to verify your backups by performing regular test restores. Testing your backups regularly won't prevent failures, but it helps you notice issues early enough to fix them before a real recovery is needed.
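
Here is a minimal sketch of such a test restore, assuming a PostgreSQL custom-format dump; the dump path, scratch database name, and the sanity-check table are all hypothetical. It restores the latest dump into a throwaway database and runs a basic query against it, so a broken backup fails the job instead of sitting undetected.

```python
import subprocess

# Illustrative assumptions: the dump path, scratch database name, and the
# "projects" table used for the sanity query are all hypothetical.
DUMP_PATH = "/var/backups/production.dump"
SCRATCH_DB = "restore_test"

def test_restore() -> None:
    # Recreate a scratch database, restore the dump into it, and run a
    # trivial sanity query; check=True makes any non-zero exit code raise.
    subprocess.run(["dropdb", "--if-exists", SCRATCH_DB], check=True)
    subprocess.run(["createdb", SCRATCH_DB], check=True)
    subprocess.run(["pg_restore", "--dbname", SCRATCH_DB, DUMP_PATH], check=True)
    subprocess.run(
        ["psql", "--dbname", SCRATCH_DB,
         "--command", "SELECT count(*) FROM projects;"],
        check=True,
    )

if __name__ == "__main__":
    test_restore()
    print("test restore succeeded")
```

Run something like this on a schedule; a backup that has never been restored is, for practical purposes, untested.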

Finally, it is worth highlighting that throughout this incident, GitLab.com showed that they are dedicated to transparency, even on their worst days.