Previously, we discussed how backup needs to be reinvented for a new era of cloud-native applications, and what to look for in a modern database backup and recovery solution. But even if you scrap the traditional approach and write your own scripts to handle the task, backup and recovery of next-generation applications and databases (Apache Cassandra, MongoDB, Amazon DynamoDB, Microsoft DocumentDB, Apache HBase, et al.) remain frustratingly difficult. Why is that?
The short version is this: in an eventually consistent, write-anywhere, non-relational database architecture, it is almost impossible to capture a consistent-state backup. Successful data recovery is even less likely!
There isn’t just one reason why this is the case. It comes down to the basic nature of a distributed architecture that’s designed to both scale and withstand node failures without downtime. As we’ve highlighted below, several challenges make backing up a distributed architecture difficult from the start:
- Data is written to an available node. The first landing point for data is not predictable, so there’s no way to catch data as it lands on the node.
- Then data is replicated to at least one other node before verification. That ensures a valid write, but also immediately creates a copy of the data.
- … then replicated to one or two more nodes for availability. This happens afterwards, and the same data can get updated in quick succession.
- … meaning there are 3 or 4 copies of data at any time.
- … and nodes are never immediately consistent.
- … and backups are hugely inefficient since every replica is copied.
- And when a node fails, the topology changes mid-flight.
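The write path described above can be sketched with a toy simulation (the `Node` class and cluster layout here are invented for illustration, not any real driver API): a write lands on whichever node is available, replication to the other replicas happens asynchronously, so a backup taken at any given instant sees replicas in different states.

```python
import random

class Node:
    def __init__(self, name):
        self.name = name
        self.data = {}      # key -> value as this replica currently sees it
        self.pending = []   # replication messages not yet applied

# A 4-node cluster writing each key to 3 replicas.
nodes = [Node(f"node{i}") for i in range(4)]

def write(key, value):
    """A write lands on any available node, then replicates asynchronously."""
    coordinator = random.choice(nodes)
    coordinator.data[key] = value
    # Two more replicas receive the update, but it sits in a queue
    # until they apply it (eventual consistency).
    for replica in random.sample([n for n in nodes if n is not coordinator], 2):
        replica.pending.append((key, value))

random.seed(7)
write("user:42", "v1")
write("user:42", "v2")   # a quick successive update may land on a different node

# A "backup" taken right now just copies every replica as-is:
snapshot = {n.name: dict(n.data) for n in nodes}
states = {s.get("user:42") for s in snapshot.values()}
print(states)  # the replicas disagree about what "user:42" is
```

Because at most two of the four replicas have applied anything at the moment of the snapshot, the backup always contains at least two different answers for the same key.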
Theoretically, a good DevOps team might be able to write scripts that back up the database successfully 80 to 90 percent of the time (cases like multi-node failures, changing topologies, and database compaction are complex to model, so the scripts are difficult to get right).
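To see why such scripts are fragile, here is a deliberately naive sketch (a toy model with invented node names, not any real tool's API): the script captures the topology once up front, so a node that fails and hands its data to a replacement mid-run is silently skipped.

```python
# Toy model: each node owns some keys; a naive backup script lists the
# cluster membership once, then copies node by node.
cluster = {
    "node0": {"a": 1, "b": 2},
    "node1": {"c": 3},
    "node2": {"d": 4},
}

def naive_backup(cluster):
    members = list(cluster)          # topology captured once, up front
    backup = {}
    for name in members:
        if name == "node1":
            # node1 dies mid-backup; its keys move to a replacement node...
            del cluster["node1"]
            cluster["node1-replacement"] = {"c": 3}
            continue                 # ...which is not in the stale member list
        backup.update(cluster.get(name, {}))
    return backup

backup = naive_backup(cluster)
print(sorted(backup))  # ['a', 'b', 'd'] -- key "c" never made it into the backup
```

The script "succeeds" in the sense that it runs to completion, which is exactly what makes the missing data so hard to notice until restore time.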
Unfortunately, backup is the “easy” part of the equation. Restoring a backup is where it really gets challenging. A successful restore is far more complicated than most people assume. It involves the following steps:
- Rebuilding the correct topology. Because each node was backed up individually, the database has to be restored to the same topology (6 nodes to 6 nodes, 12 nodes to 12 nodes, etc.). This matters for recovery across clouds and for test/dev and CI/CD use cases.
- Waiting for the database to repair and restore itself. The non-relational database architecture can withstand node failures and keep running, but a full restore with data reconciliation will bring back memories of the worst RAID rebuilds on slow drives.
- Deduplicating multiple inconsistent copies. The backup contains 3 or more copies of every piece of data, and those copies may be mutually inconsistent. They need to be deduplicated or reconciled first.
- But data history isn’t always available. In most distributed architectures, the oplog is a circular buffer that essentially eats itself as it goes! It may not be possible to even recover the necessary log data to reconcile the database.
- And there’s no reliable way to go back to a point-in-time. Even with the oplog data, it would be very difficult to reconstruct point-in-time data. At best, you’d get a recent, mostly accurate copy of the database.
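The reconciliation step above can be sketched as a last-write-wins merge, the strategy Cassandra-style stores apply internally (the row key and timestamps here are invented for illustration). With write timestamps the replica copies can be merged, but note that recovering an *earlier* point in time would require the overwritten history that a circular log may have already discarded.

```python
# Three replica copies of the same row, captured by a backup at slightly
# different moments. Each replica stores (value, write_timestamp).
replica_copies = [
    {"user:42": ("v1", 100)},   # stale replica
    {"user:42": ("v2", 105)},   # saw the later write
    {"user:42": ("v1", 100)},   # another stale copy
]

def reconcile(copies):
    """Last-write-wins merge: for each key, keep the highest-timestamp value."""
    merged = {}
    for copy in copies:
        for key, (value, ts) in copy.items():
            if key not in merged or ts > merged[key][1]:
                merged[key] = (value, ts)
    return merged

merged = reconcile(replica_copies)
print(merged["user:42"])  # ('v2', 105) -- the latest write wins
# Restoring to a point in time between ts=100 and ts=105 would need the
# overwritten value ("v1" at ts=100) plus the log of every write since,
# which is exactly the history a circular oplog throws away.
```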
In the real world, restores can take days or weeks, if the data can be recovered at all. And as the recent GitLab outage showed, even technically savvy organizations find it difficult to get right. Human error can be as fatal to a database as a natural disaster if there isn’t a reliable backup and recovery process in place.