Several customer conversations at MongoDB World, held in Chicago this year, reaffirmed my conviction that there is a burning need for an enterprise-class backup and recovery solution for MongoDB databases. Although MongoDB is one of the top 5 most popular databases according to DB-Engines, its backup and recovery capabilities are inadequate. The ecosystem around the database is not yet mature either, so viable solutions are scarce. This lack of enterprise-ready data protection has led several customers to build scripted solutions from scratch on their own. Most of the time, such efforts consume significant time and resources without producing a resilient and reliable solution.

In fact, I chatted with a large financial institution and a healthcare organization, both of which were struggling to handle the backup and archival of their large MongoDB environments. A technology organization, on the other hand, was using the backup-as-a-service (Cloud Manager) provided by MongoDB but was concerned about data security, astronomical costs and long recovery times. In this blog, I will explain the options available to an organization for backing up and recovering its MongoDB data.

1. Manual Scripted Solutions

These solutions combine native MongoDB utilities with scripts to transfer data to secondary storage. The scripts (typically built around mongodump) are customized for each MongoDB cluster and require significant operational effort to scale or to adapt to topology changes (such as the addition or removal of nodes in your MongoDB database). This is a hidden cost that most organizations overlook when they go down this road. One of our existing customers realized that the upkeep of their scripted solution required so much operational overhead that it was distracting them from their primary objective of providing a supply chain management application to their customers.

Furthermore, these scripts are not resilient to failure scenarios, such as the failure of a node (primary or secondary) or intermittent network issues. Finally, recovery (the whole point of "backup") is a manual process. It is therefore time consuming, which results in very high application downtime, and it carries the risk of data loss from any bugs in the scripts. Overall, these solutions work when the MongoDB environment is small and the application can tolerate some data loss. Some of the key issues that these solutions face are:

  • Lack of consistent backup for sharded clusters
  • Database needs to be offline when the snapshots are taken (sometimes achieved through a hidden node that results in additional licensing cost)
  • Both backup and recovery fail under node failure and other infrastructure failures
  • Recovery process is manual and requires verifications, which increases the recovery time
  • Collection-level recovery is a manual, time-consuming process
  • Recovery to unlike topologies to refresh downstream test/dev clusters is not available
  • Maintaining the scripts to ensure reliable backup and recovery is a significant operational overhead
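
To make the maintenance burden concrete, here is a minimal sketch of what such a script often looks like. It is an illustration under assumptions, not a recommended implementation: the host names, backup path and retry settings are hypothetical, and it relies only on the standard mongodump flags --host, --out, --gzip and --oplog. Note that the host list is hard-coded, so every topology change means editing the script, and that each member is dumped independently, so nothing here produces a consistent backup across shards.

```python
import subprocess
import time
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical topology: one entry per replica set member to dump.
# Adding or removing nodes means editing this list by hand.
HOSTS = ["shard0-db1.example.internal:27018", "shard1-db1.example.internal:27018"]
BACKUP_ROOT = "/mnt/backup/mongodb"  # hypothetical secondary-storage mount


def build_mongodump_cmd(host, out_dir, use_oplog=True):
    """Return the mongodump argument list for a single replica set member."""
    cmd = ["mongodump", "--host", host, "--gzip", "--out", out_dir]
    if use_oplog:
        # --oplog records writes that happen while the dump runs; it is
        # meant for replica set members, not for dumps taken through mongos.
        cmd.append("--oplog")
    return cmd


def run_backup(hosts, backup_root, retries=3, wait_seconds=30.0):
    """Dump each host in turn with simple retries; return {host: succeeded}."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    results = {}
    for host in hosts:
        out_dir = str(Path(backup_root) / stamp / host.replace(":", "_"))
        succeeded = False
        for attempt in range(1, retries + 1):
            try:
                completed = subprocess.run(build_mongodump_cmd(host, out_dir))
                succeeded = completed.returncode == 0
            except OSError:
                # mongodump is not installed or cannot be executed.
                succeeded = False
            if succeeded:
                break
            if attempt < retries:
                time.sleep(wait_seconds)  # crude back-off before retrying
        results[host] = succeeded
    return results
```

A cron job calling run_backup(HOSTS, BACKUP_ROOT) is the typical pattern. Verification, restore, retention and alerting would each need further scripts of their own, which is where the operational overhead accumulates.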

Most enterprises that use these scripted methods as a temporary quick fix eventually realize the limitations and start looking for better solutions. It is like driving your car with a flat tire: it can keep you going, but you can neither travel at the speed you want nor avoid the risk of disaster.

2. Backup as a Service Solution (Cloud Manager)

MongoDB (the company) itself provides a managed service to back up MongoDB databases in the public cloud. In addition to being exorbitantly costly, the managed backup service stores customers’ data in the public cloud. Transferring backup data over the WAN may not work for customers who deploy MongoDB on-premise or who need to keep their sensitive data in-house. Furthermore, the service imposes significant limits on the amount of data per shard. If you are using this service, make sure you understand how the pricing is calculated. According to the documentation, pricing is based on the uncompressed data size, which may be several times the compressed on-disk size if the WiredTiger storage engine is used. Some of the key issues that this solution faces are:

  • Relevant only for customers that are deployed in public cloud
  • Data is stored by the vendor and hence accessibility is limited and security considerations exist
  • Recovering data on request takes longer than self-service capable solutions
  • Cost is generally 10-12X higher than any other solution, especially if the customer uses WiredTiger
  • Longer retention times and shorter backup intervals than the defaults result in additional costs
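
To see why the uncompressed-versus-compressed distinction matters, the following sketch estimates the data size that would be billed from the size you see on disk. It is a back-of-the-envelope aid only: the function name, the 500 GB figure and the 4x compression ratio are hypothetical, and actual prices should be taken from the vendor's current documentation.

```python
def billable_gb(on_disk_gb, compression_ratio):
    """Estimate the uncompressed (billed) size from the compressed on-disk size.

    compression_ratio is uncompressed size divided by compressed size,
    so a value of 4 means the data shrinks to a quarter on disk.
    """
    if on_disk_gb < 0 or compression_ratio < 1:
        raise ValueError("size must be >= 0 and compression ratio must be >= 1")
    return on_disk_gb * compression_ratio


# Hypothetical example: 500 GB on disk at an assumed 4x compression ratio.
print(billable_gb(500, 4))  # prints 2000
```

In this hypothetical case, a cluster that looks like 500 GB on disk would be billed as 2,000 GB, so it is worth measuring your own compression ratio before estimating costs.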

3. DIY Backup using Ops Manager

MongoDB (the company) also offers a DIY backup & recovery solution for customers that license the MongoDB Enterprise Advanced version of the software. Though using the MongoDB on-premise backup service is possible, it is overly complex to deploy and operationalize (the deployment diagram speaks for itself!). Enterprises need to deploy multiple servers, additional database replica sets (with additional licensing cost) and roughly 10x or more storage capacity (relative to the database being backed up) to enable on-premise backups. Overall, the on-premise backup service is more theoretical than practical, as it requires significant CAPEX and OPEX investments, including:

  • Complexity of deploying multiple databases
  • Cost of additional infrastructure (servers and storage)
  • Cost of licensing additional MongoDB nodes
  • Risk of failed backups when the secondary node from which the backup is taken fails
  • Siloed backup infrastructure that serves only MongoDB databases
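
As a rough sizing aid, the storage overhead described above can be worked out as follows. This is a sketch under stated assumptions: the 10x multiplier is the approximate lower bound quoted in this post, while the function name and the 5 TB database size are hypothetical.

```python
def backup_storage_tb(database_tb, multiplier=10):
    """Estimate backup storage as a multiple of the protected database size."""
    if database_tb < 0 or multiplier <= 0:
        raise ValueError("database size must be >= 0 and multiplier must be > 0")
    return database_tb * multiplier


# Hypothetical example: a 5 TB database at the roughly 10x figure cited above.
print(backup_storage_tb(5))  # prints 50
```

Because the figure is "10x or more", treat the result as a minimum when budgeting for servers and storage.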

4. Software-only Datos IO RecoverX Solution

Datos IO RecoverX is an enterprise-class data protection and management software product that is used by several Fortune 500 enterprises. Being a software-only product, it may be deployed flexibly on-premise on physical servers or virtual machines, or in the public cloud. It works with any existing backup storage that customers already use (any NFS or object storage) and hence has minimal additional infrastructure requirements. Beyond its simplicity in deployment and use, RecoverX is based on the CODR™ architecture, which is highly failure resilient. Node failures and primary failovers are handled automatically without any impact on the backup and recovery process. The key benefits are:

  • Software-only solution that may be deployed on-premise or in public cloud
  • Minimal infrastructure requirements (1X storage and compute)
  • Continuous data protection for any point-in-time recovery
  • Collection-level backup and recovery
  • Recovery to same or different topology cluster for test/dev environment refresh
  • Easy to deploy using AMI or Docker Image
  • Horizontal solution across multiple data sources including MongoDB, Cassandra, HDFS, SQL

It is important to understand the risk of data loss when using a new database such as MongoDB and to implement a data protection solution that meets your RPO and RTO objectives. A few choices exist, but most of them have drawbacks in either feature/functionality or cost/complexity. Datos IO RecoverX is a next-generation data protection solution that caters to the needs of application owners and DevOps, and takes away the operational hassles of deploying and managing protection infrastructure. Most importantly, it remains reliable and scalable even when nodes fail, which keeps recovery time (RTO) to a minimum.