The combined force of social, mobile, cloud and the Internet of Things has created an explosion of big data that powers a new class of hyper-scale, distributed, data-centric applications such as customer analytics and business intelligence. To meet the storage and analytics requirements of these high-volume, high-ingestion-rate, real-time applications, enterprises have moved to big data platforms such as Hadoop.
Although HDFS offers replication and local snapshots, it lacks the point-in-time backup and recovery capabilities required to achieve and maintain enterprise-grade data protection. Given the large scale of Hadoop clusters, both in node count and data set size, and their use of direct-attached storage, traditional backup and recovery products are ill-suited to big data environments, leaving businesses vulnerable to data loss.
To achieve enterprise-grade data protection on Hadoop platforms, there are five key considerations to keep in mind:
- Replication is not the same as point-in-time backup. Although HDFS, the Hadoop filesystem, offers native replication, it lacks point-in-time backup and recovery capabilities. Replication provides high availability but no protection from the logical or human errors that can cause data loss and, ultimately, a failure to meet compliance and governance standards.
- Data loss is as real as it always was. Studies suggest that more than 70 percent of data loss events are triggered by human errors such as fat-finger mistakes, similar to what brought down Amazon S3 earlier this year. Filesystems such as HDFS do not protect against such accidental deletion of data. You still need filesystem backup and recovery, and you need it at a much more granular level (directory-level backups) and at a larger deployment scale: hundreds of nodes and petabytes of filesystem data.
- Reconstruction of data is too expensive. In theory, data in analytical stores such as Hadoop can be reconstructed from the original data sources, but doing so takes a very long time and is operationally inefficient. The data transformation tools and scripts that were used initially may no longer be available, or the expertise may be lost. The data itself may also be lost at the source, leaving no fallback option. In most scenarios, reconstruction may take weeks to months and result in longer-than-acceptable application downtime.
- Application downtime should be minimized. Today, several business applications embed analytics and machine learning microservices that rely on data stored in HDFS. Any data loss can limit what such applications can do and result in negative business impact. Granular, file-level recovery is essential to minimize application downtime.
- Hadoop data lakes can quickly grow to multi-petabyte scale. It is financially prudent to archive data from Hadoop clusters to a separate, robust object storage system that is more cost-effective at petabyte scale.
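To make the first and fourth considerations concrete, here is a minimal Python sketch of why replication alone does not protect against accidental deletion, and how a directory-level point-in-time copy allows a single file to be restored. It is a toy in-memory model, not HDFS and not any vendor's product: the `ToyCluster` class, its methods, and the file paths are all hypothetical names invented for illustration.

```python
class ToyCluster:
    """Hypothetical in-memory stand-in for a replicated filesystem (not HDFS)."""

    def __init__(self, replicas=3):
        # Each replica maps a file path to its contents.
        self.replicas = [dict() for _ in range(replicas)]
        # Point-in-time copies, kept apart from the live replicas.
        self.backups = {}

    def write(self, path, data):
        # Replication: every write goes to every replica.
        for replica in self.replicas:
            replica[path] = data

    def delete_dir(self, prefix):
        # Replication also applies deletes to every replica.
        for replica in self.replicas:
            for path in [p for p in replica if p.startswith(prefix)]:
                del replica[path]

    def backup_dir(self, label, prefix):
        # Directory-level backup: copy the files under prefix as they are now.
        live = self.replicas[0]
        self.backups[label] = {p: d for p, d in live.items() if p.startswith(prefix)}

    def restore(self, label, path=None):
        # Restore the whole backup, or a single file when path is given.
        saved = self.backups[label]
        selected = saved if path is None else {path: saved[path]}
        for p, d in selected.items():
            self.write(p, d)

    def read(self, path):
        return self.replicas[0].get(path)


cluster = ToyCluster()
cluster.write("/warehouse/sales/2017-01.csv", "jan")
cluster.write("/warehouse/sales/2017-02.csv", "feb")
cluster.backup_dir("nightly", "/warehouse/sales/")

# A fat-finger delete removes the data from all replicas at once.
cluster.delete_dir("/warehouse/sales/")
survivors = sum(len(replica) for replica in cluster.replicas)

# Granular recovery: bring back only the file the application needs.
cluster.restore("nightly", "/warehouse/sales/2017-02.csv")
```

After the delete, no replica holds a copy of the data, so the extra replicas offer no way back. The separately stored backup does, and restoring a single file rather than the whole directory is what keeps application downtime short.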
If you are debating whether you need a solid backup and recovery plan for Hadoop, think about what it would mean if the datacenter where Hadoop is running went down, if part of the data was accidentally deleted, or if applications went down for a long period while data was being regenerated. Would the business stop? Would you need that data to be recovered and accessible in a short period of time? If so, it is time to think about fully featured backup and recovery software that can work at scale. You also need to consider how it can be deployed: on-premises or in the public cloud, and across enterprise data sources.