I recently returned from my first Strata Hadoop event in New York! It was amazing to see the excitement and innovation from vendors across the board: enterprise products that make big data operationally ready, and a slew of applications that help developers perform advanced analytics on large-scale data. Here are some insights I gathered after talking to more than 100 people on the event floor last week.

First, a majority of organizations now use big data applications as part of critical business workflows. This means big data applications are directly tied either to business growth (customer / revenue) or to business operations (management / IT), which makes the availability of these applications, and of the data underneath them, quite important. Granted, not all data has equal value, and some data that lands on a Hadoop (HDFS) system is transient in nature. But a significant percentage of the data is important for businesses to run effectively, so its recoverability needs to be thought through carefully. Critical use cases include recovery from data loss, compliance, governance, and recovery to test/development environments.

Second, and to my surprise, a lot of people I met did not fully comprehend the difference between replication and backup. The prevalent thought was that Hadoop (or even S3) is ‘resilient enough’ given native replication. Although the Hadoop (HDFS) file system offers native replication, it lacks point-in-time backup and recovery capabilities. Replication provides high availability, but no protection from logical or human error: an accidental delete or a bad job is faithfully copied to every replica, resulting in data loss and in failure to meet compliance and governance standards. Oftentimes, for analytical data stores, the data can be reconstructed from the original source, but that takes a long time, is operationally inefficient, and still leaves a window of data loss.
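To make the distinction concrete, here is a toy Python sketch; the `ReplicatedStore` class is purely illustrative (not HDFS internals). It shows why replication alone cannot undo a mistaken delete, while a point-in-time backup can:

```python
import copy

# Toy model: replication propagates every change (including mistakes) to
# all replicas, while a point-in-time backup preserves an immutable snapshot.

class ReplicatedStore:
    """Minimal stand-in for a replicated file system (not real HDFS)."""
    def __init__(self, replicas=3):
        self.replicas = [dict() for _ in range(replicas)]

    def write(self, path, data):
        for r in self.replicas:          # every replica sees the write
            r[path] = data

    def delete(self, path):
        for r in self.replicas:          # ...and every replica sees the delete
            r.pop(path, None)

    def read(self, path):
        return self.replicas[0].get(path)

store = ReplicatedStore()
store.write("/data/events.log", "clickstream records")

backup = copy.deepcopy(store.replicas[0])   # point-in-time backup at T0

store.delete("/data/events.log")            # human error after T0

print(store.read("/data/events.log"))       # None: all replicas lost it
print(backup["/data/events.log"])           # backup still has the data
```

The delete wipes all three replicas at once, which is exactly why highly available is not the same as recoverable.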

Third, given the large scale of Hadoop environments, a manually scripted solution for backup and recovery is complex to build and maintain. I met a customer that had built such a solution: node-by-node snapshots, plus scripts to move the data to a classical backup storage target such as EMC Data Domain. Not only did it prove to be error-prone and flaky, but it was also expensive due to storage costs. Given the large scale (node count and data set sizes) and the use of direct-attached storage in Hadoop clusters, traditional backup and recovery software products do not work, leaving a critical data protection gap. The block-based inline deduplication used by all traditional products does not scale to petabytes, resulting in performance bottlenecks for both backup and recovery. Further, the use of media servers creates choke points in the data path.

At Datos IO, we have built a simple and scalable solution for backup and recovery of Hadoop (HDFS) clusters at petabyte scale. RecoverX is the industry's first data protection software for customer-centric applications (analytics, IoT, etc.) that supports non-relational (NoSQL) databases such as Apache Cassandra, MongoDB, and Couchbase, as well as big data file systems such as HDFS. Some of the key architectural design merits of RecoverX are:

  • Control-plane only – RecoverX orchestrates the movement of data without sitting in the data path. Data moves directly from the HDFS cluster (Datanodes) to secondary or backup storage, which means there is no media-server bottleneck in the data path.
  • Locality-aware backup – Data is streamed from Datanodes to backup storage and load-balanced across the entire cluster. This locality awareness ensures maximum parallelism in data movement and improved backup performance.
  • Semantic deduplication – This is an industry-first capability that RecoverX brings to reduce backup storage requirements. If there are multiple duplicate files in the HDFS cluster, only a single copy is written to backup storage. This logical, file-level deduplication scales to petabytes and brings massive storage efficiency.
  • Flexible storage options – Object storage (e.g. AWS S3) is the most cost-efficient storage at scale. RecoverX works with multiple storage options, including any existing NFS storage, on-premises object storage (e.g. Cloudian, Scality, Igneous, IBM Cleversafe), or cloud storage (AWS S3).
  • Failure handling – Failures such as node failures or intermittent network and storage failures are the norm in large-scale distributed environments. RecoverX has built-in failure handling to retry and restart backup and recovery from a known good state. This is truly enterprise-grade functionality.
  • Scale-out software – RecoverX software can be deployed in a clustered configuration and scales out to meet the requirements of large applications.
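As a rough illustration of the semantic deduplication idea above, here is a minimal Python sketch. The `semantic_dedup` function and the in-memory dicts are my own assumptions for illustration, not the RecoverX implementation: each distinct file content is stored once, while a manifest keeps every logical path recoverable.

```python
import hashlib

def semantic_dedup(files):
    """File-level deduplication sketch: given {path: content}, return a
    content store keyed by hash plus a manifest mapping each logical path
    to its content hash. Duplicate files are stored only once."""
    store, manifest = {}, {}
    for path, content in files.items():
        digest = hashlib.sha256(content.encode()).hexdigest()
        if digest not in store:          # copy the content only the first time
            store[digest] = content
        manifest[path] = digest          # every path remains recoverable
    return store, manifest

# Hypothetical cluster contents: two of the three files are exact duplicates.
files = {
    "/user/a/report.csv": "id,score\n1,0.9",
    "/user/b/report_copy.csv": "id,score\n1,0.9",   # exact duplicate
    "/user/c/other.csv": "id,score\n2,0.1",
}
store, manifest = semantic_dedup(files)
print(len(store))     # 2 unique contents backed up for 3 logical files
```

Because deduplication happens at the level of whole files rather than storage blocks, the bookkeeping stays small and the approach can scale to very large data sets.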
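The failure-handling bullet can likewise be sketched as a checkpoint-and-resume loop. This is a generic pattern under my own assumptions (`backup_with_resume` and `flaky_copy` are hypothetical, not the RecoverX API): after a transient failure, the job retries from the last known good chunk rather than restarting from scratch.

```python
def backup_with_resume(chunks, copy_fn, max_retries=3):
    """Resumable backup sketch: checkpoint after each chunk so a retry
    restarts from the last known good state, not from the beginning."""
    checkpoint = 0                      # index of the next chunk to copy
    attempts = 0
    while checkpoint < len(chunks):
        try:
            copy_fn(chunks[checkpoint])
            checkpoint += 1             # advance the known good state
            attempts = 0
        except IOError:
            attempts += 1               # transient failure: retry this chunk
            if attempts > max_retries:
                raise                   # persistent failure: surface it
    return checkpoint

# Simulate one transient network failure in the middle of the backup.
calls = {"n": 0}
def flaky_copy(chunk):
    calls["n"] += 1
    if calls["n"] == 2:                 # second copy attempt fails once
        raise IOError("network blip")

done = backup_with_resume(["c0", "c1", "c2"], flaky_copy)
print(done)  # 3: all chunks copied despite the transient failure
```

Only the failed chunk is retried; chunks `c0` and `c2` are each copied exactly once, which is what makes resume-from-checkpoint cheap at scale.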

If you are running business-critical applications on Hadoop, think about data recoverability under different disaster scenarios. You can leverage our team of experts to help with local, point-in-time backup and recovery of your clusters in the most optimized and economical manner. We also have an early access program (“Rebellion Early Access Program”) for organizations interested in testing RecoverX in their own environments. Send us an email at pm@datos.io for more information on this program.