18 Hadoop Features You Can't Live Without

Guest Post: Sameer Nori, Sr. Product Marketing Manager, MapR

 

If you’ve been looking into an Apache Hadoop distribution for your big data, you’ve probably been hammered with conflicting opinions about what really matters. This article details eighteen specific characteristics, in four categories, that are critical to your final Hadoop distribution selection.

 

Performance

 

1. Data Ingest - Streaming Writes

Your data needs to arrive in your Hadoop cluster as quickly as possible, which means streaming writes should handle the task of loading and unloading data. Many Hadoop distributions rely on batch or semi-streaming processes that introduce inefficiencies as data volumes grow.
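The difference can be sketched in plain Python, with a local file standing in for an NFS-mounted cluster path (the path and record source here are hypothetical, not any product’s API):

```python
def stream_ingest(records, out_path):
    """Streaming write: land each record as soon as it arrives,
    so the data is visible in the cluster almost immediately."""
    with open(out_path, "a") as f:
        for rec in records:
            f.write(rec + "\n")
            f.flush()  # visible now, not at end-of-batch

def batch_ingest(records, out_path):
    """Batch write: buffer everything first; nothing lands
    until the whole batch closes, which adds latency at scale."""
    buffered = list(records)  # nothing visible to readers yet
    with open(out_path, "a") as f:
        f.write("\n".join(buffered) + "\n")
```

Both functions produce the same file contents; the difference is when each record becomes visible to downstream readers.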

 

2. Distributed Metadata Architecture

The original Hadoop architecture utilized a single NameNode to manage the cluster’s metadata. This was quickly recognized as a single point of failure (SPOF). Some distributions get around this by using a secondary NameNode as a backup in the case of a NameNode failure. While this is a step up from the default, it isn’t enough to qualify your system as high availability. Look for a distribution that instead uses a distributed metadata architecture. This removes the NameNode and completely eliminates the problem of a SPOF.
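One common way to distribute metadata (a simplified sketch of the general idea, not MapR’s actual implementation; the node names are hypothetical) is to hash each file path to one of several metadata shards, so no single node owns the whole namespace:

```python
import hashlib

METADATA_NODES = ["node-a", "node-b", "node-c"]  # hypothetical shard owners

def metadata_owner(path, nodes=METADATA_NODES):
    """Map a file path to the node holding its metadata.
    Because no single node owns the namespace, losing one node
    loses one shard's worth of metadata, not the whole cluster's."""
    digest = hashlib.sha256(path.encode()).digest()
    return nodes[digest[0] % len(nodes)]
```

The mapping is deterministic, so any client can locate a file’s metadata without consulting a central NameNode.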

 

3. High Performance with Consistent Low Latency

Low latency is important, and so is its consistency. Refer to the graph below:

 

[Figure: hadoop1.png, a latency comparison of two Hadoop distributions]

 

In this comparison of two Hadoop distributions, you can see that in some cases you are subject to unpredictable volatility. Be sure that your distribution delivers high performance and low latency on a consistent basis.
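Consistency is easy to quantify: two systems can share the same average latency yet have very different tails. A quick sketch with made-up sample numbers:

```python
import statistics

def p99(samples):
    """99th-percentile latency of a list of samples (ms)."""
    ordered = sorted(samples)
    idx = min(len(ordered) - 1, (99 * len(ordered)) // 100)
    return ordered[idx]

steady = [10] * 99 + [12]   # consistent distribution
spiky  = [8] * 99 + [210]   # same mean (10.02 ms), ugly tail
```

Here `statistics.mean` reports 10.02 ms for both, but the 99th percentile is 12 ms for the steady system and 210 ms for the spiky one, which is what users actually feel.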

 

4. Access to Public Cloud Platforms

To ensure you can scale your data with the ever-changing demands of technology, it is important that your distribution can work with public cloud platforms such as Amazon Web Services or Google Compute Engine.

 

Dependability

 

5. High Availability (HA)

Your chosen distribution should include High Availability features by default in your Hadoop architecture, including a distributed metadata architecture. You should be confident that your system will be self-healing in the event of multiple failures.

 

6. MapReduce HA

MapReduce jobs should continue running regardless of a system failure. Failover should be automated; keeping your job and task trackers running should not depend on manual restarts.

 

7. Rolling Upgrades

One of the best characteristics of Hadoop is that it constantly evolves with the needs of the big data industry – which means you should anticipate many upgrades. Rolling upgrades allow you to take advantage of new system improvements without incurring any downtime. Do not settle for a distribution that doesn’t include this in its architecture.

 

8. Data and Metadata Replication

By default, Hadoop will replicate your data three times. Use a distribution that not only replicates Hadoop’s file chunks, but also table regions and metadata. For additional protection, you should store one of the three copies on a different rack.
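Rack-aware placement can be sketched as: put the first replica on one node, then force at least one of the remaining copies onto a different rack. This is a simplification of the real HDFS placement policy, for illustration only:

```python
def place_replicas(nodes_by_rack, n=3):
    """Pick n nodes for a block's replicas, guaranteeing that at
    least one replica lands on a different rack than the first,
    so a whole-rack failure cannot take out every copy.
    nodes_by_rack: {rack_name: [node, ...]}"""
    racks = sorted(nodes_by_rack)
    chosen = [nodes_by_rack[racks[0]][0]]        # first replica
    for rack in racks[1:]:                       # second replica off-rack
        if nodes_by_rack[rack]:
            chosen.append(nodes_by_rack[rack][0])
            break
    for rack in racks:                           # fill remaining copies
        for node in nodes_by_rack[rack]:
            if len(chosen) == n:
                return chosen
            if node not in chosen:
                chosen.append(node)
    return chosen
```

With two racks, one of the three copies always ends up on the second rack, which is exactly the protection the section above recommends.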

 

9. Point-In-Time Snapshots

Not all Hadoop snapshots are alike. Use a distribution that provides true point-in-time snapshots, meaning the snapshot captures an accurate representation of the data at the moment it is taken. Many other distributions use the default HDFS snapshot system, which only captures data in closed files. Also be sure that your snapshot system is compatible with all Hadoop applications without needing to access the HDFS API. Finally, the snapshot system of an optimal distribution won’t require duplicating your data: by sharing the same storage as your live information, snapshots have minimal impact on performance and scalability.
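The “no duplication” property comes from copy-on-write: a snapshot shares the live storage and only preserves old values when data changes afterwards. A toy dictionary-backed illustration of the idea (real systems do this at the block level):

```python
class SnapshotStore:
    """Toy copy-on-write store: taking a snapshot costs nothing;
    old values are preserved only when a key changes afterwards."""
    def __init__(self):
        self.live = {}
        self.snapshots = {}  # name -> {key: preserved pre-snapshot value}

    def snapshot(self, name):
        self.snapshots[name] = {}  # shares live data; no copy made

    def write(self, key, value):
        # copy-on-write: stash the old value in every open snapshot
        for snap in self.snapshots.values():
            if key not in snap:
                # None marks "key did not exist at snapshot time"
                snap[key] = self.live.get(key)
        self.live[key] = value

    def read_snapshot(self, name, key):
        snap = self.snapshots[name]
        if key in snap:
            return snap[key]         # value preserved at write time
        return self.live.get(key)    # unchanged: still shares live copy
```

A snapshot taken at time t0 keeps returning the t0 values even as the live data moves on, without ever copying unchanged data.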

 

10. Mirroring Disaster Recovery

Your system should anticipate catastrophic system failures to make recovery simple. Mirroring is the best preventative measure you can take to make disaster recovery not so disastrous. Your Hadoop distribution’s mirroring should be asynchronous and perform auto-compressed, block-level data transfer of differential changes. Your system should mirror both its data and metadata to ensure that applications are able to restart immediately upon site failure.
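The “compressed, block-level transfer of differential changes” can be sketched as: hash each block on both sides, and ship only the blocks whose hashes differ, compressed. A minimal illustration (the 4-byte block size is purely for readability; real systems use far larger blocks):

```python
import hashlib
import zlib

BLOCK = 4  # tiny block size for illustration only

def _blocks(data):
    return [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]

def differential_transfer(old, new):
    """Return only the (block_index, compressed_block) pairs that
    changed between old and new, mimicking block-level differential
    mirroring: unchanged blocks are never re-sent."""
    old_hashes = [hashlib.sha256(b).hexdigest() for b in _blocks(old)]
    changed = []
    for i, blk in enumerate(_blocks(new)):
        h = hashlib.sha256(blk).hexdigest()
        if i >= len(old_hashes) or old_hashes[i] != h:
            changed.append((i, zlib.compress(blk)))
    return changed
```

If only one block of a file changed, only that one compressed block crosses the wire to the mirror site.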

 

Manageability

 

11. Comprehensive Management Tools

All Hadoop distributions have their own set of management tools. Take the necessary time to evaluate the management tools that any given distribution has to offer. Look at the breadth of the management toolset and determine whether it covers all of your management necessities.

 

12. Heat Maps, Alarms, Alerts

At any time you should easily be able to get a good grip on the condition of your Hadoop system. The monitoring capabilities of your Hadoop architecture should include heat maps, alerts and alarms that let you see the health, memory, CPU and other important metrics of your nodes at a glance.
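The logic behind a heat map cell or an alarm is just a threshold check per metric. A sketch with illustrative, made-up threshold values:

```python
# Hypothetical alert thresholds, in percent utilization.
THRESHOLDS = {"cpu": 90, "memory": 85, "disk": 80}

def node_status(metrics):
    """Return the heat-map color for a node plus which metrics
    breached their thresholds, so an alarm can name the cause."""
    breaches = [name for name, limit in THRESHOLDS.items()
                if metrics.get(name, 0) >= limit]
    return ("red", breaches) if breaches else ("green", [])
```

A dashboard would run this per node and render one colored cell each, which is what lets you see cluster health at a glance.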

 

13. Integration with REST API

In order to keep your connectivity open to different open source and commercial Hadoop tools, your architecture should integrate via a REST API.
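A REST API means any tool that can issue an HTTP request can drive the cluster. A sketch of building such a call; the host, endpoint path, and parameter names here are hypothetical, not any specific product’s API:

```python
from urllib.parse import urlencode

def build_rest_url(host, resource, params, api_root="/rest"):
    """Build a management REST call (e.g. listing node metrics).
    Any language or tool with an HTTP client can then invoke it,
    which is the integration point the section above describes."""
    query = urlencode(sorted(params.items()))
    return "https://{}{}/{}?{}".format(host, api_root, resource, query)
```

For example, a monitoring script could request per-node CPU columns with a single GET against the resulting URL.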

 

Data Access

 

14. File System Access (POSIX)

A POSIX file system that supports random read/write operations, as well as NFS access on Hadoop, opens your system up to far greater capabilities than the default HDFS offers.

 

15. File I/O

Some distributions make their system’s file input/output append-only. Use a distribution that supports full read/write file I/O.
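The difference is easy to demonstrate: a POSIX-style file lets you seek back into an existing file and overwrite bytes in place, which an append-only store rejects. A local-disk sketch of the random-write operation:

```python
def overwrite_in_place(path, offset, data):
    """Random write: seek into an existing file and overwrite bytes.
    This works on any POSIX file system (including an NFS-mounted
    cluster path); append-only file I/O cannot modify data once
    it has been written."""
    with open(path, "r+b") as f:   # r+b: read/write, no truncation
        f.seek(offset)
        f.write(data)
```

This is exactly the operation that random read/write support in the previous section makes possible on cluster data.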

 

16. Developer Productivity Tools

Considering your developers will be working frequently within your Hadoop platform, use a distribution that makes it easy for them to do so. They shouldn’t have to go through administrators for simple tasks like creating tables. Your distribution should also provide developer tools, such as those that let them work directly with data on the cluster.

 

17. Security Features

Ensuring your data’s security is a high priority. Your distribution should include fine-grained permissions on files, directories, jobs, queues and administrative operations. Access control lists should also be available by default for tables, columns and column families.
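Fine-grained control comes down to checking each (principal, resource) pair against an explicit grant before allowing an action. A minimal sketch with hypothetical principals and table paths:

```python
# Hypothetical ACL table: (principal, resource) -> allowed actions.
ACLS = {
    ("analyst", "/tables/sales"): {"read"},
    ("etl_job", "/tables/sales"): {"read", "write"},
    ("admin",   "/tables/sales"): {"read", "write", "grant"},
}

def is_allowed(principal, resource, action, acls=ACLS):
    """Deny by default: an action is permitted only if it was
    granted explicitly for this principal on this resource."""
    return action in acls.get((principal, resource), set())
```

The same pattern extends to the column and column-family granularity the section above calls for, by using finer-grained resource keys.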

 

18. Wire-Level Authentication with Kerberos and Native

All communication between nodes, and between clients and the cluster, should be authenticated on the wire. Look for a distribution that supports Kerberos as well as a native authentication mechanism, so that organizations without existing Kerberos infrastructure can still secure their clusters.


Evaluate the priority of each of these essentials and take a look at the Hadoop distributions you are considering. There are many other features beyond the eighteen we have listed, but if your distribution doesn’t include these as a minimum, you may be jeopardizing the productivity of your big data investment.

Labels: MapR
About the Author
Stephen Spector is a HP Cloud Evangelist promoting the OpenStack based clouds at HP for hybrid, public, and private clouds . He was previous...
The opinions expressed above are the personal opinions of the authors, not of HP. By using this site, you accept the Terms of Use and Rules of Participation.