
netneladno

Here are the recommended specifications for DataNode/TaskTrackers in a balanced Hadoop cluster:
12-24 1-4TB hard disks in a JBOD (Just a Bunch Of Disks) configuration
2 quad-/hex-/octo-core CPUs, running at least 2-2.5GHz
64-512GB of RAM
Bonded Gigabit Ethernet or 10Gigabit Ethernet (the more storage density, the higher the network throughput needed)
“Commodity hardware,” my ass.
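For what it's worth, a JBOD layout means each disk gets its own mount point and is handed to HDFS individually, with no RAID underneath. Here is a minimal sketch of how that shows up in configuration; the /data/1 ... /data/12 mount points are hypothetical, and dfs.datanode.data.dir is the standard Hadoop 2.x property (dfs.data.dir in 1.x):

```java
// Sketch: pointing a DataNode at twelve individually mounted JBOD disks.
// HDFS round-robins block writes across these directories by default,
// which is why no RAID layer is needed.
import org.apache.hadoop.conf.Configuration;

public class JbodDataDirs {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    StringBuilder dirs = new StringBuilder();
    for (int i = 1; i <= 12; i++) {          // one entry per physical disk
      if (i > 1) dirs.append(",");
      dirs.append("/data/").append(i).append("/dfs/dn");
    }
    conf.set("dfs.datanode.data.dir", dirs.toString());
    System.out.println(conf.get("dfs.datanode.data.dir"));
  }
}
```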

netneladno

Cloudera introduced CDH 4* and Hortonworks introduced HDP 1*, both timed for the recent Hadoop Summit.
CDH 4 is based mainly on Hadoop 2.0, which Cloudera says it has tested extensively.
HDP 1 is based on Hadoop 1.0, on the theory that nobody has properly tested Hadoop 2.0, which is still characterized as “alpha”.
CDH 4 boasts sub-second NameNode failover.
Hortonworks is partnering with third parties such as VMware to address the high-availability problems caused by failover potentially taking several minutes.
Hadoop 2.0 and CDH 4 also incorporate improvements to NameNode scalability, HDFS (Hadoop Distributed File System) performance, HBase performance, and HBase functionality.
As does CDH 4, HDP 1 includes HCatalog, an extension of Hive technology that serves as a more general metadata store.
Hortonworks thinks HCatalog is a big deal in improving Hadoop data management and connectivity, and already has a Talend partnership based on HCatalog. Cloudera is less sure, especially in HCatalog’s current form.
HDP 1 includes Ambari, an Apache open source competitor to Cloudera Manager (the closed-source part of Cloudera Enterprise). Hortonworks concedes a functionality gap between Ambari and Cloudera Manager, but perhaps a smaller one than Cloudera sees.
Hortonworks thinks Ambari being open source means better integration with other management platforms. Cloudera touts the integration features and integrations of Cloudera Manager 4.
Nobody seems confident that MapReduce 2 is ready for prime time. While it’s in CDH 4, so is MapReduce 1.

*”CDH” stands, due to some trademarking weirdness, for “Cloudera’s Distribution including Apache Hadoop”. “HDP” stands for “Hortonworks Data Platform”.

netneladno

Let's delve into the subject of graph data in more detail. Recently there was a paper by Rohloff et al. that showed how to store graph data (represented in vertex-edge-vertex "triple" format) in Hadoop, and perform sub-graph pattern matching in a scalable fashion over this graph data. The particular focus of the paper is on Semantic Web graphs (where the data is stored in RDF and the queries are performed in SPARQL), but the techniques presented in the paper are generalizable to other types of graphs. This paper and the resulting system (called SHARD) have received significant publicity, including a presentation at HadoopWorld 2010, a presentation at DIDC 2011, and a feature on Cloudera's website <cloudera.com>. In fact, it is a very nice technique. It leverages Hadoop to scale sub-graph pattern matching (something that has historically been difficult to do); and by aggregating all outgoing edges for a given vertex into the same key-value pair in Hadoop, it even scales queries in a way that is 2-3 times more efficient than the naive way to use Hadoop for the same task.
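To make that aggregation trick concrete, here is a minimal MapReduce sketch (my own illustration, not SHARD's actual code) that collects every outgoing edge of a vertex into a single key-value pair; the class names and the one-triple-per-line input layout are assumptions:

```java
// Group all outgoing edges of each vertex (triple subject) under one key,
// so a vertex's full edge list lives in a single key-value pair.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GroupEdgesBySubject {

  // Input: one "subject predicate object" triple per line.
  public static class TripleMapper extends Mapper<Object, Text, Text, Text> {
    private final Text subject = new Text();
    private final Text edge = new Text();

    @Override
    protected void map(Object key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] spo = value.toString().split("\\s+", 3);
      if (spo.length < 3) return;           // skip malformed lines
      subject.set(spo[0]);
      edge.set(spo[1] + " " + spo[2]);      // predicate + object = one edge
      ctx.write(subject, edge);
    }
  }

  // Output: subject \t "p1 o1; p2 o2; ..." -- all outgoing edges in one record.
  public static class EdgeListReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text subject, Iterable<Text> edges, Context ctx)
        throws IOException, InterruptedException {
      StringBuilder sb = new StringBuilder();
      for (Text e : edges) {
        if (sb.length() > 0) sb.append("; ");
        sb.append(e.toString());
      }
      ctx.write(subject, new Text(sb.toString()));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "group edges by subject");
    job.setJarByClass(GroupEdgesBySubject.class);
    job.setMapperClass(TripleMapper.class);
    job.setReducerClass(EdgeListReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Because all edges for a vertex arrive at one reducer, a sub-graph pattern rooted at that vertex can be checked without an extra join pass, which is where the claimed 2-3x win over the naive triple-per-record layout comes from.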

The only problem is that, as shown by an upcoming VLDB paper that we're releasing today <cs-www.cs.yale.edu>, this technique is an astonishing factor of 1340 times less efficient than an alternative technique for processing sub-graph pattern matching queries within a Hadoop-based system that we introduce in our paper.

netneladno

Anyhow, Yahoo’s most recent standard Hadoop nodes feature:

8-12 cores
48 gigabytes of RAM
12 disks of 2 or 3 TB each
If you divide 12 by 3 for standard Hadoop redundancy, and take off 1/4, then you have 6-9 TB/node. Multiply that by a compression factor of 6-10X, at least for the “curated data,” and you get to 36-90 TB of user data per node.

As an alternative, suppose we take a point figure from Cloudera’s ranges of 16 TB of spinning disk per node (8 spindles, 2 TB/disk). Go with the 6X compression figure. Lop off 1/3 for temp space. That more conservative calculation leaves us a bit over 20 TB/node, which is probably a more typical figure among today’s Hadoop users.
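For anyone who wants to check the arithmetic, a quick sketch reproducing both back-of-the-envelope calculations (all constants are taken straight from the figures above; nothing here is measured):

```java
// Back-of-the-envelope check of the two capacity figures above.
public class HadoopCapacityMath {
  public static void main(String[] args) {
    // Yahoo-style node: 12 disks of 2-3 TB each -> 24-36 TB raw.
    double replication = 3.0;                            // 3x HDFS redundancy
    double overhead = 0.25;                              // take off 1/4
    double usableLow  = 12 * 2.0 / replication * (1 - overhead);  // 6 TB
    double usableHigh = 12 * 3.0 / replication * (1 - overhead);  // 9 TB
    System.out.printf("Yahoo node, pre-compression: %.0f-%.0f TB%n",
        usableLow, usableHigh);
    // 6-10x compression on the "curated data" -> 36-90 TB of user data.
    System.out.printf("With 6-10x compression: %.0f-%.0f TB%n",
        usableLow * 6, usableHigh * 10);

    // Cloudera-style point figure: 8 spindles x 2 TB = 16 TB raw,
    // 6x compression, 1/3 lopped off for temp space.
    double cloudera = 16.0 / replication * (1 - 1.0 / 3.0) * 6;  // ~21.3 TB
    System.out.printf("Cloudera point figure: %.1f TB/node%n", cloudera);
  }
}
```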

netneladno

Hadoop runs on commodity hardware, but for anything more than a simple proof of concept, commodity does not mean a heterogeneous set of surplus machines. As Hadoop moves from proof of concept to full-blown initiative, plan for success. Data sampling is dead, and the success of a big data analytics initiative cannot be shown with a small-scale pilot. Big data analytics is about meeting unmet commercial needs that cannot be met with existing technology. Given the engineering challenge of building a Hadoop cluster at scale, the hardware implementation should be carefully planned. Many of the engineering solutions associated with cloud computing infrastructure are directly applicable to deploying and operating Hadoop clusters.

Additional space, cooling, and power may be required in existing data centers. Because few hardware vendors provide a Hadoop cluster as a single stock-keeping unit (SKU), consistency of build is important. Server selection should focus on the differing requirements of core nodes (JobTracker, NameNode), data nodes (DataNode, TaskTracker), and edge nodes (data movement in/out of the cluster). Core and edge nodes will require increased reliability, redundancy, and support. Hadoop assumes data nodes will fail, so support for these servers can be reduced.

No standard hardware configuration exists for data nodes. When the initiative starts, it will be unclear whether the MapReduce job profile is data- or compute-centric. Provide headroom in memory, cores, and disk so that as the job profile becomes better understood, the hardware can be adjusted as required. The data-aware nature of HDFS and MapReduce means that a physical model, not a virtualized one, is preferred.

Network bandwidth may become an issue. Hadoop is rack-aware and reduces network traffic as it moves data around during the shuffle and reduce phases. Consider whether increasing network bandwidth or provisioning additional nodes is the more cost-effective approach.
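Rack awareness is driven by a user-supplied topology script that maps hosts to rack paths; a minimal sketch of the wiring, where the script path is hypothetical and the property name is the standard Hadoop 2.x one (topology.script.file.name in 1.x):

```java
// Sketch: registering a topology script so HDFS can place replicas and
// schedule tasks with rack locality in mind. Hadoop invokes the script with
// hostnames/IPs as arguments; it must print one rack path (e.g. /dc1/rack42)
// per input host. Unmapped hosts fall back to /default-rack.
import org.apache.hadoop.conf.Configuration;

public class RackAwarenessConfig {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.set("net.topology.script.file.name", "/etc/hadoop/topology.sh");
    System.out.println(conf.get("net.topology.script.file.name"));
  }
}
```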

Installing, configuring, and implementing Hadoop requires sophisticated instance management, network security engineering, data partitioning, and job coordination. Consider deploying a cluster management utility to reduce the administration effort involved in providing consistency of build and cluster operations (i.e., monitoring and administration).

A detailed analysis of these topics is contained in the Intel white paper "Optimizing Hadoop Deployments" <software.intel.com>.