the new CDH from Cloudera will ship around the beginning of next year, but they'll announce an exact date soon
hadoop distcp hftp://hadoop-namenode.cluster1/hbase hdfs://hadoop-namenode.cluster2/hbase
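# hftp:// is the read-only HTTP view of the source HDFS, so this works
# even when the two clusters run different Hadoop versions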
and that's it. the only other thing you need is MapReduce on the receiving cluster
and I'll even ask you to recommend this post
+export HBASE_OPTS="-Xms24g -ea -Xmx24g -XX:+UseParallelGC -XX:+UseNUMA -XX:+UseParallelOldGC -XX:+UseCompressedOops -XX:MaxGCPauseMillis=400"
-export HBASE_OPTS="-Xms8096m -ea -Xmx8096m -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=50 -XX:+UseCMSInitiatingOccupancyOnly -XX:NewSize=300M -XX:MaxNewSize=300M"
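Read flag by flag, the change swaps the CMS/ParNew setup for a throughput-oriented collector on a much larger heap; the annotations below are my reading, not the original poster's:

# -Xms24g -Xmx24g           fixed 24 GB heap, up from ~8 GB
# -ea                       Java assertions stay enabled
# -XX:+UseParallelGC        parallel (throughput) young-generation collector
# -XX:+UseParallelOldGC     parallel old-generation collection, replacing CMS
# -XX:+UseNUMA              NUMA-aware allocation in the young generation
# -XX:+UseCompressedOops    compressed object pointers on a 64-bit JVM (heap < 32 GB)
# -XX:MaxGCPauseMillis=400  soft pause-time goal of 400 ms for the collector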
As we mentioned in the write path blog post, HBase data updates are stored in an in-memory area called the memstore for fast writes. If a region server fails, the contents of the memstore are lost because they have not yet been flushed to disk. To prevent data loss in that scenario, each update is persisted to a WAL file before it is stored in the memstore; after a failure, the lost memstore contents can be regenerated by replaying the updates (also called edits) from the WAL file.
A region server serves many regions. All of the regions in a region server share the same active WAL file. Each edit in the WAL file has information about which region it belongs to. When a region is opened, we need to replay those edits in the WAL file that belong to that region. Therefore, edits in the WAL file must be grouped by region so that particular sets can be replayed to regenerate the data in a particular region. The process of grouping the WAL edits by region is called log splitting. It is a critical process for recovering data if a region server fails.
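For intuition, here is a minimal sketch of that grouping step in plain Java; WalEdit and the in-memory list are made-up stand-ins for HBase's real WAL reader and file formats:

import java.util.*;

// Hypothetical stand-in for a WAL entry; the real HBase types differ.
record WalEdit(String regionName, long sequenceId, byte[] payload) {}

public class LogSplitSketch {
    // Group every edit from a failed server's WAL by owning region, so each
    // region can replay only its own edits when it is reopened elsewhere.
    static Map<String, List<WalEdit>> splitByRegion(List<WalEdit> wal) {
        Map<String, List<WalEdit>> perRegion = new HashMap<>();
        for (WalEdit edit : wal) {
            perRegion.computeIfAbsent(edit.regionName(), r -> new ArrayList<>())
                     .add(edit);
        }
        // Replay order within a region must follow the original sequence ids.
        for (List<WalEdit> edits : perRegion.values()) {
            edits.sort(Comparator.comparingLong(WalEdit::sequenceId));
        }
        return perRegion;
    }
}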
CDH 4 is based mainly on Hadoop 2.0, which Cloudera says it has tested extensively.
HDP 1 is based on Hadoop 1.0, on the theory that nobody has properly tested Hadoop 2.0, which is still characterized as “alpha”.
CDH 4 boasts sub-second NameNode failover.
Hortonworks is partnering with third parties such as VMware to address the high-availability problems caused by failover potentially taking several minutes.
Hadoop 2.0 and CDH 4 also incorporate improvements to NameNode scalability, HDFS (Hadoop Distributed File System) performance, HBase performance, and HBase functionality.
Like CDH 4, HDP 1 includes HCatalog, an extension of Hive technology that serves as a more general metadata store.
Hortonworks thinks HCatalog is a big deal in improving Hadoop data management and connectivity, and already has a Talend partnership based on HCatalog. Cloudera is less sure, especially in HCatalog’s current form.
HDP 1 includes Ambari, an Apache open source competitor to Cloudera Manager (the closed-source part of Cloudera Enterprise). Hortonworks concedes a functionality gap between Ambari and Cloudera Manager, but perhaps a smaller one than Cloudera sees.
Hortonworks thinks Ambari being open source means better integration with other management platforms. Cloudera touts the integration features and integrations of Cloudera Manager 4.
Nobody seems confident that MapReduce 2 is ready for prime time. While it’s in CDH 4, so is MapReduce 1.
*“CDH” stands, due to some trademarking weirdness, for “Cloudera’s Distribution including Apache Hadoop”. “HDP” stands for “Hortonworks Data Platform”.
static.usenix.org
a report on HDFS usage at Yahoo
that's per node, by the way
A comment from Hadoop: The Definitive Guide, Second Edition brings the difference between HBase and traditional DBMSs into sharp relief: "We currently have tables with hundreds of millions of rows and tens of thousands of columns; the thought of storing billions of rows and millions of columns is exciting, not scary."
Many readers, including the author, will find the idea of a table with "tens of thousands of columns" and no intrinsic metadata unusable, ungovernable and, yes, scary! Just because the technology allows such a structure does not make it good architectural practice. Database designers should adapt the best practices they already employ for column-oriented databases when working with HBase's more flexible data structures.
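One such discipline is keeping the set of column families small and fixed, and documenting the naming convention for qualifiers instead of minting ad-hoc columns. A sketch against the older HBase client API; the table, family, and qualifier names are invented for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class SchemaSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Assumes a 'users' table created with two fixed column families,
        // e.g. in the HBase shell:  create 'users', 'info', 'metrics'
        HTable table = new HTable(conf, "users");
        Put put = new Put(Bytes.toBytes("user#42"));
        // Families are stable and few; qualifiers may vary per row, but they
        // follow a documented convention rather than growing without bound.
        put.add(Bytes.toBytes("info"), Bytes.toBytes("email"),
                Bytes.toBytes("a@example.com"));
        put.add(Bytes.toBytes("metrics"), Bytes.toBytes("logins"),
                Bytes.toBytes(17L));
        table.put(put);
        table.close();
    }
}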