
@netneladno:

fresh posts at hbasecon.com

@netneladno:

0.96 is out
Cloudera's new CDH will ship around the beginning of next year, but they'll announce a more precise date soon

@netneladno:

a great presentation on what crap this HBase of yours is
slideshare.net

@netneladno:

distcp didn't work out

@netneladno:

dudes, you won't believe it: if you need to migrate, say, HBase from one cluster to another, or from one CDH version to another, then on the receiving side you just run

hadoop distcp hftp://hadoop-namenode.cluster1/hbase hdfs://hadoop-namenode.cluster2/hbase

and that's it. You just also need MapReduce running on the receiving side

@netneladno:

dudes, tell me, has anyone tested Cloudera Impala?
I'll even ask you to recommend this post

@netneladno:

i.imgur.com
-export HBASE_OPTS="-Xms8096m -ea -Xmx8096m -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=50 -XX:+UseCMSInitiatingOccupancyOnly -XX:NewSize=300M -XX:MaxNewSize=300M"

+export HBASE_OPTS="-Xms24g -ea -Xmx24g -XX:+UseParallelGC -XX:+UseNUMA -XX:+UseParallelOldGC -XX:+UseCompressedOops -XX:MaxGCPauseMillis=400"

@netneladno:

I recommend blog.cloudera.com

@netneladno:

missed a nice one: blog.cloudera.com

@netneladno:

Log splitting

As we mentioned in the write path blog post, HBase data updates are stored in a place in memory called memstore for fast write. In the event of a region server failure, the contents of the memstore are lost because they have not been saved to disk yet. To prevent data loss in such a scenario, the updates are persisted in a WAL file before they are stored in the memstore. In the event of a region server failure, the lost contents in the memstore can be regenerated by replaying the updates (also called edits) from the WAL file.

A region server serves many regions. All of the regions in a region server share the same active WAL file. Each edit in the WAL file has information about which region it belongs to. When a region is opened, we need to replay those edits in the WAL file that belong to that region. Therefore, edits in the WAL file must be grouped by region so that particular sets can be replayed to regenerate the data in a particular region. The process of grouping the WAL edits by region is called log splitting. It is a critical process for recovering data if a region server fails.
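The grouping step above can be sketched in a few lines. This is a toy illustration only, assuming a simplified WAL entry shape; the names here (`split_log`, the dict-based edit records) are invented for the sketch and are not HBase's actual implementation:

```python
# Toy sketch of log splitting: group a failed server's WAL edits by the
# region they belong to, so each region can replay only its own edits.
from collections import defaultdict

def split_log(wal_edits):
    """Group WAL edits by region, preserving their original order."""
    per_region = defaultdict(list)
    for edit in wal_edits:
        # Each WAL edit carries the region it belongs to, as described above.
        per_region[edit["region"]].append(edit)
    return dict(per_region)

wal = [
    {"region": "r1", "row": "a", "value": 1},
    {"region": "r2", "row": "b", "value": 2},
    {"region": "r1", "row": "c", "value": 3},
]
split = split_log(wal)
print(sorted(split))      # the regions found in the WAL
print(len(split["r1"]))   # number of edits to replay for region r1
```

In real HBase the per-region groups are written out as recovered-edits files that each region replays when it is reopened; the sketch only shows the grouping itself.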

@netneladno:

Cloudera introduced CDH 4* and Hortonworks introduced HDP 1*, both timed for the recent Hadoop Summit.
CDH 4 is based mainly on Hadoop 2.0, which Cloudera says it has tested extensively.
HDP 1 is based on Hadoop 1.0, on the theory that nobody has properly tested Hadoop 2.0, which is still characterized as “alpha”.
CDH 4 boasts sub-second NameNode failover.
Hortonworks is partnering with third parties such as VMware to address the high-availability problems caused by failover potentially taking several minutes.
Hadoop 2.0 and CDH 4 also incorporate improvements to NameNode scalability, HDFS (Hadoop Distributed File System) performance, HBase performance, and HBase functionality.
Like CDH 4, HDP 1 includes HCatalog, an extension of Hive technology that serves as a more general metadata store.
Hortonworks thinks HCatalog is a big deal in improving Hadoop data management and connectivity, and already has a Talend partnership based on HCatalog. Cloudera is less sure, especially in HCatalog’s current form.
HDP 1 includes Ambari, an Apache open source competitor to Cloudera Manager (the closed-source part of Cloudera Enterprise). Hortonworks concedes a functionality gap between Ambari and Cloudera Manager, but perhaps a smaller one than Cloudera sees.
Hortonworks thinks Ambari being open source means better integration with other management platforms. Cloudera touts the integration features and integrations of Cloudera Manager 4.
Nobody seems confident that MapReduce 2 is ready for prime time. While it’s in CDH 4, so is MapReduce 1.

*”CDH” stands, due to some trademarking weirdness, for “Cloudera’s Distribution including Apache Hadoop”. “HDP” stands for “Hortonworks Data Platform”.

@netneladno:

static.usenix.org
a report on HDFS usage at Yahoo

@korchasa:

A DSL for HBase, for those tired of writing Bytes.toBytes() and duplicating families and columns. github.com

@netneladno:

requests=447460, regions=34, usedHeap=1924, maxHeap=8066
that's per node, by the way

@korchasa:

Just a day and a half of screwing around, and I got HBase running in distributed mode with a Java client. Now we're in business!

@netneladno:

Although not a relational database, HBase has the same table, row, and column constructs. Each row has a primary key, which can be of any type and determines the sort order of the rows. While HBase provides the capability for secondary indices, they are poorly supported at present. Each cell (the intersection of a row and column) is versioned and can hold any data type. Column families are defined when an HBase table is created, but the columns within a family can be added dynamically. Each column family is stored together on the file system.
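The constructs described above can be modeled with plain data structures. This is a minimal sketch of the data model only (sorted row keys, fixed column families, versioned cells), with invented class and method names; it does not reflect HBase's API or on-disk format:

```python
# Toy model of HBase's constructs: rows sorted by primary key, columns
# grouped into families fixed at table creation, cells versioned by timestamp.
import bisect

class ToyHTable:
    def __init__(self, families):
        self.families = set(families)  # column families are fixed up front
        # row key -> {(family, qualifier): [(timestamp, value), ...]}
        self.rows = {}

    def put(self, row, family, qualifier, value, ts):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        cell = self.rows.setdefault(row, {}).setdefault((family, qualifier), [])
        bisect.insort(cell, (ts, value))  # keep cell versions ordered by timestamp

    def get(self, row, family, qualifier):
        """Return the newest version of a cell."""
        versions = self.rows.get(row, {}).get((family, qualifier), [])
        return versions[-1][1] if versions else None

    def scan(self):
        """Yield rows in key order -- the primary key determines sort order."""
        for key in sorted(self.rows):
            yield key, self.rows[key]

t = ToyHTable(["cf"])
t.put("row2", "cf", "col", "old", ts=1)
t.put("row2", "cf", "col", "new", ts=2)
t.put("row1", "cf", "col", "x", ts=1)
print(t.get("row2", "cf", "col"))   # the newest version wins
print([k for k, _ in t.scan()])     # rows come back in key order
```

Note how new qualifiers can be added to a row at any time without any schema change, while an unknown column family is rejected, which mirrors the family/column asymmetry described above.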

A comment from Hadoop: The Definitive Guide, Second Edition brings the difference between HBase and traditional DBMSs into sharp relief, "We currently have tables with hundreds of millions of rows and tens of thousands of columns; the thought of storing billions of rows and millions of columns is exciting, not scary."

Many readers, including the author, will find the idea of a table with "tens of thousands of columns" and no intrinsic metadata unusable, ungovernable, and, yes, scary! Just because the technology allows for such a structure does not make it good architectural practice. Database designers should apply and adapt the best practices they employ when designing column-oriented databases when working with HBase's more flexible data structures.

@Kxepal:

horicky.blogspot.com

@Nim:

Kept the taglines from the analysis, for later use in a project delivering content-oriented resources, plus the eternal metadata. Dropped the rest. Requirements: binary data, distribution, scalability, availability, a sane API doc, and the possibility of thrill-python or a ready-made Python binding (the latter, though, remains a dream for now).