← All posts tagged hbase

netneladno
PostgreSQL at long last
With the release of PostgreSQL 9.3, it lets us do some really cool things with writable foreign tables. BigSQL just released a Hadoop Foreign Data Wrapper that can write to HDFS files and HBase tables. The HBase integration allows full SELECT, INSERT, UPDATE and DELETE syntax through PostgreSQL, and the HDFS integration allows SELECT and INSERT. The HadoopFDW is released under the PostgreSQL license and can be found here.
netneladno
hbase guys, you won't believe it: if you need to migrate, say, HBase from one cluster to another, or from one CDH version to another, all you have to do on the receiving side is run

hadoop distcp hftp://hadoop-namenode.cluster1/hbase hdfs://hadoop-namenode.cluster2/hbase

and that's it. The only catch is you also need MapReduce running on the receiving side.
netneladno
hbase guys, tell me, has anyone tested Cloudera Impala?
and I'll even ask you to reshare this post
netneladno
hbase i.imgur.com
+export HBASE_OPTS="-Xms24g -ea -Xmx24g -XX:+UseParallelGC -XX:+UseNUMA -XX:+UseParallelOldGC -XX:+UseCompressedOops -XX:MaxGCPauseMillis=400"

-export HBASE_OPTS="-Xms8096m -ea -Xmx8096m -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=50 -XX:+UseCMSInitiatingOccupancyOnly -XX:NewSize=300M -XX:MaxNewSize=300M"
netneladno
hadoop, big data, analytics, bigdata, cloudera, data science, data scientist, business intelligence, mapreduce, data warehouse, data warehousing, mahout, hbase, nosql, newsql, businessintelligence, cloudcomputing

go ahead, pile on the abuse
netneladno
NoSQL I'm looking over here at blog.couchbase.com and I see a single Couchbase node doing 300k little selects
on key-value
and I keep wondering what people are so restless about, since both InnoDB and Postgres put up numbers like that without much trouble
sure, they don't have INCREMENTAL MAPREDUCE, but damn
netneladno
SQL impala a bit more about the new enterprise beast from Cloudera
As of the initial Impala release(s):
• Impala will run against a variety of storage managers, choices among which will have different performance implications. HDFS (Hadoop Distributed File System) and HBase will both be supported. Multiple HDFS formats will be supported, both row-based and columnar. (See the Trevni comments in my first Impala post.)
• In the simplest of scanning scenarios, Impala can read row-based data at close to the theoretically optimal speed, while Hive runs at 1/3 of that.
• Initially, all Impala joins will be (distributed) hash joins. These seem to start at 10X Hive’s performance and go up from there.
• The fastest Impala queries take > 1 second.
• One test showed Impala surviving a load of 100 concurrent queries. Another test showed Impala running 10 cloned copies of a query with 25%ish performance degradation.
• Impala will have MicroStrategy support on Day 1, so it obviously can handle fairly complex SQL. (Also Pentaho, Tableau, and QlikView.)
• Column statistics and the like are under active development, which will help in query optimization. A true cost-based optimizer is, of course, further off.
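Since all the initial joins are distributed hash joins, here's roughly what that looks like; a toy single-process Python sketch where partitions stand in for worker nodes (every name here is mine, not Impala's):

```python
# Toy partitioned hash join, in the spirit of what Impala does across nodes.
# "Partitions" stand in for workers; all names are illustrative.

def hash_join(build_rows, probe_rows, key, num_partitions=4):
    """Join two lists of dict rows on `key` using partitioned hash tables."""
    # 1. Partition both inputs by hash of the join key (the "shuffle" step).
    build_parts = [[] for _ in range(num_partitions)]
    probe_parts = [[] for _ in range(num_partitions)]
    for row in build_rows:
        build_parts[hash(row[key]) % num_partitions].append(row)
    for row in probe_rows:
        probe_parts[hash(row[key]) % num_partitions].append(row)

    # 2. Per partition: build a hash table on one side, probe with the other.
    out = []
    for bp, pp in zip(build_parts, probe_parts):
        table = {}
        for row in bp:
            table.setdefault(row[key], []).append(row)
        for row in pp:
            for match in table.get(row[key], []):
                out.append({**match, **row})
    return out

users = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
orders = [{"id": 1, "total": 10}, {"id": 1, "total": 20}, {"id": 3, "total": 5}]
joined = hash_join(users, orders, "id")  # both matches land on id=1
```

Rows with the same key hash to the same partition, which is exactly why the real thing parallelizes well: each node only ever probes its own slice.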
netneladno
hbase well, here's something I didn't know — Log splitting

As we mentioned in the write path blog post, HBase data updates are stored in a place in memory called the memstore for fast writes. In the event of a region server failure, the contents of the memstore are lost because they have not been saved to disk yet. To prevent data loss in such a scenario, the updates are persisted in a WAL file before they are stored in the memstore. In the event of a region server failure, the lost contents of the memstore can be regenerated by replaying the updates (also called edits) from the WAL file.
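That write path fits in a few lines of toy Python; this is a sketch of the idea only, not HBase's actual classes or APIs:

```python
# Toy model of the HBase write path: append to the WAL first, then apply to
# the in-memory memstore. On a crash the memstore is rebuilt by replaying
# the WAL. All names are illustrative, none of this is real HBase code.

class RegionServer:
    def __init__(self):
        self.wal = []        # append-only log; pretend it lives on disk
        self.memstore = {}   # fast in-memory store, lost on crash

    def put(self, region, row, value):
        self.wal.append((region, row, value))   # persist the edit first
        self.memstore[(region, row)] = value    # then apply it in memory

    def crash(self):
        self.memstore = {}   # memory is gone; the WAL on "disk" survives

    def recover(self):
        # Replay every edit from the WAL to regenerate the memstore.
        for region, row, value in self.wal:
            self.memstore[(region, row)] = value

rs = RegionServer()
rs.put("r1", "row1", "v1")
rs.crash()
rs.recover()                 # memstore holds ("r1", "row1") -> "v1" again
```

The ordering is the whole point: because the WAL append happens before the memstore update, an edit can never exist only in memory.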

A region server serves many regions. All of the regions in a region server share the same active WAL file. Each edit in the WAL file has information about which region it belongs to. When a region is opened, we need to replay those edits in the WAL file that belong to that region. Therefore, edits in the WAL file must be grouped by region so that particular sets can be replayed to regenerate the data in a particular region. The process of grouping the WAL edits by region is called log splitting. It is a critical process for recovering data if a region server fails.
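Log splitting itself boils down to a group-by over the shared WAL; a minimal sketch, with edits modeled as (region, edit) pairs rather than the real WAL file format:

```python
from collections import defaultdict

# Toy log splitting: one shared WAL holds interleaved edits for many regions;
# recovery needs them regrouped so each region replays only its own edits.

def split_log(wal_edits):
    """Group (region, edit) pairs from a shared WAL into per-region lists."""
    per_region = defaultdict(list)
    for region, edit in wal_edits:   # order within each region is preserved
        per_region[region].append(edit)
    return dict(per_region)

wal = [("r1", "put a"), ("r2", "put b"), ("r1", "del a")]
split_log(wal)  # {"r1": ["put a", "del a"], "r2": ["put b"]}
```

Keeping per-region order matters: replaying "del a" before "put a" would regenerate the wrong state.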