жуйк, посоветуй системку на хадупе (или самостоятельную), которая позволяет делать SQL-подобные запросы по неструктурированным данным (хранятся сейчас в JSON'е, перегнать в любой другой формат не проблема). В принципе, подойдут и запросы по типу монговских, но SQL предпочтительнее в данном случае. Сами запросы должны поддерживать фильтрацию, группировку, сортировку и стандартный набор аггрегирующих функций

Here are the recommended specifications for DataNode/TaskTrackers in a balanced Hadoop cluster:
12-24 1-4TB hard disks in a JBOD (Just a Bunch Of Disks) configuration
2 quad-/hex-/octo-core CPUs, running at least 2-2.5GHz
64-512GB of RAM
Bonded Gigabit Ethernet or 10Gigabit Ethernet (the more storage density, the higher the network throughput needed)
коммодити хардваре яебал

Cloudera introduced CDH 4* and Hortonworks introduced HDP 1*, both timed for the recent Hadoop Summit.
CDH 4 is based mainly on Hadoop 2.0, which Cloudera says it has tested extensively.
HDP 1 is based on Hadoop 1.0, on the theory that nobody has properly tested Hadoop 2.0, which is still characterized as “alpha”.
CDH 4 boasts sub-second NameNode failover.
Hortonworks is partnering with third parties such as VMware to address the high-availability problems caused by failover potentially taking several minutes.
Hadoop 2.0 and CDH 4 also incorporate improvements to NameNode scalability, HDFS (Hadoop Distributed File System) performance, HBase performance, and HBase functionality.
As does CDH 4, HDW 1 includes HCatalog, an extension of Hive technology that serves as a more general metadata store.
Hortonworks thinks HCatalog is a big deal in improving Hadoop data management and connectivity, and already has a Talend partnership based on HCatalog. Cloudera is less sure, especially in HCatalog’s current form.
HDP 1 includes Ambari, an Apache open source competitor to Cloudera Manager (the closed-source part of Cloudera Enterprise). Hortonworks concedes a functionality gap between Ambari and Cloudera Manager, but perhaps a smaller one than Cloudera sees.
Hortonworks thinks Ambari being open source means better integration with other management platforms. Cloudera touts the integration features and integrations of Cloudera Manager 4.
Nobody seems confident that MapReduce 2 is ready for prime time. While it’s in CDH 4, so is MapReduce 1.

*”CDH” stands, due to some trademarking weirdness, for “Cloudera’s Distribution including Apache Hadoop”. “HDP” stands for “Hortonworks Data Platform”.

Кто же знал что hadoop после вызоыва Combine полагает, что я отдам результаты в отсортированном порядке... 3 часа не мог понять почему после Combine мой фиктивный ключ, который должен всегда обрабатываться перед основными записями не обрабатывался...

мир программирования на java — это какая-то тоталитарная секта. сплошной заговор против человечества. запускаю хадуп на амазоне и смотрю в логах этого хадупа на эксепшены, которые его убивают прямо на старте. потому что в конфигах что-то не то. а потом сижу и вычисляю, в каком именно конфиге и что не так. почему в лог человеческим языком не написать, что такой-то параметр в таком-то файле надо подкрутить? вместо этого мне под нос суют стектрейс. и это у них в жаве мейнстрим такой.

Let's delve into the subject of graph data in more detail. Recently there was a paper by Rohloff et. al. <dist-systems.bbn.com> that showed how to store graph data (represented in vertex-edge-vertex "triple" format) in Hadoop, and perform sub-graph pattern matching in a scalable fashion over this graph of data. The particular focus of the paper is on Semantic Web graphs (where the data is stored in RDF and the queries are performed in SPARQL), but the techniques presented in the paper are generalizable to other types of graphs. This paper and resulting system (called SHARD) has received significant publicity, including a presentation at HadoopWorld 2010, a presentation at DIDC 2011, and a feature on Cloudera's Website <cloudera.com>. In fact, it is a very nice technique. It leverages Hadoop to scale sub-graph pattern matching (something that has historically be difficult to do); and by aggregating all outgoing edges for a given vertex into the same key-value pair in Hadoop, it even scales queries in a way that is 2-3 times more efficient than the naive way to use Hadoop for the same task.

The only problem is that, as shown by an upcoming VLDB paper that we're releasing today <cs-www.cs.yale.edu>, this technique is an astonishing factor of 1340 times less efficient than an alternative technique for processing sub-graph pattern matching queries within a Hadoop-based system that we introduce in our paper.

Anyhow, Yahoo’s most recent standard Hadoop nodes feature:

8-12 cores
48 gigabytes of RAM
12 disks of 2 or 3 TB each
If you divide 12 by 3 for standard Hadoop redundancy, and take off 1/4, then you have 6-9 TB/node. Multiple that by a compression factor of 6-10X, at least for the “curated data,” and you get to 36-90 TB of user data per node.

As an alternative, suppose we take a point figure from Cloudera’s ranges of 16 TB of spinning disk per node (8 spindles, 2 TB/disk). Go with the 6X compression figure. Lop off 1/3 for temp space. That more conservative calculation leaves us a bit over 20 TB/node, which is probably a more typical figure among today’s Hadoop users.

Hadoop runs on commodity hardware; but, for anything more than a simple proof of concept, commodity does not mean a heterogeneous set of surplus machines. As Hadoop moves from proof of concept to full-blown initiative, plan for success. Data sampling is dead, and the success of a big data analytics initiative cannot be shown with a small-scale pilot. Big data analytics is about meeting unmet commercial needs that cannot be met with existing technology. Given the engineering challenge of building a Hadoop cluster at scale, the hardware implementation should be carefully planned. Many of the engineering solutions associated with cloud computing infrastructure are directly applicable to deploying and operating Hadoop clusters.

Additional space, cooling, and power may be required in existing data centers, and because few hardware vendors provide a Hadoop cluster as a single stock-keeping unit (SKU), consistency of build is important. Server selection should focus on the differing requirements of core nodes (JobTracker, namenodes), data nodes (datanodes, TaskTracker) and edge nodes (data movement in/out of the cluster). Core and edge nodes will require increased reliability, redundancy, and support. Hadoop assumes data nodes will fail, so support for these servers can be reduced.

No standard hardware configuration exists for data nodes. When the initiative starts, it will be unclear if the MapReduce job profile is data or compute centric. Provide headroom in memory, cores, and disk so that as the job profile becomes better understood, the hardware can be adjusted as required. The data-aware nature of HDFS and MapReduce means that a physical, not virtualization, model is preferred.

Network bandwidth may become an issue. Hadoop is rack-aware and reduces network bandwidth as it moves data around during the reduce functions. Consider if the cost of increasing the network bandwidth or provisioning additional nodes is the best approach.

Installing, configuring, and implementing Hadoop requires sophisticated instance management, network security engineering, data partitioning, and job coordination. Consider deploying a cluster management utility to reduce the administration effort involved in providing consistency of build and cluster operations (i.e., monitoring and administration).

A detailed analysis of these topics is contained in the Intel white paper "Optimizing Hadoop Deployments." software.intel.com

Hadoop is exciting in the “Big Data” world because it doesn’t pre-suppose any structure. Data in any structure can be stored in plain files. Applications can read the files and build their own structures on the fly. It is liberating. However, it is not efficient – precisely because it reduces all data to its base form of files and robs the data of its structure – the structure that would allow for efficient processing or storage by applications! Each application has to redo its work from scratch.

Most Hadoop deployments today use systems with dual socket and quad or hex cores (8 or 12 cores total, 16 or 24 hyper-threaded). Storage has increased as well with 6-8 spindles being common and some deployments going to 12 spindles. These are SATA disks with between 1TB and 2TB capacity. The amount of RAM varies depending on the application. 24GB is common as is 36GB – all ECC RAM. HBase clusters may have more RAM so they can cache more data. Some customers put Hadoop on their “standard box” which may not be perfectly balanced (e.g. more RAM, less disk) and needs to be altered slightly to meet the above specs. The new Dell C2100 series and the HP SL170 series are both popular server lines for Hadoop.

Я кстати все больше замечаю в своем круге общения людей, горящих желанием попробовать hadoop. Кому-то надо писать неебические количества данных (даже порядок цифр не могут назвать!), кому-то надо делать аналитику. Петабайты блядь у всех.
мало какая еще есть индустрия, где люди так движимы модой