We used Apache Hadoop to compete in Jim Gray's Sort benchmark. Jim Gray's sort benchmark consists of a set of many related benchmarks, each with its own rules. All of the sort benchmarks measure the time to sort different numbers of 100-byte records. The first 10 bytes of each record are the key and the rest is the value. The minute sort must finish end to end in less than a minute. The Gray sort must sort more than 100 terabytes and must run for at least an hour.
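To make the record format concrete, here is a minimal sketch in Java of the 100-byte record layout described above; the class and method names are illustrative, not the benchmark's actual code.

```java
// Minimal sketch of the record layout used by the sort benchmarks
// (not the actual TeraSort code): each record is 100 bytes, with a
// 10-byte key followed by a 90-byte value.
public class SortRecord {
    public static final int RECORD_LENGTH = 100;
    public static final int KEY_LENGTH = 10;

    public static byte[] key(byte[] record) {
        byte[] key = new byte[KEY_LENGTH];
        System.arraycopy(record, 0, key, 0, KEY_LENGTH);
        return key;
    }

    public static byte[] value(byte[] record) {
        byte[] value = new byte[RECORD_LENGTH - KEY_LENGTH];
        System.arraycopy(record, KEY_LENGTH, value, 0, value.length);
        return value;
    }
}
```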
The best times we observed were:

| Bytes | Nodes | Maps | Reduces | Replication | Time |
| --- | --- | --- | --- | --- | --- |
| 500,000,000,000 | 1406 | 8000 | 2600 | 1 | 59 seconds |
| 1,000,000,000,000 | 1460 | 8000 | 2700 | 1 | 62 seconds |
| 100,000,000,000,000 | 3452 | 190,000 | 10,000 | 2 | 173 minutes |
| 1,000,000,000,000,000 | 3658 | 80,000 | 20,000 | 2 | 975 minutes |

Within the rules for the 2009 Gray sort, our 500 GB sort set a new record for the minute sort, and the 100 TB sort set a new record of 0.578 TB/minute. The 1 PB sort ran after the 2009 deadline, but improves the speed to 1.03 TB/minute. The 62 second terabyte sort would have set a new record, but the terabyte benchmark that we won last year has been retired. (Clearly the minute sort and terabyte sort are rapidly converging, and thus it is not a loss.) One piece of trivia is that only the petabyte dataset had any duplicate keys (40 of them).
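As a quick sanity check on those throughput figures: 100 TB in 173 minutes works out to roughly 100 / 173 ≈ 0.578 TB/minute, and 1 PB in 975 minutes to roughly 1000 / 975 ≈ 1.03 TB/minute, matching the records claimed above.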
We ran our benchmarks on Yahoo's Hammer cluster. Hammer's hardware is very similar to the hardware that we used in last year's terabyte sort. The hardware and operating system details are:
- approximately 3800 nodes (in such a large cluster, nodes are always down)
- 2 quad-core Xeons @ 2.5 GHz per node
- 4 SATA disks per node
- 8 GB RAM per node (upgraded to 16 GB before the petabyte sort)
- 1 gigabit ethernet on each node
- 40 nodes per rack
- 8 gigabit ethernet uplinks from each rack to the core
- Red Hat Enterprise Linux Server Release 5.1 (kernel 2.6.18)
- Sun Java JDK (1.6.0_05-b13 and 1.6.0_13-b03) (32 and 64 bit)
We hit a JVM bug that caused a core dump in 1.6.0_05-b13 on the larger sorts (100 TB and 1 PB) and switched over to the later JVM, which resolved the issue. For the larger sorts, we used 64-bit JVMs for the Name Node and Job Tracker.
Because the smaller sorts needed lower latency and a faster network, we only used part of the cluster for those runs. In particular, instead of our normal 5:1 oversubscription between racks, we limited it to 16 nodes in each rack for a 2:1 oversubscription. The smaller runs can also use output replication of 1: because they only take minutes to run and run on smaller clusters, the likelihood of a node failing during the run is fairly low. On the larger runs, failure is expected, and thus replication of 2 is required. HDFS protects against data loss during rack failure by writing the second replica on a different rack, so writing that second replica is relatively slow.
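As a rough illustration of that trade-off, here is a sketch of how a job could set its HDFS output replication. The "dfs.replication" property is the standard HDFS client setting, but the class and the choice of values below are illustrative rather than our actual benchmark configuration.

```java
import org.apache.hadoop.mapred.JobConf;

// Illustrative sketch only, not the benchmark's actual configuration.
public class OutputReplication {
    public static void configure(JobConf conf, boolean shortRun) {
        if (shortRun) {
            // Minutes-long run on a small cluster: a node failure mid-run is
            // unlikely, so one replica is enough and avoids the slow
            // cross-rack write of a second copy.
            conf.setInt("dfs.replication", 1);
        } else {
            // Hours-long run on thousands of nodes: failures are expected,
            // so keep a second replica on another rack.
            conf.setInt("dfs.replication", 2);
        }
    }
}
```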
Below are the timelines for the jobs, counting from the job submission at the Job Tracker. The diagrams show the number of tasks running at each point in time. While maps only have a single phase, the reduces have three: shuffle, merge, and reduce. The shuffle is the transfer of the data from the maps. Merge doesn't happen in these benchmarks, because none of the reduces need multiple levels of merges. Finally, the reduce phase is where the final merge and the writing to HDFS happen. I've also included a category named waste that represents task attempts that were running, but ended up either failing or being killed (often as speculatively executed task attempts).
If you compare this year's charts to last year's, you'll notice that tasks are launching much faster now. Last year we only launched one task per heartbeat, so it took 40 seconds to get all of the tasks launched. Now, Hadoop will fill up a Task Tracker in a single heartbeat. Reducing that job launch overhead is very important for getting runs under a minute.
As with last year, we ran with significantly larger tasks than the defaults for Hadoop. Even with the new, more aggressive shuffle, minimizing the number of transfers (maps * reduces) is very important to the performance of the job. Notice that in the petabyte sort, each map is processing 15 GB instead of the default 128 MB, and each reduce is handling 50 GB. When we ran the petabyte sort with more typical values of 1.5 GB per map, it took 40 hours to finish. Therefore, to increase throughput, it makes sense to consider increasing the default block size, which translates into the default map size, to at least 1 GB.
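For illustration, here is a sketch of how one might raise the input size handled by each map, assuming the 0.20-era property names "dfs.block.size" and "mapred.min.split.size"; the values and class name are examples, not the settings we used for the benchmark runs.

```java
import org.apache.hadoop.mapred.JobConf;

// Illustrative sketch: fewer, larger maps by raising block and split sizes.
public class LargeSplits {
    private static final long ONE_GB = 1024L * 1024 * 1024;

    public static void configure(JobConf conf) {
        // Block size applies to files written with this configuration,
        // so newly written input data gets 1 GB blocks instead of the
        // much smaller default.
        conf.setLong("dfs.block.size", ONE_GB);
        // Minimum split size keeps each map's input at least 1 GB even
        // when the underlying file was written with smaller blocks.
        conf.setLong("mapred.min.split.size", ONE_GB);
    }
}
```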
We used a branch of trunk with some modifications that will be pushed back into trunk. The primary ones are that we reimplemented the shuffle to re-use connections, and we reduced latencies and made timeouts configurable. More details, including the changes we made to Hadoop, are available in our report on the results.
-- Owen O'Malley and Arun Murthy
Posted at May 11, 2009 3:00 PM