In a typical Hadoop MapReduce job, input files are read from HDFS. Data are usually compressed to reduce file sizes. After decompression, serialized bytes are transformed into Java objects before being passed to a user-defined map() function. Conversely, output records are serialized, compressed, and eventually pushed back to HDFS. This seemingly simple, two-way process is in fact considerably more complicated, for a few reasons:
Optimizing Input Pipeline
Adding everything up, including a "copy" for decompressing bytes, the whole read pipeline involves seven buffer copies to deliver a record to the MapTask's map() function, counting from the moment data arrive in the process's kernel buffer. A couple of things in this process could be improved:
• Many buffer copies are needed simply to convert between direct buffers and byte[] buffers.
• Checksum calculation can be done in bulk instead of one chunk at a time (see the sketch after this list).
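The second point can be illustrated with a small, self-contained sketch. It contrasts the chunk-at-a-time pattern, where each chunk is first copied out of a direct buffer into a byte[] before being checksummed, with a bulk update computed directly over the buffer. The use of java.util.zip.CRC32 and the 512-byte chunk size are illustrative assumptions; Hadoop's own checksumming (FSInputChecker/DataChecksum) keeps a separate CRC per chunk, which is omitted here to focus on the copy pattern.

import java.nio.ByteBuffer;
import java.util.zip.CRC32;

// Sketch only: contrasts per-chunk copying with a bulk checksum update.
public class ChecksumSketch {

    static final int CHUNK = 512; // hypothetical bytes-per-checksum

    // Per-chunk style: each chunk is copied out of the direct buffer into a
    // byte[] before the checksum is updated -- one extra copy per chunk.
    static long perChunk(ByteBuffer data) {
        CRC32 crc = new CRC32();
        byte[] scratch = new byte[CHUNK];
        ByteBuffer view = data.duplicate();
        while (view.hasRemaining()) {
            int n = Math.min(CHUNK, view.remaining());
            view.get(scratch, 0, n);      // copy: direct buffer -> byte[]
            crc.update(scratch, 0, n);
        }
        return crc.getValue();
    }

    // Bulk style: the checksum is computed straight over the ByteBuffer,
    // so no intermediate byte[] copy is needed (CRC32.update(ByteBuffer),
    // available since Java 8).
    static long bulk(ByteBuffer data) {
        CRC32 crc = new CRC32();
        crc.update(data.duplicate());     // reads directly from the buffer
        return crc.getValue();
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocateDirect(4096);
        for (int i = 0; i < buf.capacity(); i++) buf.put((byte) i);
        buf.flip();
        System.out.println(perChunk(buf) == bulk(buf)); // true: same result, fewer copies
    }
}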
Figure 2 shows the post-optimization view, where the total number of buffer copies is reduced from seven to three:
1. An input packet is decomposed into the checksum part and the data part, which are scattered into two direct buffers: an internal one for the checksum bytes, and the direct buffer owned by the decompression layer to hold the compressed bytes. The FSInputChecker accesses both buffers directly.
2. The decompression layer decompresses the compressed bytes into a direct buffer owned by the LineReader.
3. The LineReader scans the bytes in the direct buffer, finds the line separators, and constructs Text objects (a minimal sketch of this scan follows the list).
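The scan in step 3 might look roughly like the following minimal sketch. It walks a direct ByteBuffer in place, locates '\n' separators, and copies each record out exactly once. DirectBufferLineScanner and scanLines are hypothetical names, and String stands in for Hadoop's Text; the real LineReader additionally handles '\r\n' separators, custom delimiters, and records that span buffer refills.

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Sketch only: builds records straight from the decompression layer's
// output buffer, with no intermediate byte[] staging copy.
public class DirectBufferLineScanner {

    static List<String> scanLines(ByteBuffer buf) {
        List<String> lines = new ArrayList<>();
        int start = buf.position();
        for (int i = buf.position(); i < buf.limit(); i++) {
            if (buf.get(i) == (byte) '\n') {          // separator found
                lines.add(slice(buf, start, i));      // one copy per record
                start = i + 1;
            }
        }
        if (start < buf.limit()) {                    // trailing record without '\n'
            lines.add(slice(buf, start, buf.limit()));
        }
        return lines;
    }

    // Copies [from, to) out of the buffer exactly once to build the record.
    static String slice(ByteBuffer buf, int from, int to) {
        byte[] record = new byte[to - from];
        ByteBuffer view = buf.duplicate();
        view.position(from).limit(to);
        view.get(record);
        return new String(record, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] data = "first line\nsecond line\nlast".getBytes(StandardCharsets.UTF_8);
        ByteBuffer direct = ByteBuffer.allocateDirect(data.length);
        direct.put(data).flip();
        System.out.println(scanLines(direct)); // [first line, second line, last]
    }
}

The property worth noting is that nothing sits between the decompression layer's direct buffer and the record object: the scan happens in place, and the only copy is the one that materializes each record.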