[Experience Sharing] [Hadoop 5] Word Count Example: Analyzing the Results

Below is the output of running Word Count on two small input files, each only a few KB in size.



hadoop@hadoop-Inspiron-3521:~/hadoop-2.5.2/bin$ hadoop jar WordCountMapReduce.jar /users/hadoop/hello/world /users/hadoop/output5
--->/users/hadoop/hello/world
--->/users/hadoop/output5
14/12/15 22:35:40 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
14/12/15 22:35:41 INFO input.FileInputFormat: Total input paths to process : 2 // two input files to process
14/12/15 22:35:41 INFO mapreduce.JobSubmitter: number of splits:2  // two input splits, one map task per split
14/12/15 22:35:42 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1418652929537_0001
14/12/15 22:35:43 INFO impl.YarnClientImpl: Submitted application application_1418652929537_0001
14/12/15 22:35:43 INFO mapreduce.Job: The url to track the job: http://hadoop-Inspiron-3521:8088/proxy/application_1418652929537_0001/
14/12/15 22:35:43 INFO mapreduce.Job: Running job: job_1418652929537_0001
14/12/15 22:35:54 INFO mapreduce.Job: Job job_1418652929537_0001 running in uber mode : false
14/12/15 22:35:54 INFO mapreduce.Job:  map 0% reduce 0%
14/12/15 22:36:04 INFO mapreduce.Job:  map 50% reduce 0%
14/12/15 22:36:05 INFO mapreduce.Job:  map 100% reduce 0%
14/12/15 22:36:16 INFO mapreduce.Job:  map 100% reduce 100%
14/12/15 22:36:17 INFO mapreduce.Job: Job job_1418652929537_0001 completed successfully
14/12/15 22:36:17 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=3448
FILE: Number of bytes written=299665
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=2574
HDFS: Number of bytes written=1478
HDFS: Number of read operations=9
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=2 // one map task per input file
Launched reduce tasks=1
Data-local map tasks=2 // both map tasks read their data from the local node
Total time spent by all maps in occupied slots (ms)=17425
Total time spent by all reduces in occupied slots (ms)=8472
Total time spent by all map tasks (ms)=17425
Total time spent by all reduce tasks (ms)=8472
Total vcore-seconds taken by all map tasks=17425
Total vcore-seconds taken by all reduce tasks=8472
Total megabyte-seconds taken by all map tasks=17843200
Total megabyte-seconds taken by all reduce tasks=8675328
Map-Reduce Framework
Map input records=90 // the two input files contain 90 lines in total
Map output records=251 // the map phase emitted 251 records, i.e. roughly 3 words per line (251/90)
Map output bytes=2940
Map output materialized bytes=3454
Input split bytes=263
Combine input records=0
Combine output records=0
Reduce input groups=138
Reduce shuffle bytes=3454
Reduce input records=251
Reduce output records=138
Spilled Records=502
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=274
CPU time spent (ms)=3740
Physical memory (bytes) snapshot=694566912
Virtual memory (bytes) snapshot=3079643136
Total committed heap usage (bytes)=513277952
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=2311   // total size of the two input files
File Output Format Counters
Bytes Written=1478 // size of the output file part-r-00000

 

Mapper only, no Reducer

Set the Combiner to the Reducer implementation and, at the same time, set numReduceTasks to 0. Then only the mapper produces output, written to a file named part-m-00000 (a driver sketch follows the observation below). The result shows:

1. The output is neither sorted nor merged for identical keys; in other words, the Combiner has no effect.
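
A minimal driver sketch that reproduces this map-only setup, following the standard WordCount example that ships with Hadoop. The class name WordCountMapOnly and the use of command-line arguments for the input/output paths are illustrative choices, not taken from the original jar:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountMapOnly {

  // Emits (word, 1) for every token in a line.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Sums the counts for one word; used here only as the combiner.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count map-only");
    job.setJarByClass(WordCountMapOnly.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // combiner set to the reducer implementation
    job.setNumReduceTasks(0);                  // map-only: output file is part-m-00000
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

With zero reduce tasks the map output is written straight to HDFS, so the sort/shuffle phase never runs and the registered combiner is never invoked, which is exactly why part-m-00000 comes out unsorted and unmerged.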

Setting five Reducers

The output directory contains five files, part-r-00000 through part-r-00004.
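
To get the five-reducer run, the same driver only needs a real reduce phase and a reducer count of 5; a sketch of the relevant lines, reusing the IntSumReducer class from the sketch above:

// Replace the map-only settings in the driver above with:
job.setReducerClass(IntSumReducer.class);
job.setNumReduceTasks(5);   // one output file per reducer: part-r-00000 .. part-r-00004

Each reduce task writes its own part-r-NNNNN file, so a reducer that happens to receive no keys still produces an (empty) file.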

Setting the block size

When setting the block size, pay attention to the following two parameters; their constraints limit which block sizes can actually be configured:

1. The block size cannot be smaller than dfs.namenode.fs-limits.min-block-size, which defaults to 1048576.

2. The block size must be an integer multiple of io.bytes.per.checksum, which defaults to 512 bytes.

To change the block size to 512 bytes, add the following to hdfs-site.xml:

<property>
  <name>dfs.block.size</name>
  <!--<value>67108864</value>-->
  <value>512</value>
  <description>The default block size for new files.</description>
</property>
<property>
  <name>dfs.namenode.fs-limits.min-block-size</name>
  <!--<value>67108864</value>-->
  <value>256</value>
  <description>The minimum block size.</description>
</property>
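
Independently of the cluster-wide setting above, the block size can also be chosen per file at write time through the HDFS Java API. A minimal sketch, assuming an illustrative path and a 512-byte block size; the value must still satisfy the two constraints listed above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteWithBlockSize {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Per-file block size: must be >= dfs.namenode.fs-limits.min-block-size
    // and a multiple of the checksum chunk size.
    long blockSize = 512L;
    int bufferSize = conf.getInt("io.file.buffer.size", 4096);
    short replication = fs.getDefaultReplication(new Path("/"));

    try (FSDataOutputStream out =
             fs.create(new Path("/users/hadoop/hello/tiny.txt"), // illustrative path
                       true, bufferSize, replication, blockSize)) {
      out.writeBytes("hello hadoop\n");
    }
  }
}

Note that in Hadoop 2.x the current property name is dfs.blocksize; dfs.block.size is the deprecated alias and is still honored. Changing the configuration only affects files written afterwards; existing files keep the block size they were written with.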
 

Pro Apache Hadoop (p. 13):

A map task can run on any compute node in the cluster, and multiple map tasks can run in parallel across the cluster. The map task is responsible for transforming the input records into key/value pairs. The output of all the maps will be partitioned, and each partition will be sorted. There will be one partition for each reduce task. Each partition’s sorted keys and the values associated with the keys are then processed by the reduce task. There can be multiple reduce tasks running in parallel on the cluster.

The key to receiving a list of values for a key in the reduce phase is a phase known as the sort/shuffle phase in MapReduce. All the key/value pairs emitted by the Mapper are sorted by the key in the Reducer. If multiple Reducers are allocated, a subset of keys will be allocated to each Reducer. The key/value pairs for a given Reducer are sorted by key, which ensures that all the values associated with one key are received by the Reducer together.
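
The key-to-reducer assignment mentioned here is performed by the job's Partitioner. By default Hadoop uses HashPartitioner; the sketch below writes the same logic out explicitly as a custom partitioner for the WordCount key/value types, purely to make the mechanism visible:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Same key -> same partition -> same reducer, which is what guarantees that
// all values for one word arrive at a single reduce() call.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    // This mirrors HashPartitioner: mask off the sign bit, then modulo the reducer count.
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}

It would be registered with job.setPartitionerClass(WordPartitioner.class); omitting it gives identical behaviour through the built-in HashPartitioner. Because the partition depends only on the key, every (word, 1) pair for the same word lands on the same reducer, which is what lets reduce() see the complete list of values for that word.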

 

Word Count MapReduce process

[Figure omitted: diagram of the Word Count MapReduce data flow]

p. 24:

Some of the metadata stored by the NameNode includes these:
· File/directory name and its location relative to the parent directory.
· File and directory ownership and permissions.
· File name of individual blocks. Each block is stored as a file in the local file system of the DataNode in the directory that can be configured by the Hadoop system administrator.

How to inspect the data blocks on HDFS

If a file is smaller than one block, it still occupies one block (whose metadata is recorded in the NameNode), but the actual size of that block on disk is just the size of the file.

 

p. 19:

The NameNode file that contains the metadata is fsimage. Any changes to the metadata during the system operation are stored in memory and persisted to another file called edits. Periodically, the edits file is merged with the fsimage file by the Secondary NameNode.

 

The following command can be used to inspect the block layout of files on HDFS:

hdfs fsck / -files -blocks -locations | grep /users/hadoop/wordcount -A 30
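
The same block information can also be read programmatically through the FileSystem API; a small sketch, with an illustrative output path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Illustrative path; point this at any HDFS file you want to inspect.
    FileStatus status = fs.getFileStatus(new Path("/users/hadoop/output5/part-r-00000"));
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.printf("offset=%d length=%d hosts=%s%n",
          block.getOffset(), block.getLength(), String.join(",", block.getHosts()));
    }
  }
}

With the 512-byte block size configured above, a file of a few KB will list several blocks here, while a file smaller than one block shows a single block whose length equals the file size, matching the note above.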
