[Experience Sharing] [Hadoop 5] Word Count Example: Analyzing the Results

Below is the output of running Word Count on two small input files, each only a few KB in size.



hadoop@hadoop-Inspiron-3521:~/hadoop-2.5.2/bin$ hadoop jar WordCountMapReduce.jar /users/hadoop/hello/world /users/hadoop/output5
--->/users/hadoop/hello/world
--->/users/hadoop/output5
14/12/15 22:35:40 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
14/12/15 22:35:41 INFO input.FileInputFormat: Total input paths to process : 2 // two input files to process
14/12/15 22:35:41 INFO mapreduce.JobSubmitter: number of splits:2  // two input splits, one map task per split
14/12/15 22:35:42 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1418652929537_0001
14/12/15 22:35:43 INFO impl.YarnClientImpl: Submitted application application_1418652929537_0001
14/12/15 22:35:43 INFO mapreduce.Job: The url to track the job: http://hadoop-Inspiron-3521:8088/proxy/application_1418652929537_0001/
14/12/15 22:35:43 INFO mapreduce.Job: Running job: job_1418652929537_0001
14/12/15 22:35:54 INFO mapreduce.Job: Job job_1418652929537_0001 running in uber mode : false
14/12/15 22:35:54 INFO mapreduce.Job:  map 0% reduce 0%
14/12/15 22:36:04 INFO mapreduce.Job:  map 50% reduce 0%
14/12/15 22:36:05 INFO mapreduce.Job:  map 100% reduce 0%
14/12/15 22:36:16 INFO mapreduce.Job:  map 100% reduce 100%
14/12/15 22:36:17 INFO mapreduce.Job: Job job_1418652929537_0001 completed successfully
14/12/15 22:36:17 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=3448
FILE: Number of bytes written=299665
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=2574
HDFS: Number of bytes written=1478
HDFS: Number of read operations=9
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=2 // one map task per input file
Launched reduce tasks=1
Data-local map tasks=2 // both map tasks read their data from the local node
Total time spent by all maps in occupied slots (ms)=17425
Total time spent by all reduces in occupied slots (ms)=8472
Total time spent by all map tasks (ms)=17425
Total time spent by all reduce tasks (ms)=8472
Total vcore-seconds taken by all map tasks=17425
Total vcore-seconds taken by all reduce tasks=8472
Total megabyte-seconds taken by all map tasks=17843200
Total megabyte-seconds taken by all reduce tasks=8675328
Map-Reduce Framework
Map input records=90 // the two input files contain 90 lines in total
Map output records=251 // the map phase emitted 251 records, i.e. roughly 3 words per line (251/90)
Map output bytes=2940
Map output materialized bytes=3454
Input split bytes=263
Combine input records=0
Combine output records=0
Reduce input groups=138
Reduce shuffle bytes=3454
Reduce input records=251
Reduce output records=138
Spilled Records=502
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=274
CPU time spent (ms)=3740
Physical memory (bytes) snapshot=694566912
Virtual memory (bytes) snapshot=3079643136
Total committed heap usage (bytes)=513277952
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=2311   // total size of the two input files
File Output Format Counters
Bytes Written=1478 // size of the output file part-r-00000

 

Mapper only, no Reducer

Set the Combiner to the Reducer implementation and, at the same time, set numReduceTasks to 0. Then only the mapper produces output, written to a file named part-m-00000 (a driver sketch follows the observation below). The result shows:

1. The output is neither sorted nor merged for identical keys; in other words, the Combiner has no effect.
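
A minimal driver sketch that reproduces this map-only setup, following the standard WordCount example that ships with Hadoop. The class name WordCountMapOnly and the use of command-line arguments for the input/output paths are illustrative choices, not taken from the original jar:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountMapOnly {

  // Emits (word, 1) for every token in a line.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Sums the counts for one word; used here only as the combiner.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count map-only");
    job.setJarByClass(WordCountMapOnly.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // combiner set to the reducer implementation
    job.setNumReduceTasks(0);                  // map-only: output file is part-m-00000
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

With zero reduce tasks the map output is written straight to HDFS, so the sort/shuffle phase never runs and the registered combiner is never invoked, which is exactly why part-m-00000 comes out unsorted and unmerged.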

Setting five Reducers

The output directory contains five files, part-r-00000 through part-r-00004.
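
To get the five-reducer run, the same driver only needs a real reduce phase and a reducer count of 5; a sketch of the relevant lines, reusing the IntSumReducer class from the sketch above:

// Replace the map-only settings in the driver above with:
job.setReducerClass(IntSumReducer.class);
job.setNumReduceTasks(5);   // one output file per reducer: part-r-00000 .. part-r-00004

Each reduce task writes its own part-r-NNNNN file, so a reducer that happens to receive no keys still produces an (empty) file.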

Setting the block size

When setting the block size, pay attention to the following two parameters; their constraints limit which block sizes can actually be configured:

1. The block size cannot be smaller than dfs.namenode.fs-limits.min-block-size, which defaults to 1048576.

2. The block size must be an integer multiple of io.bytes.per.checksum, which defaults to 512 bytes.

To change the block size to 512 bytes, add the following to hdfs-site.xml:

<property>
  <name>dfs.block.size</name>
  <!--<value>67108864</value>-->
  <value>512</value>
  <description>The default block size for new files.</description>
</property>
<property>
  <name>dfs.namenode.fs-limits.min-block-size</name>
  <!--<value>67108864</value>-->
  <value>256</value>
  <description>The minimum block size.</description>
</property>
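
Independently of the cluster-wide setting above, the block size can also be chosen per file at write time through the HDFS Java API. A minimal sketch, assuming an illustrative path and a 512-byte block size; the value must still satisfy the two constraints listed above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteWithBlockSize {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Per-file block size: must be >= dfs.namenode.fs-limits.min-block-size
    // and a multiple of the checksum chunk size.
    long blockSize = 512L;
    int bufferSize = conf.getInt("io.file.buffer.size", 4096);
    short replication = fs.getDefaultReplication(new Path("/"));

    try (FSDataOutputStream out =
             fs.create(new Path("/users/hadoop/hello/tiny.txt"), // illustrative path
                       true, bufferSize, replication, blockSize)) {
      out.writeBytes("hello hadoop\n");
    }
  }
}

Note that in Hadoop 2.x the current property name is dfs.blocksize; dfs.block.size is the deprecated alias and is still honored. Changing the configuration only affects files written afterwards; existing files keep the block size they were written with.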
 

Pro Apache Hadoop (p. 13):

A map task can run on any compute node in the cluster, and multiple map tasks can run in parallel across the cluster. The map task is responsible for transforming the input records into key/value pairs. The output of all the maps will be partitioned, and each partition will be sorted. There will be one partition for each reduce task. Each partition’s sorted keys and the values associated with the keys are then processed by the reduce task. There can be multiple reduce tasks running in parallel on the cluster.

The key to receiving a list of values for a key in the reduce phase is a phase known as the sort/shuffle phase in MapReduce. All the key/value pairs emitted by the Mapper are sorted by the key in the Reducer. If multiple Reducers are allocated, a subset of keys will be allocated to each Reducer. The key/value pairs for a given Reducer are sorted by key, which ensures that all the values associated with one key are received by the Reducer together.
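
The key-to-reducer assignment mentioned here is performed by the job's Partitioner. By default Hadoop uses HashPartitioner; the sketch below writes the same logic out explicitly as a custom partitioner for the WordCount key/value types, purely to make the mechanism visible:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Same key -> same partition -> same reducer, which is what guarantees that
// all values for one word arrive at a single reduce() call.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    // This mirrors HashPartitioner: mask off the sign bit, then modulo the reducer count.
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}

It would be registered with job.setPartitionerClass(WordPartitioner.class); omitting it gives identical behaviour through the built-in HashPartitioner. Because the partition depends only on the key, every (word, 1) pair for the same word lands on the same reducer, which is what lets reduce() see the complete list of values for that word.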

 

Word Count MapReduce process

[Figure omitted: diagram of the Word Count MapReduce data flow]

p. 24:

Some of the metadata stored by the NameNode includes these:
· File/directory name and its location relative to the parent directory.
· File and directory ownership and permissions.
· File name of individual blocks. Each block is stored as a file in the local file system of the DataNode in the directory that can be configured by the Hadoop system administrator.

How to inspect the data blocks on HDFS

If a file is smaller than one block, it still occupies one block (whose metadata is recorded in the NameNode), but the actual size of that block on disk is just the size of the file.

 

p. 19:

The NameNode file that contains the metadata is fsimage. Any changes to the metadata during the system operation are stored in memory and persisted to another file called edits. Periodically, the edits file is merged with the fsimage file by the Secondary NameNode.

 

The following command can be used to inspect the block layout of files on HDFS:

hdfs fsck / -files -blocks -locations | grep /users/hadoop/wordcount -A 30
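
The same block information can also be read programmatically through the FileSystem API; a small sketch, with an illustrative output path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Illustrative path; point this at any HDFS file you want to inspect.
    FileStatus status = fs.getFileStatus(new Path("/users/hadoop/output5/part-r-00000"));
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.printf("offset=%d length=%d hosts=%s%n",
          block.getOffset(), block.getLength(), String.join(",", block.getHosts()));
    }
  }
}

With the 512-byte block size configured above, a file of a few KB will list several blocks here, while a file smaller than one block shows a single block whose length equals the file size, matching the note above.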
