005_MapReduce in Hadoop Explained, Part 2

hege · posted on 2016-12-11 06:44:44

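　　The previous post was an introduction to MapReduce and walked through the MapReduce flow using one of the examples that ships with Hadoop. Now let's write a small example of our own for practice.
　　Problem: we have a file with the following contents: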
　　黄晓明 89
　　刘杰 48
　　黄晓明 78
　　郑爽 90
　　……
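　　What is each student's average score?
　　Analysis:
　　1. In the Map phase the input is read line by line, producing records of the form <line number, line content> (with the default TextInputFormat the key is actually the byte offset of the line rather than a line number, but the key is not used here):
　　<1,黄晓明 89>
　　<2,刘杰 48>
　　<3,黄晓明 78>
　　and so on.
　　2. These records are passed to the map function. We want to take full advantage of the shuffle that sits between Map and Reduce (Map ---> shuffle ---> Reduce): it brings records with the same key together and turns their values into a list.
　　3. So the Map output (that is, the Reduce input) is built with the student's name as the key:
　　<黄晓明,89>
　　<刘杰,48>
　　<黄晓明,78>
　　and so on.
　　4. For each key, iterate over its values and add them up to get the total score, then divide by the number of scores (the number of subjects) to get the final result.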
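　　The four steps above can be traced without a cluster. The following is a minimal plain-Java sketch, not Hadoop code: the class name AverageFlowSketch and its methods are made up for illustration, the names in the data are ASCII stand-ins for the names in the sample file, and a TreeMap plays the role of the shuffle.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory model of map -> shuffle -> reduce (no Hadoop involved).
final class AverageFlowSketch {

    // Map + shuffle: parse each "<name> <score>" line and group scores by name.
    // The TreeMap keeps keys sorted, much as a reducer sees its keys in sorted order.
    static Map<String, List<Integer>> shuffle(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            grouped.computeIfAbsent(parts[0], k -> new ArrayList<>())
                   .add(Integer.parseInt(parts[1]));
        }
        return grouped;
    }

    // Reduce: total score divided by the number of scores (integer division).
    static int average(List<Integer> scores) {
        int sum = 0;
        for (int s : scores) {
            sum += s;
        }
        return sum / scores.size();
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "huangxiaoming 89", "liujie 48", "huangxiaoming 78", "zhengshuang 90");
        for (Map.Entry<String, List<Integer>> e : shuffle(lines).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue() + "\t" + average(e.getValue()));
        }
    }
}
```

　　Because the division is done on integers, 89 and 78 average to 83 rather than 83.5.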
　　Code implementation:

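// Extend Mapper and override map(). Nested inside the driver class TestMR.
// Imports: java.io.IOException; org.apache.hadoop.io.{IntWritable, LongWritable, Text};
//          org.apache.hadoop.mapreduce.Mapper
public static class MyMap extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Each input line looks like "<name> <score>"; split on any run of whitespace.
        String[] arr = value.toString().trim().split("\\s+");
        if (arr.length < 2) {
            return; // skip blank or malformed lines
        }
        Text name = new Text(arr[0]);
        IntWritable score = new IntWritable(Integer.parseInt(arr[1]));
        context.write(name, score); // emit <name, score>
    }
}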
　　The data a Mapper processes has already been divided up by the InputFormat, whose job is to cut the data set into smaller InputSplits; each InputSplit is processed by one map task.
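　　To make the relationship between splits and map tasks concrete, here is a simplified plain-Java model. It is only a sketch under stated assumptions: the class SplitSketch is hypothetical, the file is cut into fixed-size byte ranges, and the details a real InputFormat considers (block size, minimum and maximum split settings, record boundaries) are ignored.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model: cut one file into fixed-size byte ranges, one per map task.
final class SplitSketch {

    // Returns {start, length} pairs that cover a file of fileLength bytes
    // in chunks of at most splitSize bytes.
    static List<long[]> splits(long fileLength, long splitSize) {
        List<long[]> result = new ArrayList<>();
        for (long start = 0; start < fileLength; start += splitSize) {
            result.add(new long[] {start, Math.min(splitSize, fileLength - start)});
        }
        return result;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Assumed sizes for illustration: a 300 MB file and a 128 MB split size.
        List<long[]> result = splits(300 * mb, 128 * mb);
        System.out.println("map tasks: " + result.size());
        for (long[] s : result) {
            System.out.println("start=" + s[0] + " length=" + s[1]);
        }
    }
}
```

　　In this model the 300 MB file yields three ranges (128 MB, 128 MB and 44 MB), so three map tasks would be started for it.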

// Extend Reducer and override reduce(). Nested inside the driver class TestMR.
// Imports: java.io.IOException; org.apache.hadoop.io.{IntWritable, Text};
//          org.apache.hadoop.mapreduce.Reducer
public static class MyReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text name, Iterable<IntWritable> scores, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        int count = 0;
        for (IntWritable score : scores) {
            sum += score.get();
            count++;
        }
        // Integer division: the average is truncated (e.g. 167 / 2 = 83).
        int ave = sum / count;
        context.write(name, new IntWritable(ave));
    }
}
　　Driver program

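// Imports: org.apache.hadoop.conf.Configuration; org.apache.hadoop.fs.Path;
//          org.apache.hadoop.io.{IntWritable, Text}; org.apache.hadoop.mapreduce.Job;
//          org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
//          org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "average"); // new Job(conf, name) is deprecated in Hadoop 2.x
    job.setJarByClass(TestMR.class);
    job.setMapperClass(MyMap.class);
    // No combiner: MyReduce emits averages, and an average of partial averages is
    // generally not the overall average, so it must not be reused as a combiner.
    job.setReducerClass(MyReduce.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path("/user/text.txt")); // Map input
    FileOutputFormat.setOutputPath(job, new Path("/user/helloMR/success")); // Reduce output
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}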
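　　A note on combiners for this job: a reducer that emits averages is generally not safe to reuse as a combiner, because the reduce side would then average partial averages. The plain-Java sketch below shows the difference with made-up scores split across two map tasks. The class CombinerSketch and its methods are hypothetical; in a real job the (sum, count) pair would have to travel in some value type that carries both numbers, which is not shown here.

```java
// Hypothetical demonstration: average of averages vs. merging (sum, count) pairs.
final class CombinerSketch {

    // What an averaging reducer does when it is (wrongly) reused as a combiner.
    static int average(int[] values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    // A safe combiner output: the partial sum and the number of scores behind it.
    static int[] partial(int[] scores) {
        int sum = 0;
        for (int s : scores) {
            sum += s;
        }
        return new int[] {sum, scores.length};
    }

    // Reduce side: add up the partial sums and counts, then divide once.
    static int mergeAndAverage(int[]... partials) {
        int sum = 0;
        int count = 0;
        for (int[] p : partials) {
            sum += p[0];
            count += p[1];
        }
        return sum / count;
    }

    public static void main(String[] args) {
        int[] taskA = {90, 80, 70}; // one student's scores seen by map task A
        int[] taskB = {50};         // the same student's scores seen by map task B
        int wrong = average(new int[] {average(taskA), average(taskB)});
        int right = mergeAndAverage(partial(taskA), partial(taskB));
        System.out.println("average of averages: " + wrong); // 65
        System.out.println("sum/count merge:     " + right); // 72
    }
}
```

　　The true total is 290 over 4 scores, which truncates to 72; averaging the two partial averages (80 and 50) gives 65 instead.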
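　　I have uploaded the source code.
　　A very simple example, isn't it?
　　In the code above we overrode the map method of Mapper and the reduce method of Reducer. There are other methods that can be overridden as well.

Since we have noticed them, here is a brief description of the other three methods: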
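　　1. setup: called once when the task starts, before any call to map or reduce.
　　2. cleanup: called once before the task is torn down, after all records have been processed.
　　3. run: the entry method of the Mapper or Reducer class. It calls setup first, then calls map once for each input key/value pair (or reduce once for each key), and finally calls cleanup once before the task ends.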
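　　The call order described above can be sketched in plain Java. MiniMapper below is a hypothetical stand-in, not the Hadoop Mapper class; it only records which method was called and when, and running cleanup in a finally block is a choice made for this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the setup -> map (per record) -> cleanup lifecycle.
final class LifecycleSketch {

    static class MiniMapper {
        final List<String> calls = new ArrayList<>();

        void setup() {
            calls.add("setup");
        }

        void map(String record) {
            calls.add("map(" + record + ")");
        }

        void cleanup() {
            calls.add("cleanup");
        }

        // Same shape as run(): setup once, map once per record, cleanup once.
        void run(List<String> records) {
            setup();
            try {
                for (String record : records) {
                    map(record);
                }
            } finally {
                cleanup(); // in this sketch cleanup runs even if map throws
            }
        }
    }

    public static void main(String[] args) {
        MiniMapper mapper = new MiniMapper();
        mapper.run(Arrays.asList("a 1", "b 2"));
        System.out.println(mapper.calls);
    }
}
```

　　This is why setup is a natural place to open a resource or load a lookup table once per task, and cleanup the place to release it.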

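Performance tuning
　　Performance comes down to two things: time and space.
　　Run as fast as possible; use as little space as possible.
　　Tuning can start from the following points:
　　A. Use large input files
　　Hadoop is built for large files and handles small files poorly, so feed the job a small number of large files rather than a large number of small ones.
　　B. Compress files
　　Compressing the Map output both saves storage space and reduces the amount of data sent over the network.
　　C. Filter data
　　Filter the data before Map processes it. For example, when a huge table is equi-joined with a fairly small one, the keys of the small table can be used to check the records of the large table, so that rows with no possible match are dropped early. A Bloom filter is useful here; readers can study it on their own.
　　D. Adjust job properties
　　Tune the number of map tasks and reduce tasks until the job performs best.
　　E. Combine function
　　A Combine function merges map output locally, which greatly reduces network I/O.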
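　　For the filtering idea in item C, here is a minimal Bloom filter sketch in plain Java. It is an illustration only: the class BloomSketch, its hash scheme and the sample keys are assumptions made for this example, not a Hadoop API. A Bloom filter never reports a key it has seen as absent, but it may occasionally report an unseen key as present, so it can only be used to discard rows early, not to confirm a match.

```java
import java.util.BitSet;

// Hypothetical minimal Bloom filter: a bit array plus a few derived hash positions.
final class BloomSketch {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    BloomSketch(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    // Derive the i-th bit position from two hash values (double hashing).
    private int index(String key, int i) {
        int h1 = key.hashCode();
        int h2 = (h1 >>> 16) | 1; // forced odd so the step is never zero
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String key) {
        for (int i = 0; i < hashes; i++) {
            bits.set(index(key, i));
        }
    }

    // false means "definitely not added"; true means "possibly added".
    boolean mightContain(String key) {
        for (int i = 0; i < hashes; i++) {
            if (!bits.get(index(key, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Build the filter from the join keys of the small table.
        BloomSketch smallTableKeys = new BloomSketch(1024, 3);
        smallTableKeys.add("u1");
        smallTableKeys.add("u2");

        // Map side of the big table: keep only rows whose key might match.
        String[] bigTableRows = {"u1,order-1", "u9,order-2", "u2,order-3"};
        for (String row : bigTableRows) {
            String key = row.split(",")[0];
            if (smallTableKeys.mightContain(key)) {
                System.out.println("keep: " + row);
            }
        }
    }
}
```

　　In a join, the filter would be built once from the small table and consulted in the map step over the large table, so most non-matching rows never reach the shuffle.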
