Following the previous post, where we used MapReduce to compute averages and deduplicate data, this post looks at how MapReduce handles sorting and single-table joins.
As mentioned in the first post of this MapReduce series, MapReduce is not just a distributed computation model; it is also a new way of thinking about problems. A task that looks like one continuous pipeline is cut in two: one end is Map, the other is Reduce. Map is responsible for splitting the work, Reduce for combining the results.
1. MapReduce Sorting
Problem model:
Several data files are given as input, for example:
sortfile1.txt
11
13
15
17
19
21
23
25
27
29
sortfile2.txt
10
12
14
16
18
20
22
24
26
28
30
sortfile3.txt
1
2
3
4
5
6
7
8
9
10
The goal is to sort everything into a single result file of the following format (rank first, then value):
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 10
……
Now that we have the problem, how do we sort, how do we implement it, and what questions remain?
1. We know MapReduce itself has a built-in sort; can it be used directly?
2. If we use the default sorting, how do we invoke it, and how does it behave differently for int keys versus String keys?
3. How do we ensure that, with three (or even more) input files, the final result is globally ordered?
With the requirements and questions laid out, let's address them one by one. MapReduce does have its own sorting mechanism, and we will not bypass it, but we cannot rely on the built-in behavior alone. MapReduce sorts by key: if the key is an int, records are ordered by numeric value; if the key is a String, they are ordered lexicographically. To guarantee a globally ordered result, we define our own Partition class, which acts as a global dispatcher so that the data handed to the reducers after the map phase is ordered across partitions. Concretely, we divide the maximum input value by the number of partitions and use that quotient as the boundary increment: the partition boundaries are 1x, 2x, up to (numPartitions-1)x of the quotient, which guarantees the partitioned data is ordered as a whole. On the Reduce side, each key is then written out as the value as many times as there are elements in its value-list, while the output key is a global counter that records the current rank.
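To make the boundary arithmetic concrete, here is a tiny stand-alone sketch (not part of the job itself; the maximum value 65223 is the constant the Partitioner below hard-codes, while the 4 partitions are an assumed example):

public class BoundarySketch {
    public static void main(String[] args) {
        int maxNumber = 65223;   // same maximum value the Partitioner below assumes
        int numPartitions = 4;   // assumed reducer count, for illustration only
        int bound = maxNumber / numPartitions + 1;   // 16306
        for (int i = 0; i < numPartitions; i++) {
            // partition i receives keys in [bound * i, bound * (i + 1))
            System.out.println("partition " + i + ": [" + (bound * i) + ", " + (bound * (i + 1)) + ")");
        }
    }
}

Every key sent to partition i is smaller than every key sent to partition i+1, so concatenating the reducer outputs in partition order yields a globally sorted result.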
The complete code is as follows:
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class Sort {
    // The map converts the input value to IntWritable and uses it as the output key
    public static class Map extends Mapper<Object, Text, IntWritable, IntWritable> {
        private static IntWritable data = new IntWritable();
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            System.out.println("line:" + line);
            try {
                data.set(Integer.parseInt(line));
            } catch (Exception e) {
                // blank or malformed lines (e.g. a trailing newline) would otherwise abort the task
                data.set(1000);
            }
            System.out.println("Map key:" + data.toString());
            context.write(data, new IntWritable(1));
        }
    }

    // The reduce copies the input key to the output value and emits it once for every element
    // in the value-list; the global counter linenum supplies the rank used as the output key
    public static class Reduce extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
        private static IntWritable linenum = new IntWritable(1);
        public void reduce(IntWritable key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            for (IntWritable val : values) {
                context.write(linenum, key);
                System.out.println("Reduce key:" + linenum + "\tReduce value:" + key);
                linenum = new IntWritable(linenum.get() + 1);
            }
        }
    }

    // Custom Partitioner: derives block boundaries from the maximum input value and the number of
    // partitions, then returns the partition ID whose range the given key falls into
    public static class Partition extends Partitioner<IntWritable, IntWritable> {
        @Override
        public int getPartition(IntWritable key, IntWritable value, int numPartitions) {
            int Maxnumber = 65223;
            int bound = Maxnumber / numPartitions + 1;
            int keynumber = key.get();
            for (int i = 1; i <= numPartitions; i++) {
                System.out.println("numPartitions:" + numPartitions);
                if (keynumber < bound * i && keynumber >= bound * (i - 1))
                    return i - 1;
            }
            return 0;
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: sort <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "Sort");
        job.setJarByClass(Sort.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setPartitionerClass(Partition.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Notes:
1. Be careful when creating your own test data. For example, sortfile1.txt contains 10 lines; if a trailing newline leaves an empty 11th line, the map phase will throw a format-conversion exception, which is why the code above wraps the parse in try/catch.
2. To make the MapReduce and Partition steps easier to follow, print statements are used to trace each stage of execution.
3. The final statement of getPartition in the Partition class should be "return 0"; the "bible" 《Hadoop实战2》 writes "return -1", which practice shows is wrong.
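One more observation: the local run below uses a single reducer (every "numPartitions:" line in the log prints 1), so the custom Partitioner never actually has to split the key space. To see it take effect on a cluster, the number of reduce tasks has to be raised explicitly; a minimal sketch of the extra line one would add in main(), with 3 as an assumed example value:

// In main(), after the Job is created and before waitForCompletion():
job.setNumReduceTasks(3);   // assumed example; any value > 1 makes the Partitioner split the keys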
When the program runs, it prints the following:
15/01/28 21:19:28 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
15/01/28 21:19:28 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
15/01/28 21:19:28 INFO input.FileInputFormat: Total input paths to process : 3
15/01/28 21:19:29 INFO mapred.JobClient: Running job: job_local_0001
15/01/28 21:19:29 INFO input.FileInputFormat: Total input paths to process : 3
15/01/28 21:19:29 INFO mapred.MapTask: io.sort.mb = 100
15/01/28 21:19:29 INFO mapred.MapTask: data buffer = 79691776/99614720
15/01/28 21:19:29 INFO mapred.MapTask: record buffer = 262144/327680
line:11
15/01/28 21:19:29 INFO mapred.MapTask: Starting flush of map output
Map key:11
numPartitions:1
line:13
Map key:13
numPartitions:1
line:15
Map key:15
numPartitions:1
line:17
Map key:17
numPartitions:1
line:19
Map key:19
numPartitions:1
line:21
Map key:21
numPartitions:1
line:23
Map key:23
numPartitions:1
line:25
Map key:25
numPartitions:1
line:27
Map key:27
numPartitions:1
line:29
Map key:29
numPartitions:1
15/01/28 21:19:29 INFO mapred.MapTask: Finished spill 0
15/01/28 21:19:29 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
15/01/28 21:19:29 INFO mapred.LocalJobRunner:
15/01/28 21:19:29 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000000_0' done.
15/01/28 21:19:29 INFO mapred.MapTask: io.sort.mb = 100
15/01/28 21:19:29 INFO mapred.MapTask: data buffer = 79691776/99614720
15/01/28 21:19:29 INFO mapred.MapTask: record buffer = 262144/327680
line:10
Map key:10
numPartitions:1
line:12
Map key:12
numPartitions:1
line:14
Map key:14
numPartitions:1
line:16
Map key:16
numPartitions:1
line:18
Map key:18
numPartitions:1
line:20
Map key:20
numPartitions:1
line:22
Map key:22
numPartitions:1
line:24
Map key:24
numPartitions:1
line:26
Map key:26
numPartitions:1
line:28
Map key:28
numPartitions:1
line:30
Map key:30
numPartitions:1
15/01/28 21:19:29 INFO mapred.MapTask: Starting flush of map output
15/01/28 21:19:29 INFO mapred.MapTask: Finished spill 0
15/01/28 21:19:29 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000001_0 is done. And is in the process of commiting
15/01/28 21:19:29 INFO mapred.LocalJobRunner:
15/01/28 21:19:29 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000001_0' done.
15/01/28 21:19:29 INFO mapred.MapTask: io.sort.mb = 100
15/01/28 21:19:30 INFO mapred.JobClient: map 100% reduce 0%
15/01/28 21:19:30 INFO mapred.MapTask: data buffer = 79691776/99614720
15/01/28 21:19:30 INFO mapred.MapTask: record buffer = 262144/327680
line:1
Map key:1
numPartitions:1
line:2
Map key:2
numPartitions:1
line:3
Map key:3
numPartitions:1
line:4
Map key:4
numPartitions:1
line:5
Map key:5
numPartitions:1
line:6
Map key:6
numPartitions:1
line:7
Map key:7
numPartitions:1
line:8
Map key:8
numPartitions:1
line:9
Map key:9
numPartitions:1
line:10
Map key:10
numPartitions:1
15/01/28 21:19:30 INFO mapred.MapTask: Starting flush of map output
15/01/28 21:19:30 INFO mapred.MapTask: Finished spill 0
15/01/28 21:19:30 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000002_0 is done. And is in the process of commiting
15/01/28 21:19:30 INFO mapred.LocalJobRunner:
15/01/28 21:19:30 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000002_0' done.
15/01/28 21:19:30 INFO mapred.LocalJobRunner:
15/01/28 21:19:30 INFO mapred.Merger: Merging 3 sorted segments
15/01/28 21:19:30 INFO mapred.Merger: Down to the last merge-pass, with 3 segments left of total size: 316 bytes
15/01/28 21:19:30 INFO mapred.LocalJobRunner:
Reduce key:1	Reduce value:1
Reduce key:2	Reduce value:2
Reduce key:3	Reduce value:3
Reduce key:4	Reduce value:4
Reduce key:5	Reduce value:5
Reduce key:6	Reduce value:6
Reduce key:7	Reduce value:7
Reduce key:8	Reduce value:8
Reduce key:9	Reduce value:9
Reduce key:10	Reduce value:10
Reduce key:11	Reduce value:10
Reduce key:12	Reduce value:11
Reduce key:13	Reduce value:12
Reduce key:14	Reduce value:13
Reduce key:15	Reduce value:14
Reduce key:16	Reduce value:15
Reduce key:17	Reduce value:16
Reduce key:18	Reduce value:17
Reduce key:19	Reduce value:18
Reduce key:20	Reduce value:19
Reduce key:21	Reduce value:20
Reduce key:22	Reduce value:21
Reduce key:23	Reduce value:22
Reduce key:24	Reduce value:23
Reduce key:25	Reduce value:24
Reduce key:26	Reduce value:25
Reduce key:27	Reduce value:26
Reduce key:28	Reduce value:27
Reduce key:29	Reduce value:28
Reduce key:30	Reduce value:29
Reduce key:31	Reduce value:30
15/01/28 21:19:30 INFO mapred.TaskRunner: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
15/01/28 21:19:30 INFO mapred.LocalJobRunner:
15/01/28 21:19:30 INFO mapred.TaskRunner: Task attempt_local_0001_r_000000_0 is allowed to commit now
15/01/28 21:19:30 INFO output.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to hdfs://hadoop:9000/usr/hadoop/output3
15/01/28 21:19:30 INFO mapred.LocalJobRunner: reduce > reduce
15/01/28 21:19:30 INFO mapred.TaskRunner: Task 'attempt_local_0001_r_000000_0' done.
15/01/28 21:19:31 INFO mapred.JobClient: map 100% reduce 100%
15/01/28 21:19:31 INFO mapred.JobClient: Job complete: job_local_0001
15/01/28 21:19:31 INFO mapred.JobClient: Counters: 14
15/01/28 21:19:31 INFO mapred.JobClient: FileSystemCounters
15/01/28 21:19:31 INFO mapred.JobClient: FILE_BYTES_READ=67220
15/01/28 21:19:31 INFO mapred.JobClient: HDFS_BYTES_READ=261
15/01/28 21:19:31 INFO mapred.JobClient: FILE_BYTES_WRITTEN=138115
15/01/28 21:19:31 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=168
15/01/28 21:19:31 INFO mapred.JobClient: Map-Reduce Framework
15/01/28 21:19:31 INFO mapred.JobClient: Reduce input groups=30
15/01/28 21:19:31 INFO mapred.JobClient: Combine output records=0
15/01/28 21:19:31 INFO mapred.JobClient: Map input records=31
15/01/28 21:19:31 INFO mapred.JobClient: Reduce shuffle bytes=0
15/01/28 21:19:31 INFO mapred.JobClient: Reduce output records=31
15/01/28 21:19:31 INFO mapred.JobClient: Spilled Records=62
15/01/28 21:19:31 INFO mapred.JobClient: Map output bytes=248
15/01/28 21:19:31 INFO mapred.JobClient: Combine input records=0
15/01/28 21:19:31 INFO mapred.JobClient: Map output records=31
15/01/28 21:19:31 INFO mapred.JobClient: Reduce input records=31
From the printed counters we can see:
Map output records=31, Reduce input records=31.
The Map phase runs first, reading the data line by line; the Partition step then routes each record to the partition whose value range it belongs to, so that what reaches the Reduce stage is already ordered across partitions; finally the Reduce phase assigns each value its rank and completes the global sort.
The final output file looks like this:
1	1
2	2
3	3
4	4
5	5
6	6
7	7
8	8
9	9
10	10
11	10
12	11
13	12
14	13
15	14
16	15
17	16
18	17
19	18
20	19
21	20
22	21
23	22
24	23
25	24
26	25
27	26
28	27
29	28
30	29
31	30
MapReduce sorting really is that easy: let every record in, write it to the context in the format the map specifies, let the Partitioner act as the global dispatcher and route each record to the right range, and finally let the reducer emit the globally sorted result.
2. MapReduce Single-Table Join
Problem model: several input files are given as follows. Each line is a child-parent pair; the sample data uses Chinese kinship terms (e.g. 爸爸 = father, 妈妈 = mother, 爷爷 = grandfather, 大儿子 = eldest son).
table1.txt
大儿子 爸爸
小儿子 爸爸
大女儿 爸爸
小女儿 爸爸
爸爸 爷爷
爸爸 二大爷
爸爸 三大爷
table2.txt
二女儿 妈妈
二儿子 妈妈
妈妈 爷爷
妈妈 二大爷
妈妈 三大爷
The final result should look like this (grandchild, then grandparent):
grandchild	grandparent
二女儿	爷爷
二女儿	二大爷
二女儿	三大爷
二儿子	爷爷
二儿子	二大爷
……
Joining a table with another table, or with itself, in MapReduce does not come for free the way a left join or right join does in SQL. For this scenario we need a join in which the left table and the right table are the same table, and the join condition is the left table's parent column equalling the right table's child column; the whole job is therefore a self-join.
Our approach is as follows:
1. On the Map side, split each input line into the child and parent columns. Emit parent as the key and child as the value, and call that the left table; then, for the same pair, emit child as the key and parent as the value, and call that the right table.
2. To distinguish the left table from the right table, add a tag to the emitted value marking which table the record belongs to.
3. After the shuffle, the value-list of each key arriving at the Reduce side contains both grandchild and grandparent information. Parse the value-list of each key, put the child values from the left table into one array and the parent values from the right table into another, and take the Cartesian product of the two arrays; that product is the final result. (A concrete example of the tagged map output is shown right after this list.)
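For example, the input line 爸爸 爷爷 is emitted twice by the map (the same pairs appear in the run log further down):
left table:  key = 爷爷, value = 1+爸爸+爷爷
right table: key = 爸爸, value = 2+爸爸+爷爷
Under the key 爸爸 the reducer therefore collects both the tag-1 values holding 爸爸's children (e.g. 1+大儿子+爸爸) and the tag-2 values holding 爸爸's parents (e.g. 2+爸爸+爷爷), which is exactly the grandchild/grandparent pairing we want to cross.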
The code is as follows:
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class STjoin {
    public static int time = 0;

    // The map splits each input line into child and parent, then emits the pair twice:
    // once keyed by parent as the left table and once keyed by child as the right table.
    // The value carries a leading "1"/"2" tag so the reducer can tell the two tables apart.
    public static class Map extends Mapper<Object, Text, Text, Text> {
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            String childname = new String();
            String parentname = new String();
            String relationtype = new String();
            String line = value.toString();
            int i = 0;
            while (line.charAt(i) != ' ') {
                i++;
            }
            String[] values = {line.substring(0, i), line.substring(i + 1)};
            if (values[0].compareTo("child") != 0) {  // skip a possible header line
                childname = values[0];
                parentname = values[1];
                relationtype = "1";  // left/right table tag
                context.write(new Text(values[1]), new Text(relationtype + "+" + childname + "+" + parentname));
                System.out.println("left table Map key:" + values[1] + "\tvalue:" + relationtype + "+" + childname + "+" + parentname);
                // left table
                relationtype = "2";
                context.write(new Text(values[0]), new Text(relationtype + "+" + childname + "+" + parentname));
                System.out.println("right table Map key:" + values[0] + "\tvalue:" + relationtype + "+" + childname + "+" + parentname);
                // right table
            }
        }
    }

    public static class Reduce extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            if (time == 0) {  // emit the header row once
                context.write(new Text("grandchild"), new Text("grandparent"));
                time++;
            }
            int grandchildnum = 0;
            String grandchild[] = new String[10];
            int grandparentnum = 0;
            String grandparent[] = new String[10];
            Iterator<Text> ite = values.iterator();
            while (ite.hasNext()) {
                String record = ite.next().toString();
                int len = record.length();
                int i = 2;
                if (len == 0) continue;
                char relationtype = record.charAt(0);
                String childname = new String();
                String parentname = new String();
                // extract the child part of the value
                while (record.charAt(i) != '+') {
                    childname = childname + record.charAt(i);
                    i++;
                }
                i = i + 1;
                // extract the parent part of the value
                while (i < len) {
                    parentname = parentname + record.charAt(i);
                    i++;
                }
                // left table: collect the child into grandchild
                if (relationtype == '1') {
                    grandchild[grandchildnum] = childname;
                    grandchildnum++;
                } else {  // right table: collect the parent into grandparent
                    grandparent[grandparentnum] = parentname;
                    grandparentnum++;
                }
            }
            // Cartesian product of the grandchild and grandparent arrays
            if (grandparentnum != 0 && grandchildnum != 0) {
                for (int m = 0; m < grandchildnum; m++) {
                    for (int n = 0; n < grandparentnum; n++) {
                        context.write(new Text(grandchild[m]), new Text(grandparent[n]));  // write one result pair
                        System.out.println("Reduce grandchild:" + grandchild[m] + "\tgrandparent:" + grandparent[n]);
                    }
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: stjoin <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "single table join");
        job.setJarByClass(STjoin.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
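One small aside on style before looking at the run: the manual character-scanning in map and reduce works, but the same tagging and untagging could equally be written with String.split. A minimal stand-alone sketch (not the original code; the class name and sample strings are illustrative only):

public class SplitSketch {
    public static void main(String[] args) {
        String line = "爸爸 爷爷";                    // one input record: "child parent"
        String[] fields = line.split(" ");
        String leftValue  = "1" + "+" + fields[0] + "+" + fields[1];  // emitted under key fields[1] (the parent)
        String rightValue = "2" + "+" + fields[0] + "+" + fields[1];  // emitted under key fields[0] (the child)
        System.out.println(leftValue + " / " + rightValue);

        // reduce side: unpack one value from the value-list in a single call
        String[] parts = leftValue.split("\\+");      // parts[0] = tag, parts[1] = child, parts[2] = parent
        System.out.println("tag=" + parts[0] + ", child=" + parts[1] + ", parent=" + parts[2]);
    }
}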
The code above is fairly self-explanatory; to make each MapReduce step visible, print statements were again added. The execution log is as follows:
15/01/28 22:06:28 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
15/01/28 22:06:28 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
15/01/28 22:06:28 INFO input.FileInputFormat: Total input paths to process : 2
15/01/28 22:06:28 INFO mapred.JobClient: Running job: job_local_0001
15/01/28 22:06:28 INFO input.FileInputFormat: Total input paths to process : 2
15/01/28 22:06:28 INFO mapred.MapTask: io.sort.mb = 100
15/01/28 22:06:28 INFO mapred.MapTask: data buffer = 79691776/99614720
15/01/28 22:06:28 INFO mapred.MapTask: record buffer = 262144/327680
left table Map key:爸爸	value:1+大儿子+爸爸
right table Map key:大儿子	value:2+大儿子+爸爸
left table Map key:爸爸	value:1+小儿子+爸爸
right table Map key:小儿子	value:2+小儿子+爸爸
left table Map key:爸爸	value:1+大女儿+爸爸
right table Map key:大女儿	value:2+大女儿+爸爸
left table Map key:爸爸	value:1+小女儿+爸爸
right table Map key:小女儿	value:2+小女儿+爸爸
left table Map key:爷爷	value:1+爸爸+爷爷
right table Map key:爸爸	value:2+爸爸+爷爷
left table Map key:二大爷	value:1+爸爸+二大爷
right table Map key:爸爸	value:2+爸爸+二大爷
left table Map key:三大爷	value:1+爸爸+三大爷
right table Map key:爸爸	value:2+爸爸+三大爷
15/01/28 22:06:28 INFO mapred.MapTask: Starting flush of map output
15/01/28 22:06:28 INFO mapred.MapTask: Finished spill 0
15/01/28 22:06:28 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
15/01/28 22:06:28 INFO mapred.LocalJobRunner:
15/01/28 22:06:28 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000000_0' done.
15/01/28 22:06:28 INFO mapred.MapTask: io.sort.mb = 100
15/01/28 22:06:28 INFO mapred.MapTask: data buffer = 79691776/99614720
15/01/28 22:06:28 INFO mapred.MapTask: record buffer = 262144/327680
left table Map key:妈妈	value:1+二女儿+妈妈
right table Map key:二女儿	value:2+二女儿+妈妈
left table Map key:妈妈	value:1+二儿子+妈妈
right table Map key:二儿子	value:2+二儿子+妈妈
left table Map key:爷爷	value:1+妈妈+爷爷
right table Map key:妈妈	value:2+妈妈+爷爷
left table Map key:二大爷	value:1+妈妈+二大爷
right table Map key:妈妈	value:2+妈妈+二大爷
left table Map key:三大爷	value:1+妈妈+三大爷
right table Map key:妈妈	value:2+妈妈+三大爷
15/01/28 22:06:28 INFO mapred.MapTask: Starting flush of map output
15/01/28 22:06:28 INFO mapred.MapTask: Finished spill 0
15/01/28 22:06:28 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000001_0 is done. And is in the process of commiting
15/01/28 22:06:28 INFO mapred.LocalJobRunner:
15/01/28 22:06:28 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000001_0' done.
15/01/28 22:06:28 INFO mapred.LocalJobRunner:
15/01/28 22:06:28 INFO mapred.Merger: Merging 2 sorted segments
15/01/28 22:06:28 INFO mapred.Merger: Down to the last merge-pass, with 2 segments left of total size: 697 bytes
15/01/28 22:06:28 INFO mapred.LocalJobRunner:
Reduce grandchild:二女儿	grandparent:爷爷
Reduce grandchild:二女儿	grandparent:二大爷
Reduce grandchild:二女儿	grandparent:三大爷
Reduce grandchild:二儿子	grandparent:爷爷
Reduce grandchild:二儿子	grandparent:二大爷
Reduce grandchild:二儿子	grandparent:三大爷
Reduce grandchild:大儿子	grandparent:爷爷
Reduce grandchild:大儿子	grandparent:二大爷
Reduce grandchild:大儿子	grandparent:三大爷
Reduce grandchild:小儿子	grandparent:爷爷
Reduce grandchild:小儿子	grandparent:二大爷
Reduce grandchild:小儿子	grandparent:三大爷
Reduce grandchild:大女儿	grandparent:爷爷
Reduce grandchild:大女儿	grandparent:二大爷
Reduce grandchild:大女儿	grandparent:三大爷
Reduce grandchild:小女儿	grandparent:爷爷
Reduce grandchild:小女儿	grandparent:二大爷
Reduce grandchild:小女儿	grandparent:三大爷
15/01/28 22:06:28 INFO mapred.TaskRunner: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
15/01/28 22:06:28 INFO mapred.LocalJobRunner:
15/01/28 22:06:28 INFO mapred.TaskRunner: Task attempt_local_0001_r_000000_0 is allowed to commit now
15/01/28 22:06:28 INFO output.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to hdfs://hadoop:9000/usr/hadoop/output4
15/01/28 22:06:28 INFO mapred.LocalJobRunner: reduce > reduce
15/01/28 22:06:28 INFO mapred.TaskRunner: Task 'attempt_local_0001_r_000000_0' done.
15/01/28 22:06:29 INFO mapred.JobClient: map 100% reduce 100%
15/01/28 22:06:29 INFO mapred.JobClient: Job complete: job_local_0001
15/01/28 22:06:29 INFO mapred.JobClient: Counters: 14
15/01/28 22:06:29 INFO mapred.JobClient: FileSystemCounters
15/01/28 22:06:29 INFO mapred.JobClient: FILE_BYTES_READ=50580
15/01/28 22:06:29 INFO mapred.JobClient: HDFS_BYTES_READ=515
15/01/28 22:06:29 INFO mapred.JobClient: FILE_BYTES_WRITTEN=103312
15/01/28 22:06:29 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=369
15/01/28 22:06:29 INFO mapred.JobClient: Map-Reduce Framework
15/01/28 22:06:29 INFO mapred.JobClient: Reduce input groups=12
15/01/28 22:06:29 INFO mapred.JobClient: Combine output records=0
15/01/28 22:06:29 INFO mapred.JobClient: Map input records=12
15/01/28 22:06:29 INFO mapred.JobClient: Reduce shuffle bytes=0
15/01/28 22:06:29 INFO mapred.JobClient: Reduce output records=19
15/01/28 22:06:29 INFO mapred.JobClient: Spilled Records=48
15/01/28 22:06:29 INFO mapred.JobClient: Map output bytes=645
15/01/28 22:06:29 INFO mapred.JobClient: Combine input records=0
15/01/28 22:06:29 INFO mapred.JobClient: Map output records=24
15/01/28 22:06:29 INFO mapred.JobClient: Reduce input records=24
The final output file contains exactly what the reducer printed:
grandchild	grandparent
二女儿	爷爷
二女儿	二大爷
二女儿	三大爷
二儿子	爷爷
二儿子	二大爷
二儿子	三大爷
大儿子	爷爷
大儿子	二大爷
大儿子	三大爷
小儿子	爷爷
小儿子	二大爷
小儿子	三大爷
大女儿	爷爷
大女儿	二大爷
大女儿	三大爷
小女儿	爷爷
小女儿	二大爷
小女儿	三大爷
If you found this useful, please give it a like, and you're welcome to join the big data QQ group 413471695 for technical discussion ^_^
Link to this post: 《Hadoop阅读笔记(三)——深入MapReduce排序和单表连接》 (Hadoop Reading Notes (3): A Deeper Look at MapReduce Sorting and Single-Table Joins)