Hadoop single-table join

23rew · posted 2014-6-18 08:58:06

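Well, honestly, the original article made this single-table processing more complicated than it needs to be.
4. Single-table join
The previous examples all performed simple processing on the data, laying the groundwork for further operations. The "single-table join" example asks you to find the data of interest within the given data; it mines the information contained in the raw data. Let's get into the example.
4.1 Example description
The example provides a child-parent table and asks for a grandchild-grandparent table as output.
The sample input is shown below.
file: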

Tom       Lucy
Tom       Jack
Jone       Lucy
Jone       Jack
Lucy       Mary
Lucy       Ben
Jack       Alice
Jack       Jesse
Terry       Alice
Terry       Jesse
Philip       Terry
Philip       Alma
Mark       Terry
Mark       Alma

Family tree:

Figure 4.2-1 Family tree

The sample output is shown below.

file:

Tom       Alice
Tom       Jesse
Jone      Alice
Jone      Jesse
Tom       Mary
Tom       Ben
Jone      Mary
Jone      Ben
Philip    Alice
Philip    Jesse
Mark      Alice
Mark      Jesse

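4.2 Design approach
Analyzing this example, it clearly calls for a single-table join: the parent column of the left table is joined with the child column of the right table, and the left and right tables are the same table.

Dropping the two joined columns from the join result gives exactly what we want: the grandchild-grandparent table. To solve this with MapReduce, first consider how to implement the self-join of the table, then how to set up the join column, and finally how to tidy up the result.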
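Before turning to MapReduce, the self-join itself can be sketched in plain Java. This is only an illustration of the relational idea described above, not the MapReduce job; the class name `SelfJoinSketch` and the method `grandPairs` are made up for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class for illustration; not part of the original post.
class SelfJoinSketch {

    // Joins the table with itself on left.parent == right.child, then keeps
    // only left.child (the grandchild) and right.parent (the grandparent).
    static List<String[]> grandPairs(String[][] table) {
        List<String[]> result = new ArrayList<String[]>();
        for (String[] left : table) {          // left row: {child, parent}
            for (String[] right : table) {     // right row: {child, parent}
                if (left[1].equals(right[0])) {
                    result.add(new String[] { left[0], right[1] });
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // A few rows taken from the sample input.
        String[][] table = {
            { "Tom", "Lucy" }, { "Tom", "Jack" },
            { "Lucy", "Mary" }, { "Lucy", "Ben" },
            { "Jack", "Alice" }, { "Jack", "Jesse" }
        };
        for (String[] pair : grandPairs(table)) {
            System.out.println(pair[0] + "\t" + pair[1]);
        }
    }
}
```

The nested loop compares every row with every other row, which is fine for a toy table but is exactly the work that MapReduce spreads across keys.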

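The MapReduce shuffle brings all values with the same key together, so if the map output key is set to the join column, equal values in that column are joined automatically. Tying this back to the analysis above:

The join is between the parent column of the left table and the child column of the right table, and both are the same table. So in the map phase, after each input line is split into child and parent, the mapper emits parent as the key and child as the value; this serves as the left table. It then emits the same pair again with child as the key and parent as the value; this serves as the right table. To tell the two tables apart in the output, a tag is added to each value, for example the character 1 at the start of the value string for the left table and the character 2 for the right table. The map output thus contains both the left and the right table, and the shuffle performs the join. The reducer receives the joined result: for each key, the value list holds the grandchildren (tagged 1) and the grandparents (tagged 2). For example, the key Lucy receives 1Tom, 1Jone, 2Mary and 2Ben. Parse each key's value list, put the children from the left table into one array and the parents from the right table into another, and the Cartesian product of the two arrays is the final result.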
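The tagging scheme above can be tried out without a cluster. The following is a minimal in-memory sketch, assuming a `TreeMap` stands in for the shuffle; it uses no Hadoop classes, and the names `TaggedJoinSketch`, `mapAndShuffle`, `emit` and `reduce` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory simulation for illustration; no Hadoop classes are used.
class TaggedJoinSketch {

    // "Map" plus "shuffle": emit each row twice with a tag, grouped by key.
    static Map<String, List<String>> mapAndShuffle(String[][] rows) {
        Map<String, List<String>> groups = new TreeMap<String, List<String>>();
        for (String[] row : rows) {
            String child = row[0];
            String parent = row[1];
            emit(groups, parent, "1" + child);   // left table: key = parent
            emit(groups, child, "2" + parent);   // right table: key = child
        }
        return groups;
    }

    // Appends a value to the list stored under the given key.
    static void emit(Map<String, List<String>> groups, String key, String value) {
        List<String> list = groups.get(key);
        if (list == null) {
            list = new ArrayList<String>();
            groups.put(key, list);
        }
        list.add(value);
    }

    // "Reduce": split one key's values by tag and take the Cartesian product.
    static List<String> reduce(List<String> values) {
        List<String> grandchildren = new ArrayList<String>();
        List<String> grandparents = new ArrayList<String>();
        for (String v : values) {
            if (v.startsWith("1")) {
                grandchildren.add(v.substring(1));
            } else {
                grandparents.add(v.substring(1));
            }
        }
        List<String> out = new ArrayList<String>();
        for (String c : grandchildren) {
            for (String g : grandparents) {
                out.add(c + "\t" + g);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[][] rows = { { "Tom", "Lucy" }, { "Lucy", "Mary" }, { "Lucy", "Ben" } };
        for (Map.Entry<String, List<String>> e : mapAndShuffle(rows).entrySet()) {
            for (String line : reduce(e.getValue())) {
                System.out.println(line);
            }
        }
    }
}
```

A key that has only tag-1 values or only tag-2 values produces an empty product, which is why people at the top or bottom of the family tree contribute no output lines.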

4.3 Program code
The code is shown below. (I wrote it myself; the original was too much trouble, so I didn't read it.)

　　


package test;

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class STjoin {

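// Mapper: emits every child-parent row twice, once for each side of the self-join.
public static class Map extends Mapper<LongWritable, Text, Text, Text>{
      // Output objects are reused across records to avoid repeated allocation.
      private final Text child = new Text();
      private final Text parent = new Text();
      private final Text tempChild = new Text();
      private final Text tempParent = new Text();

      @Override
      protected void map(LongWritable key, Text value, Context context)
            throws java.io.IOException, InterruptedException {
         String[] splits = value.toString().trim().split("\\s+");
         // Skip blank or malformed lines.
         if(splits.length != 2){
            return;
         }
         child.set(splits[0]);
         parent.set(splits[1]);
         tempChild.set("1"+splits[0]);      // tag 1: left table, value is the child
         tempParent.set("2"+splits[1]);     // tag 2: right table, value is the parent
         context.write(parent, tempChild);  // left table: key = parent
         context.write(child, tempParent);  // right table: key = child
      }
}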
// Reducer: for each key (the middle generation), pairs the key's children with the key's parents.
public static class Reduce extends Reducer<Text, Text, Text, Text>{
      private final Text child = new Text();
      private final Text grand = new Text();

      @Override
      protected void reduce(Text key, Iterable<Text> values, Context context)
            throws java.io.IOException, InterruptedException {
         // Lists are local to each call, so nothing carries over between keys.
         List<String> childs = new ArrayList<String>();
         List<String> grands = new ArrayList<String>();
         // Tag 1 = grandchild (a child of key), tag 2 = grandparent (a parent of key).
         for (Text value : values) {
            String temp = value.toString();
            if (temp.startsWith("1"))
                  childs.add(temp.substring(1));
            else
                  grands.add(temp.substring(1));
         }
         // Cartesian product
         for (String c : childs) {
            for(String g : grands){
                  child.set(c);
                  grand.set(g);
                  context.write(child, grand);
            }
         }
      }
}

public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // Strip generic Hadoop options and keep only the job's own arguments.
      String[] otherArgs = new GenericOptionsParser(conf,args).getRemainingArgs();
      if(otherArgs.length != 2){
         System.err.println("Usage: STjoin <in> <out>");
         System.exit(2);
      }
      // This constructor is deprecated in Hadoop 2.x, where Job.getInstance(conf, "STjoin") replaces it.
      Job job = new Job(conf, "STjoin");
      job.setJarByClass(STjoin.class);

      job.setMapperClass(Map.class);
      job.setReducerClass(Reduce.class);

      job.setMapOutputKeyClass(Text.class);
      job.setMapOutputValueClass(Text.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(Text.class);

      // Use the parsed arguments, not the raw args, so generic options are not mistaken for paths.
      FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
      FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

      System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}

4.4 Prepare the test data: edit the file, upload it to HDFS, set the run arguments in MyEclipse, and run the job.
4.5 Check the result.

Tom       Alice
Tom       Jesse
Jone      Alice
Jone      Jesse
Tom       Mary
Tom       Ben
Jone      Mary
Jone      Ben
Philip    Alice
Philip    Jesse
Mark      Alice
Mark      Jesse
