hadoop 分布式缓存

zzbb · 发表于 2015-7-13 09:23:08

　　Hadoop 分布式缓存实现目的是在所有的MapReduce调用一个统一的配置文件，首先将缓存文件放置在HDFS中，然后程序在执行的过程中会可以通过设定将文件下载到本地具体设定如下：
　　public static void main(String[] arge) throws IOException, ClassNotFoundException, InterruptedException{

      Configuration conf=new Configuration();
      conf.set("fs.default.name", "hdfs://192.168.1.45:9000");
      FileSystem fs=FileSystem.get(conf);
      fs.delete(new Path("CASICJNJP/gongda/Test_gd20140104"));

      conf.set("mapred.job.tracker", "192.168.1.45:9001");
      conf.set("mapred.jar", "/home/hadoop/workspace/jar/OBDDataSelectWithImeiTxt.jar");
      Job job=new Job(conf,"myTaxiAnalyze");


   DistributedCache.createSymlink(job.getConfiguration());//
      try {
         DistributedCache.addCacheFile(new URI("/user/hadoop/CASICJNJP/DistributeFiles/imei.txt"), job.getConfiguration());
      } catch (URISyntaxException e1) {
         // TODO Auto-generated catch block
         e1.printStackTrace();
      }
      job.setMapperClass(OBDDataSelectMaper.class);
      job.setReducerClass(OBDDataSelectReducer.class);
      //job.setNumReduceTasks(10);
      //job.setCombinerClass(IntSumReducer.class);
      job.setMapOutputKeyClass(Text.class);
      job.setMapOutputValueClass(Text.class);

      FileInputFormat.addInputPath(job, new Path("/user/hadoop/CASICJNJP/SortedData/20140104"));
      FileOutputFormat.setOutputPath(job, new Path("CASICJNJP/gongda/SelectedData"));

      System.exit(job.waitForCompletion(true)?0:1);

}
　　代码中标红的为将HDFS中的/user/hadoop/CASICJNJP/DistributeFiles/imei.txt作为分布式缓存
　　
　　public class OBDDataSelectMaper extends Mapper {
String[] strs;
String[] ImeiTimes;
String timei;
String time;
private java.util.List ImeiList = new java.util.ArrayList();
protected void setup(Context context) throws IOException,
         InterruptedException {
      try {
         Path[] cacheFiles = DistributedCache.getLocalCacheFiles(context
                  .getConfiguration());
         if (cacheFiles != null && cacheFiles.length > 0) {
            String line;
            BufferedReader br = new BufferedReader(new FileReader(
                     cacheFiles[0].toString()));
            try {
                  line = br.readLine();
                  while ((line = br.readLine()) != null) {
                     ImeiList.add(Integer.parseInt(line));
                  }
            } finally {
                  br.close();
            }
         }
      } catch (IOException e) {
         System.err.println("Exception reading DistributedCache: " + e);
      }
}
public void map(Object key, Text value, Context context)
         throws IOException, InterruptedException {
      try {
         strs = value.toString().split("\t");
         ImeiTimes = strs[0].split("_");
         timei = ImeiTimes[0];
         if (ImeiList.contains(Integer.parseInt(timei))) {
            context.write(new Text(strs[0]), value);
         }
      } catch (Exception ex) {
      }
}
}
　　上述标红代码中在Map的setup函数中加载分布式缓存。

账号		自动登录	找回密码
密码			立即注册

Centos6.5×64安装配置openmeetings3.0.3详

大疆运维招人啦，

C++ :try 语句块和异常处理

C++的多态

Red Hat RHCE 8 (EX294) Cert Guide

Java/C++ 区别：看完这一篇，就够用！

别再用过时库了！这 13 个顶级 C++ 库才是

[经验分享] hadoop 分布式缓存

浏览过的版块

扫码加入运维网微信交流群