hadoop倒排索引

zj2092 · 发表于 2015-7-12 09:47:20

　　
　　1.前言
　　学习hadoop的童鞋，倒排索引这个算法还是挺重要的。这是以后展开工作的基础。首先，我们来认识下什么是倒拍索引：
　　倒排索引简单地就是：根据单词，返回它在哪个文件中出现过，而且频率是多少的结果。这就像百度里的搜索，你输入一个关键字，那么百度引擎就迅速的在它的服务器里找到有该关键字的文件，并根据频率和其他一些策略（如页面点击投票率）等来给你返回结果。这个过程中，倒排索引就起到很关键的作用。
　　2.分析设计
　　倒排索引涉及几个过程：Map过程，Combine过程，Reduce过程。下面我们来分析以上的过程。
　　2.1Map过程
　　当你把需要处理的文档上传到hdfs时，首先默认的TextInputFormat类对输入的文件进行处理，得到文件中每一行的偏移量和这一行内容的键值对做为map的输入。在改写map函数的时候，我们就需要考虑，怎么设计key和value的值来适合MapReduce框架，从而得到正确的结果。由于我们要得到单词,所属的文档URL,词频，而只有两个值，那么就必须得合并其中得两个信息了。这里我们设计key=单词＋URL，value=词频。即map得输出为，之所以将单词＋URL做为key，时利用MapReduce框架自带得Map端进行排序。
　　下面举个简单得例子：

　　图1 map过程输入／输出
　　2.2 Combine过程
　　
　　combine过程将key值相同得value值累加，得到一个单词在文档上得词频。但是为了把相同得key交给同一个reduce处理，我们需要设计为key=单词，value＝URL+词频
　　

　　图2 Combin过程输入/输出
　　2.3Reduce过程
　　reduce过程其实就是一个合并的过程了，只需将相同的key值的value值合并成倒排索引需要的格式即可。

　　图3 reduce过程输入/输出
　　3.源代码
　　

package reverseIndex;
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class InvertedIndex {
public static class InvertedIndexMapper extends Mapper{
private Text keyInfo=new Text();
private Text valueInfo=new Text();
private FileSplit split;
public void map(Object key,Text value,Context context)throws IOException,InterruptedException {
//获得对所属的对象
split=(FileSplit)context.getInputSplit();
StringTokenizer itr=new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
//key值有单词和url组成，如"mapreduce:1.txt"
keyInfo.set(itr.nextToken()+":"+split.getPath().toString());
valueInfo.set("1");
context.write(keyInfo, valueInfo);
}
}
}
public static class InvertedIndexCombiner extends Reducer{
private Text info=new Text();
public void reduce(Text key,Iterable values,Context context)throws IOException,InterruptedException {
//统计词频
int sum=0;
for (Text value:values) {
sum+=Integer.parseInt(value.toString());
}
int splitIndex=key.toString().indexOf(":");
//重新设置value值由url和词频组成
info.set(key.toString().substring(splitIndex+1)+":"+sum);
//重新设置key值为单词
key.set(key.toString().substring(0,splitIndex));
context.write(key, info);
}
}
public static class InvertedIndexReduce extends Reducer {
private Text result=new Text();
public void reduce(Text key,Iterablevalues,Context context) throws IOException,InterruptedException{
//生成文档列表
String fileList=new String();
for (Text value:values) {
fileList+=value.toString()+";";
}
result.set(fileList);
context.write(key, result);
}
}
public static void main(String[] args) throws Exception {
// TODO Auto-generated method stub
Configuration conf=new Configuration();
String[] otherArgs=new GenericOptionsParser(conf,args).getRemainingArgs();
if (otherArgs.length!=2) {
System.err.println("Usage:invertedindex");
System.exit(2);
}
Job job=new Job(conf,"InvertedIndex");
job.setJarByClass(InvertedIndex.class);
job.setMapperClass(InvertedIndexMapper.class);
job.setMapOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setCombinerClass(InvertedIndexCombiner.class);
job.setReducerClass(InvertedIndexReduce.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
System.exit(job.waitForCompletion(true)?0:1);
}
}
View Code

账号		自动登录	找回密码
密码			立即注册

Centos6.5×64安装配置openmeetings3.0.3详

大疆运维招人啦，

C++ :try 语句块和异常处理

C++的多态

Red Hat RHCE 8 (EX294) Cert Guide

Java/C++ 区别：看完这一篇，就够用！

别再用过时库了！这 13 个顶级 C++ 库才是

[经验分享] hadoop倒排索引

浏览过的版块

扫码加入运维网微信交流群