Chinese Word Segmentation and Word Frequency Counting on Hadoop in Practice

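cnn · Posted 2015-7-11 08:28:37

First, some recommended reading: http://xiaoxia.org/2011/12/18/map-reduce-program-of-rmm-word-count-on-hadoop/. Xiaoxia's piece on ranking the popularity of character names in wuxia novels is a lot of fun, so here I follow its example and try the same thing myself.
The differences from that post are:
0) That post uses Hadoop Streaming; here I use the Java MapReduce API directly.
1) A different Chinese word segmenter: here I use IKAnalyzer, whose home page is http://code.google.com/p/ik-analyzer/.
2) The text analyzed here is 《射雕英雄传》 (The Legend of the Condor Heroes). Ha, there has to be some change.

0) Start from the WordCount source code and modify its Map so that it calls IKAnalyzer's segmentation inside the map method.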

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class ChineseWordCount {

    // Mapper: segments each input line with IKAnalyzer and emits (word, 1).
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Decode with toString() rather than getBytes(): Text.getBytes()
            // returns the backing array, which is only valid up to getLength(),
            // and toString() always decodes the content as UTF-8.
            Reader reader = new StringReader(value.toString());
            // The second argument (true) turns on IKAnalyzer's smart segmentation mode.
            IKSegmenter iks = new IKSegmenter(reader, true);
            Lexeme t;
            while ((t = iks.next()) != null) {
                word.set(t.getLexemeText());
                context.write(word, one);
            }
        }
    }

    // Reducer (also used as the combiner): sums the counts for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }
        // On Hadoop 2.x this constructor is deprecated in favor of Job.getInstance(conf, "word count").
        Job job = new Job(conf, "word count");
        job.setJarByClass(ChineseWordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
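To make the data flow of the job above easier to follow without a cluster, here is a minimal plain-Java sketch of the same emit-and-sum idea. It is only an illustration under stated assumptions: the class LocalWordCountSketch and its tokenize helper are hypothetical names, tokenize is a whitespace-splitting stand-in and not the IKAnalyzer API, and the two sample lines are made-up, pre-segmented input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class LocalWordCountSketch {

    // Hypothetical stand-in for IKSegmenter: splits pre-segmented text on whitespace.
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        for (String t : line.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // The "map" step emits (token, 1); grouping by key and summing stands in
    // for the shuffle plus the IntSumReducer-style reduce step.
    static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String token : tokenize(line)) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample input, already separated by spaces.
        List<String> lines = Arrays.asList("郭靖 黄蓉 郭靖", "黄蓉 洪七公");
        count(lines).forEach((w, c) -> System.out.println(w + "\t" + c));
    }
}
```

Because addition is associative and commutative, the same summing logic can run both as a combiner on partial map output and as the final reducer, which is why the job registers IntSumReducer in both roles.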
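1) So, that part is done, and it runs fine in the local plugin-based simulation environment. Package it (bundling the segmentation library) and push it to the cluster.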

hadoop fs -put chinese_in.txt chinese_in.txt
hadoop jar WordCount.jar chinese_in.txt out0
...mapping reducing...
hadoop fs -ls ./out0
hadoop fs -get out0/part-r-00000 words.txt
2) Post-processing the data:
2.1) Sorting

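head words.txt
tail words.txt

# Sort on the count column as text: lexicographic, so 1080 sorts before 950
sort -k2 words.txt > 0.txt
head 0.txt
tail 0.txt
# Reverse the text order; still not numeric
sort -k2r words.txt > 0.txt
head 0.txt
tail 0.txt
# Numeric and descending: the ordering actually wanted
sort -k2rn words.txt > 0.txt
head -n 50 0.txt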
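The three sort attempts above differ only in how the count column is compared. As a rough Java sketch of that difference (SortByCountSketch and its methods are hypothetical names, and the input is assumed to be "word, whitespace, count" lines like the job output), comparing the counts as strings versus as integers gives different rankings:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SortByCountSketch {

    // Second whitespace-separated column, kept as text.
    static String countField(String line) {
        return line.split("\\s+")[1];
    }

    // Second column parsed as an integer.
    static int countValue(String line) {
        return Integer.parseInt(countField(line));
    }

    // Descending by the count compared as text (roughly what sort -k2r does).
    static List<String> sortByCountTextDesc(List<String> lines) {
        List<String> sorted = new ArrayList<>(lines);
        sorted.sort(Comparator.comparing(SortByCountSketch::countField).reversed());
        return sorted;
    }

    // Descending by the count compared as a number (roughly what sort -k2rn does).
    static List<String> sortByCountDesc(List<String> lines) {
        List<String> sorted = new ArrayList<>(lines);
        sorted.sort(Comparator.comparingInt(SortByCountSketch::countValue).reversed());
        return sorted;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("武功\t950", "师父\t1080", "郭靖\t6427");
        System.out.println("text order:    " + sortByCountTextDesc(lines));
        System.out.println("numeric order: " + sortByCountDesc(lines));
    }
}
```

With text comparison, 950 ranks above 6427 and 1080 because the comparison goes character by character, which is why only the numeric variant produces a usable top-50 list.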
2.2) Extracting the targets (keep only words of at least two characters)

awk '{if(length($1)>=2) print $0}' 0.txt >1.txt
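The awk filter above relies on length($1) counting characters; some awk implementations or locales count bytes instead, in which case a single Chinese character encoded in UTF-8 would already have length 3 and nothing would be filtered out. As a hedged alternative sketch, the same filter can be written in Java by counting Unicode code points. MinLengthFilterSketch and keepLongWords are hypothetical names, and the single-character sample line in main is made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MinLengthFilterSketch {

    // Keeps lines whose first column (the word) has at least minChars code points.
    static List<String> keepLongWords(List<String> lines, int minChars) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            String word = line.split("\\s+")[0];
            // codePointCount counts characters, not bytes or UTF-16 units.
            if (word.codePointCount(0, word.length()) >= minChars) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // The second line is a made-up single-character entry.
        List<String> lines = Arrays.asList("郭靖\t6427", "的\t9000", "洪七公\t1225");
        for (String line : keepLongWords(lines, 2)) {
            System.out.println(line);
        }
    }
}
```

The input order is preserved, so running this on the already sorted list keeps the ranking intact, just as the awk step does.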
2.3) Presenting the results

head -n 50 1.txt | sed = | sed 'N;s/\n//'

1郭靖 6427
2黄蓉 4621
3欧阳 1660
4甚么 1430
5说道 1287
6洪七公 1225
7笑道 1214
8自己 1193
9一个 1160
10师父 1080
11黄药师 1059
12心中 1046
13两人 1016
14武功 950
15咱们 925
16一声 912
17只见 827
18他们 782
19心想 780
20周伯通 771
21功夫 758
22不知 755
23欧阳克 752
24听得 741
25丘处机 732
26当下 668
27爹爹 664
28只是 657
29知道 654
30这时 639
31之中 621
32梅超风 586
33身子 552
34都是 540
35不是 534
36如此 531
37柯镇恶 528
38到了 523
39不敢 522
40裘千仞 521
41杨康 520
42你们 509
43这一 495
44却是 478
45众人 476
46二人 475
47铁木真 469
48怎么 464
49左手 452
50地下 448
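Many of the words that are not character names are interesting too, for example: 5 说道 ("said"), 7 笑道 ("said with a laugh"), 12 心中 ("in one's heart"), 17 只见 ("only to see"), 22 不知 ("not knowing"), 30 这时 ("at this moment"), 49 左手 ("left hand").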
　　
