花花世界蕾 posted on 2019-1-30 12:59:45

Using HanLP for Chinese Word Segmentation in Spark

  1. Upload HanLP's data directory (which contains the dictionaries and models) to HDFS, then point the root path in the project's hanlp.properties at it, for example:
  root=hdfs://localhost:9000/tmp/
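  HanLP resolves its files as root plus a relative path, so with this setting the uploaded directory should end up at hdfs://localhost:9000/tmp/data. As a minimal sketch, the relevant lines of hanlp.properties might look like the following; the commented IOAdapter key is an optional alternative to setting the adapter in code as in step 3 below, and the package prefix is a placeholder for wherever the adapter class lives:

  root=hdfs://localhost:9000/tmp/
  # Optional alternative to setting HanLP.Config.IOAdapter in code (step 3):
  # IOAdapter=com.example.hanlp.HadoopFileIoAdapter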
  2. Implement the com.hankcs.hanlp.corpus.io.IIOAdapter interface:
  

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import com.hankcs.hanlp.corpus.io.IIOAdapter;

// IIOAdapter implementation that lets HanLP read and write its data on HDFS.
public class HadoopFileIoAdapter implements IIOAdapter {

    @Override
    public InputStream open(String path) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(path), conf);
        return fs.open(new Path(path));
    }

    @Override
    public OutputStream create(String path) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(path), conf);
        return fs.create(new Path(path));
    }
}
  

  3. Set the IOAdapter and create the segmenter:

  private static Segment segment;
  static {
      HanLP.Config.IOAdapter = new HadoopFileIoAdapter();
      segment = new CRFSegment();
  }
  After that, segment can be used for word segmentation inside Spark operations, as in the sketch below.
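  To make that last step concrete, here is a minimal job sketch. The job skeleton, app name, and input/output paths are placeholder assumptions for illustration; the adapter and the static initializer are the ones from steps 2 and 3. Because segment is a static field, the closure does not serialize the segmenter; each executor JVM runs the static initializer itself, installing the IOAdapter before the model loads.

import java.util.stream.Collectors;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import com.hankcs.hanlp.HanLP;
import com.hankcs.hanlp.seg.CRF.CRFSegment;
import com.hankcs.hanlp.seg.Segment;

public class HanLPSegmentJob {

    // Same pattern as step 3: runs once per JVM (driver and each executor),
    // so the IOAdapter is installed before the CRF model is loaded.
    private static final Segment segment;
    static {
        HanLP.Config.IOAdapter = new HadoopFileIoAdapter();
        segment = new CRFSegment();
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("hanlp-crf-segment");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Placeholder paths; any text file on HDFS works.
            JavaRDD<String> lines = sc.textFile("hdfs://localhost:9000/tmp/input.txt");
            // Segment each line and join the resulting terms with spaces.
            JavaRDD<String> segmented = lines.map(text ->
                    segment.seg(text).stream()
                            .map(term -> term.word)
                            .collect(Collectors.joining(" ")));
            segmented.saveAsTextFile("hdfs://localhost:9000/tmp/segmented");
        }
    }
}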
  Originally published on 云聪's blog.

