花花世界蕾 posted on 2019-1-30 12:59:45

Using HanLP for Chinese Word Segmentation in Spark

  1. Upload HanLP's data directory (which contains the dictionaries and models) to HDFS, then point the root path in the project's hanlp.properties at it, for example:
  root=hdfs://localhost:9000/tmp/
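  HanLP resolves its files as root plus a relative path, so with this setting the uploaded directory should end up at hdfs://localhost:9000/tmp/data. As a minimal sketch, the relevant lines of hanlp.properties might look like the following; the commented IOAdapter key is an optional alternative to setting the adapter in code as in step 3 below, and the package prefix is a placeholder for wherever the adapter class lives:

  root=hdfs://localhost:9000/tmp/
  # Optional alternative to setting HanLP.Config.IOAdapter in code (step 3):
  # IOAdapter=com.example.hanlp.HadoopFileIoAdapter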
  2. Implement the com.hankcs.hanlp.corpus.io.IIOAdapter interface:
  

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import com.hankcs.hanlp.corpus.io.IIOAdapter;

// IIOAdapter implementation that lets HanLP read and write its data on HDFS.
public class HadoopFileIoAdapter implements IIOAdapter {

    @Override
    public InputStream open(String path) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(path), conf);
        return fs.open(new Path(path));
    }

    @Override
    public OutputStream create(String path) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(path), conf);
        return fs.create(new Path(path));
    }
}
  

  3. Set the IOAdapter and create the segmenter:

  private static Segment segment;
  static {
      HanLP.Config.IOAdapter = new HadoopFileIoAdapter();
      segment = new CRFSegment();
  }
  After that, segment can be used for word segmentation inside Spark operations, as in the sketch below.
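  To make that last step concrete, here is a minimal job sketch. The job skeleton, app name, and input/output paths are placeholder assumptions for illustration; the adapter and the static initializer are the ones from steps 2 and 3. Because segment is a static field, the closure does not serialize the segmenter; each executor JVM runs the static initializer itself, installing the IOAdapter before the model loads.

import java.util.stream.Collectors;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import com.hankcs.hanlp.HanLP;
import com.hankcs.hanlp.seg.CRF.CRFSegment;
import com.hankcs.hanlp.seg.Segment;

public class HanLPSegmentJob {

    // Same pattern as step 3: runs once per JVM (driver and each executor),
    // so the IOAdapter is installed before the CRF model is loaded.
    private static final Segment segment;
    static {
        HanLP.Config.IOAdapter = new HadoopFileIoAdapter();
        segment = new CRFSegment();
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("hanlp-crf-segment");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Placeholder paths; any text file on HDFS works.
            JavaRDD<String> lines = sc.textFile("hdfs://localhost:9000/tmp/input.txt");
            // Segment each line and join the resulting terms with spaces.
            JavaRDD<String> segmented = lines.map(text ->
                    segment.seg(text).stream()
                            .map(term -> term.word)
                            .collect(Collectors.joining(" ")));
            segmented.saveAsTextFile("hdfs://localhost:9000/tmp/segmented");
        }
    }
}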
  Originally published on 云聪's blog.

