Apache Tika Source Code Study (Part 8)

佘小宝的爹 · Posted on 2015-8-5 11:24:22

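This article looks at Tika's language detection and at how Tika handles random-access reads on an input stream. Because the language detection implementation involves several algorithms, I will not paste Tika's source code for it here.
[Figure: UML model of the interfaces and classes related to Tika's language detection]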
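
Although the Tika source is not reproduced here, a toy sketch can give a feel for this kind of algorithm. It assumes a character-trigram profile approach, which, as far as I know, is the general idea behind Tika's LanguageProfile. The class name, the cosine-similarity scoring and the two tiny reference texts are all made up for illustration; this is not Tika's code.

```java
import java.util.HashMap;
import java.util.Map;

// Toy trigram-based language guesser; NOT Tika's code. The class name and
// the sample reference texts are hypothetical and only for illustration.
class TrigramLanguageSketch {

    // Count every 3-character window of the lower-cased text.
    static Map<String, Integer> profile(String text) {
        Map<String, Integer> counts = new HashMap<>();
        String s = text.toLowerCase();
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    // Cosine similarity between two trigram count vectors (0.0 to 1.0).
    static double similarity(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    // Return the key of the reference profile most similar to the text.
    static String detect(String text, Map<String, Map<String, Integer>> refs) {
        Map<String, Integer> p = profile(text);
        String best = null;
        double bestScore = -1;
        for (Map.Entry<String, Map<String, Integer>> e : refs.entrySet()) {
            double score = similarity(p, e.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Integer>> refs = new HashMap<>();
        refs.put("en", profile("the quick brown fox jumps over the lazy dog and the cat"));
        refs.put("de", profile("der schnelle braune fuchs springt ueber den faulen hund und die katze"));
        System.out.println(detect("the dog and the fox", refs));
        System.out.println(detect("der hund und die katze", refs));
    }
}
```

A real detector would build its profiles from large corpora and would also need a confidence threshold, which is the role isReasonablyCertain() plays in the parser shown in this article.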

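To obtain both a document's content and its language, we can add a new parser class, LanguageDetectingParser, that extends DelegatingParser. The code is as follows: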

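// Imports assume the Tika 1.x package layout
import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.exception.TikaException;
import org.apache.tika.language.LanguageIdentifier;
import org.apache.tika.language.ProfilingHandler;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.DelegatingParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.TeeContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class LanguageDetectingParser extends DelegatingParser {

    private static final long serialVersionUID = 1L;

    @Override
    public void parse(
            InputStream stream, ContentHandler handler,
            final Metadata metadata, ParseContext context)
            throws SAXException, IOException, TikaException {
        // Send the extracted text to both the caller's handler and the profiler
        ProfilingHandler profiler = new ProfilingHandler();
        ContentHandler tee = new TeeContentHandler(handler, profiler);
        super.parse(stream, tee, metadata, context);
        // Record the language only when the detector is reasonably certain
        LanguageIdentifier identifier = profiler.getLanguage();
        if (identifier.isReasonablyCertain()) {
            metadata.set(Metadata.LANGUAGE, identifier.getLanguage());
        }
    }

    @Override
    protected Parser getDelegateParser(ParseContext context) {
        // Fall back to AutoDetectParser when the context supplies no Parser
        return context.get(Parser.class, new AutoDetectParser());
    }
}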
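
The key step in the parser above is the "tee": the same stream of text events reaches both the caller's handler and the profiler. The sketch below shows that idea using only the SAX classes bundled with the JDK. TeeSketch and TextCollector are hypothetical names, and only the characters event is forwarded; Tika's TeeContentHandler is not reproduced here.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Stand-alone sketch of the "tee" idea using only JDK SAX classes.
// TeeSketch and TextCollector are hypothetical names, not Tika classes.
class TeeSketch extends DefaultHandler {
    private final ContentHandler[] targets;

    TeeSketch(ContentHandler... targets) {
        this.targets = targets;
    }

    // Forward each text event to every target handler, in order.
    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        for (ContentHandler target : targets) {
            target.characters(ch, start, length);
        }
    }

    public static void main(String[] args) throws SAXException {
        TextCollector output = new TextCollector();   // plays the caller's handler
        TextCollector profiler = new TextCollector(); // plays the language profiler
        ContentHandler tee = new TeeSketch(output, profiler);

        char[] chunk = "hello tika".toCharArray();
        tee.characters(chunk, 0, chunk.length);

        System.out.println(output.text);
        System.out.println(profiler.text);
    }
}

// Collects character data so we can see what each side received.
class TextCollector extends DefaultHandler {
    final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }
}
```

Because the profiler only observes events as they pass by, the document is parsed once and no second read of the stream is needed for language detection.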
As for how Tika wraps an InputStream to allow random access, look at the TikaInputStream class used in the parse method of AutoDetectParser:

public void parse(
        InputStream stream, ContentHandler handler,
        Metadata metadata, ParseContext context)
        throws IOException, SAXException, TikaException {
    TemporaryResources tmp = new TemporaryResources();
    try {
        TikaInputStream tis = TikaInputStream.get(stream, tmp);

        // Automatically detect the MIME type of the document
        MediaType type = detector.detect(tis, metadata);
        metadata.set(Metadata.CONTENT_TYPE, type.toString());

        // TIKA-216: Zip bomb prevention
        SecureContentHandler sch = new SecureContentHandler(handler, tis);
        try {
            // Parse the document
            super.parse(tis, sch, metadata, context);
        } catch (SAXException e) {
            // Convert zip bomb exceptions to TikaExceptions
            sch.throwIfCauseOf(e);
            throw e;
        }
    } finally {
        tmp.dispose();
    }
}
Its mechanism is to spool the InputStream to a temporary file; I will not paste that source code here.
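
To make the idea concrete without quoting Tika, here is a minimal sketch of spooling with plain JDK classes. SpoolSketch and its spool method are hypothetical names, and this is only an illustration of the technique, not how TikaInputStream is actually written.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Sketch of spooling a one-shot stream to a temp file so it can be re-read.
// SpoolSketch is a hypothetical name; this is not TikaInputStream's code.
class SpoolSketch {

    // Copy the whole stream into a new temporary file and return its path.
    static Path spool(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("spool-", ".tmp");
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }

    public static void main(String[] args) throws IOException {
        InputStream once = new ByteArrayInputStream("abc".getBytes(StandardCharsets.UTF_8));
        Path tmp = spool(once);
        try {
            // The file can be opened as many times as needed.
            String first = new String(Files.readAllBytes(tmp), StandardCharsets.UTF_8);
            String second = new String(Files.readAllBytes(tmp), StandardCharsets.UTF_8);
            System.out.println(first.equals(second));
        } finally {
            // Plays the role that tmp.dispose() plays in the Tika code above.
            Files.deleteIfExists(tmp);
        }
    }
}
```

Once the bytes are in a file, random access and repeated reads come for free, at the cost of one extra copy and the need to clean up the temporary file afterwards.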
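Sometimes we need to read the same InputStream more than once, so Tika wraps it in the TikaInputStream class. A typical scenario: we first detect the document's character encoding from the InputStream and then go on to parse that same InputStream.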

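import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;

import org.apache.tika.detect.EncodingDetector;
import org.apache.tika.exception.TikaException;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.txt.UniversalEncodingDetector;

public class EncodingDetectDemo { // class name added for illustration
    public static void main(String[] args) throws IOException, TikaException {
        File file = new File("E:\\watiao.htm");
        // try-with-resources closes the stream even if detection fails
        try (InputStream stream = TikaInputStream.get(file)) {
            EncodingDetector detector = new UniversalEncodingDetector();
            // detect() may return null when no charset can be determined
            Charset charset = detector.detect(stream, new Metadata());
            System.out.println("Encoding: " + (charset != null ? charset.name() : "unknown"));
            // Continue parsing the same stream here
        }
    }
}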
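
The "detect first, then parse the same stream" pattern relies on the stream being rewindable. The sketch below shows the underlying mark/reset technique with plain JDK streams, using a UTF-8 byte order mark check as a stand-in for real encoding detection. PeekSketch, hasUtf8Bom and readAll are hypothetical names, and the BOM check is far simpler than what UniversalEncodingDetector does.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Sketch of "peek, then rewind" with plain JDK streams.
// PeekSketch, hasUtf8Bom and readAll are hypothetical names for illustration.
class PeekSketch {

    // Inspect the first three bytes for a UTF-8 byte order mark, then rewind.
    static boolean hasUtf8Bom(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(3);
        try {
            byte[] head = new byte[3];
            int n = 0;
            while (n < 3) {
                int r = in.read(head, n, 3 - n);
                if (r < 0) {
                    break; // stream is shorter than three bytes
                }
                n += r;
            }
            return n == 3
                    && head[0] == (byte) 0xEF
                    && head[1] == (byte) 0xBB
                    && head[2] == (byte) 0xBF;
        } finally {
            in.reset(); // the next reader starts from the beginning again
        }
    }

    // Read everything that is left in the stream.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        int r;
        while ((r = in.read(buf)) != -1) {
            out.write(buf, 0, r);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        InputStream stream = new BufferedInputStream(new ByteArrayInputStream(data));
        boolean bom = hasUtf8Bom(stream);   // detection step
        byte[] all = readAll(stream);       // "parsing" step sees every byte
        System.out.println(bom + " " + all.length);
    }
}
```

Mark/reset only works within the read limit passed to mark(), which is why spooling to a temporary file is the more general answer when the whole document must be read again.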
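The articles in this Tika source code analysis series are my own original work. I consulted the English edition of "Tika in Action" and will add more if I gain further insights.
When reposting, please credit the source: 博客园 刺猬的温驯
Article link: http://www.iyunv.com/chenying99/archive/2013/03/11/2953365.html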
