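This article looks at Tika's language detection and at how Tika solves the problem of reading an input stream more than once (random access). Because the language detection implementation involves a number of algorithms, I won't paste Tika's source for it here.
[Figure: UML model of the interfaces and classes involved in Tika's language detection]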
To obtain both a document's content and its language, we can add a LanguageDetectingParser class that extends DelegatingParser. The code is as follows:
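// Imports assume Tika 1.x package names
import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.exception.TikaException;
import org.apache.tika.language.LanguageIdentifier;
import org.apache.tika.language.ProfilingHandler;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.DelegatingParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.TeeContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class LanguageDetectingParser extends DelegatingParser {
    private static final long serialVersionUID = 1L;

    @Override
    public void parse(
            InputStream stream, ContentHandler handler,
            final Metadata metadata, ParseContext context)
            throws SAXException, IOException, TikaException {
        // Send the extracted text to both the caller's handler and the language profiler
        ProfilingHandler profiler = new ProfilingHandler();
        ContentHandler tee = new TeeContentHandler(handler, profiler);
        super.parse(stream, tee, metadata, context);
        // Record the language only when the identifier is reasonably certain
        LanguageIdentifier identifier = profiler.getLanguage();
        if (identifier.isReasonablyCertain()) {
            metadata.set(Metadata.LANGUAGE, identifier.getLanguage());
        }
    }

    @Override
    protected Parser getDelegateParser(ParseContext context) {
        // Fall back to AutoDetectParser when the context supplies no Parser
        return context.get(Parser.class, new AutoDetectParser());
    }
}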
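Since I am not pasting Tika's detection source, here is a toy sketch of the general idea behind profile-based language detection: count character trigrams in a text and compare the relative frequencies against reference profiles. The class name TrigramProfileSketch, the sample sentences, and the Euclidean distance measure are my own assumptions for illustration; this is not Tika's implementation, and real profiles are built from far larger corpora.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Toy character-trigram profile (hypothetical class, not part of Tika). */
final class TrigramProfileSketch {
    private final Map<String, Integer> counts = new HashMap<>();
    private long total = 0;

    /** Counts every 3-character window of the lower-cased text. */
    void add(String text) {
        String s = text.toLowerCase(Locale.ROOT);
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1, Integer::sum);
            total++;
        }
    }

    /** Relative frequency of one trigram, 0.0 for an empty profile. */
    private double frequency(String trigram) {
        return total == 0 ? 0.0 : counts.getOrDefault(trigram, 0) / (double) total;
    }

    /** Euclidean distance between trigram frequencies; smaller means more similar. */
    double distance(TrigramProfileSketch other) {
        Set<String> keys = new HashSet<>(counts.keySet());
        keys.addAll(other.counts.keySet());
        double sumOfSquares = 0.0;
        for (String key : keys) {
            double diff = frequency(key) - other.frequency(key);
            sumOfSquares += diff * diff;
        }
        return Math.sqrt(sumOfSquares);
    }

    public static void main(String[] args) {
        // Tiny reference profiles, for illustration only
        TrigramProfileSketch en = new TrigramProfileSketch();
        en.add("the quick brown fox jumps over the lazy dog and the cat");
        TrigramProfileSketch de = new TrigramProfileSketch();
        de.add("der schnelle braune fuchs springt ueber den faulen hund und die katze");

        TrigramProfileSketch sample = new TrigramProfileSketch();
        sample.add("the dog and the fox");
        System.out.println("distance to en: " + sample.distance(en));
        System.out.println("distance to de: " + sample.distance(de));
    }
}
```

The profile with the smallest distance is taken as the best guess. A check in the spirit of isReasonablyCertain() could then require that distance to fall below some threshold, although the threshold Tika actually uses is not something this sketch tries to reproduce.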
As for how Tika wraps an InputStream to give it random access, look at the TikaInputStream class used in the parse method of AutoDetectParser:
public void parse(
        InputStream stream, ContentHandler handler,
        Metadata metadata, ParseContext context)
        throws IOException, SAXException, TikaException {
    TemporaryResources tmp = new TemporaryResources();
    try {
        TikaInputStream tis = TikaInputStream.get(stream, tmp);

        // Automatically detect the MIME type of the document
        MediaType type = detector.detect(tis, metadata);
        metadata.set(Metadata.CONTENT_TYPE, type.toString());

        // TIKA-216: Zip bomb prevention
        SecureContentHandler sch = new SecureContentHandler(handler, tis);
        try {
            // Parse the document
            super.parse(tis, sch, metadata, context);
        } catch (SAXException e) {
            // Convert zip bomb exceptions to TikaExceptions
            sch.throwIfCauseOf(e);
            throw e;
        }
    } finally {
        tmp.dispose();
    }
}
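Its mechanism is to spool the InputStream to a temporary file, which is why the method above creates a TemporaryResources object and disposes of it in the finally block. I won't paste that source here either.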
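The temporary-file idea can be sketched with the JDK alone. The following is a minimal illustration under my own assumptions, not Tika's code: the class SpooledStreamSketch and its methods open() and file() are hypothetical names. It copies a one-shot stream to a temp file once, hands out a fresh stream for every read, and deletes the file on close, much as tmp.dispose() cleans up above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Copies a one-shot InputStream to a temp file so it can be re-read (hypothetical class). */
final class SpooledStreamSketch implements AutoCloseable {
    private final Path spoolFile;

    SpooledStreamSketch(InputStream source) throws IOException {
        spoolFile = Files.createTempFile("spool-", ".tmp");
        // createTempFile already created the file, so the copy must overwrite it
        Files.copy(source, spoolFile, StandardCopyOption.REPLACE_EXISTING);
    }

    /** Each call returns a new stream positioned at the first byte. */
    InputStream open() throws IOException {
        return Files.newInputStream(spoolFile);
    }

    /** Location of the temporary copy. */
    Path file() {
        return spoolFile;
    }

    /** Deletes the temporary copy. */
    @Override
    public void close() throws IOException {
        Files.deleteIfExists(spoolFile);
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "hello tika".getBytes(StandardCharsets.UTF_8);
        try (SpooledStreamSketch spool = new SpooledStreamSketch(new ByteArrayInputStream(data))) {
            // First pass, e.g. type detection
            try (InputStream first = spool.open()) {
                System.out.println("first pass read " + first.readAllBytes().length + " bytes");
            }
            // Second pass, e.g. parsing, starts from the beginning again
            try (InputStream second = spool.open()) {
                System.out.println(new String(second.readAllBytes(), StandardCharsets.UTF_8));
            }
        }
    }
}
```

The trade-off is one full copy of the data on disk in exchange for unlimited re-reads and true random access through the file.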
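Sometimes the same InputStream has to be consumed more than once, so Tika wraps it in the TikaInputStream class. A typical case: we first detect the document's character encoding from the InputStream, and then go on to parse that same InputStream.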
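import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import org.apache.tika.detect.EncodingDetector;
import org.apache.tika.exception.TikaException;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.txt.UniversalEncodingDetector;

// The class name is illustrative; the original shows only the main method
public class EncodingDetectionDemo {
    public static void main(String[] args) throws IOException, TikaException {
        File file = new File("E:\\watiao.htm");
        // try-with-resources closes the stream even if detection fails
        try (InputStream stream = TikaInputStream.get(file)) {
            EncodingDetector detector = new UniversalEncodingDetector();
            // detect() returns null when no encoding can be determined
            Charset charset = detector.detect(stream, new Metadata());
            System.out.println("Encoding: " + (charset != null ? charset.name() : "unknown"));
            // Continue parsing the same stream here
        }
    }
}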
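The "detect first, then parse the same stream" pattern depends on the stream being rewound after detection. As a rough JDK-only sketch of that pattern (my own illustration, not how UniversalEncodingDetector works internally), the hypothetical method sniffBom below peeks at the first three bytes for a byte-order mark, then uses mark/reset so the caller can still read the stream from the start. The class name PeekThenParseSketch and the fallback parameter are assumptions.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Peeks at the head of a stream, then rewinds it (hypothetical class, not part of Tika). */
final class PeekThenParseSketch {

    /** Guesses the charset from a byte-order mark and leaves the stream at its start. */
    static Charset sniffBom(InputStream in, Charset fallback) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(3);                          // remember the current position
        byte[] head = new byte[3];
        int n = in.readNBytes(head, 0, 3);   // may read fewer than 3 bytes at end of stream
        in.reset();                          // rewind so the caller sees every byte again

        int b0 = head[0] & 0xFF, b1 = head[1] & 0xFF, b2 = head[2] & 0xFF;
        if (n >= 3 && b0 == 0xEF && b1 == 0xBB && b2 == 0xBF) return StandardCharsets.UTF_8;
        if (n >= 2 && b0 == 0xFE && b1 == 0xFF) return StandardCharsets.UTF_16BE;
        if (n >= 2 && b0 == 0xFF && b1 == 0xFE) return StandardCharsets.UTF_16LE;
        return fallback;
    }

    public static void main(String[] args) throws IOException {
        byte[] doc = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        // BufferedInputStream adds mark/reset support to any underlying stream
        try (InputStream in = new BufferedInputStream(new ByteArrayInputStream(doc))) {
            Charset charset = sniffBom(in, StandardCharsets.ISO_8859_1);
            System.out.println("detected: " + charset.name());
            // The stream is back at byte 0, so a parser can now read all of it
            System.out.println("bytes still available: " + in.readAllBytes().length);
        }
    }
}
```

Mark/reset only covers a bounded look-ahead, which is why a wrapper that can also fall back to a temporary file is useful when a detector or parser needs the whole document.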
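This series of articles analyzing the Tika source code is my own original work. I consulted the English edition of Tika in Action, and I will add more as I learn more.
Please credit the source when reposting: 博客园 (cnblogs), 刺猬的温驯
Article link: http://www.iyunv.com/chenying99/archive/2013/03/11/2953365.html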