
[Experience Sharing] Apache Tika Source Code Study (Part 6)


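Posted on 2015-8-1 11:49:54
  The previous post did not get around to analyzing how Apache Tika detects a document's MIME type, or how it finds the matching Parser class for that MIME type. Let's continue from there.
  Inside tika-parsers.jar, the file META-INF/services/org.apache.tika.detect.Detector lists the MIME type detector classes that Tika provides. Tika also has some detector classes that are not listed in this file; we will find them below by reading the source.
  First, a quick look at the detector classes listed in this file: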



#  Licensed to the Apache Software Foundation (ASF) under one or more
#  contributor license agreements.  See the NOTICE file distributed with
#  this work for additional information regarding copyright ownership.
#  The ASF licenses this file to You under the Apache License, Version 2.0
#  (the "License"); you may not use this file except in compliance with
#  the License.  You may obtain a copy of the License at
#
#       http://www.apache.org/licenses/LICENSE-2.0
#
#  Unless required by applicable law or agreed to in writing, software
#  distributed under the License is distributed on an "AS IS" BASIS,
#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#  See the License for the specific language governing permissions and
#  limitations under the License.
org.apache.tika.parser.microsoft.POIFSContainerDetector
org.apache.tika.parser.pkg.ZipContainerDetector
  Note that the same file also exists under the same path in vorbis-java-tika-X.jar, and Tika loads both, so three implementation classes are registered in total:
  org.apache.tika.parser.microsoft.POIFSContainerDetector
  org.apache.tika.parser.pkg.ZipContainerDetector
  org.gagravarr.tika.OggDetector
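
  To make the merge of the two provider files concrete, here is a minimal sketch of the idea, assuming each file is read into a string. ProviderFileSketch and its two methods are hypothetical names for illustration; this is not Tika's own loader code, only a model of the META-INF/services convention (one class name per line, '#' starts a comment).

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Simplified model of merging META-INF/services provider files (not Tika's loader). */
class ProviderFileSketch {

    /** Extracts class names from one provider file: drops '#' comments and blank lines. */
    static List<String> parseProviderNames(String fileContent) {
        List<String> names = new ArrayList<>();
        for (String line : fileContent.split("\\r?\\n")) {
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash); // everything after '#' is a comment
            }
            line = line.trim();
            if (!line.isEmpty()) {
                names.add(line);
            }
        }
        return names;
    }

    /** Merges several provider files, keeping first-seen order and dropping duplicates. */
    static List<String> mergeProviderFiles(String... fileContents) {
        Set<String> merged = new LinkedHashSet<>();
        for (String content : fileContents) {
            merged.addAll(parseProviderNames(content));
        }
        return new ArrayList<>(merged);
    }
}
```

  Feeding it the content of the tika-parsers file shown above plus a one-line file naming org.gagravarr.tika.OggDetector gives the three class names listed here; the license header contributes nothing because every line of it starts with '#'.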
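  All of these MIME type detector classes implement the Detector interface.
  The UML diagram of the most important interfaces and classes involved in MIME type detection:
  [Figure: DSC0000.png]
  Source of the Detector interface: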



/**
 * Content type detector. Implementations of this interface use various
 * heuristics to detect the content type of a document based on given
 * input metadata or the first few bytes of the document stream.
 *
 * @since Apache Tika 0.3
 */
public interface Detector extends Serializable {

    /**
     * Detects the content type of the given input document. Returns
     * application/octet-stream if the type of the document
     * can not be detected.
     *
     * If the document input stream is not available, then the first
     * argument may be null. Otherwise the detector may
     * read bytes from the start of the stream to help in type detection.
     * The given stream is guaranteed to support the
     * {@link InputStream#markSupported() mark feature} and the detector
     * is expected to {@link InputStream#mark(int) mark} the stream before
     * reading any bytes from it, and to {@link InputStream#reset() reset}
     * the stream before returning. The stream must not be closed by the
     * detector.
     *
     * The given input metadata is only read, not modified, by the detector.
     *
     * @param input document input stream, or null
     * @param metadata input metadata for the document
     * @return detected media type, or application/octet-stream
     * @throws IOException if the document input stream could not be read
     */
    MediaType detect(InputStream input, Metadata metadata) throws IOException;

}
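
  The mark/reset contract described in the Javadoc is easy to get wrong, so here is a minimal sketch of a detector-like method that honors it. MagicByteSketch is a hypothetical class, not part of Tika; it returns type names as plain strings instead of MediaType so that it runs without any Tika jar, and it only recognizes the ZIP local-file-header signature "PK\3\4".

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

/** Toy detector that follows the mark/reset contract; not a Tika class. */
class MagicByteSketch {

    // First four bytes of a ZIP local file header: 'P', 'K', 0x03, 0x04
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    static String detect(InputStream input) throws IOException {
        if (input == null) {
            return "application/octet-stream"; // no stream, nothing to inspect
        }
        if (!input.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        byte[] prefix = new byte[ZIP_MAGIC.length];
        input.mark(prefix.length); // remember the start before reading anything
        try {
            int total = 0;
            while (total < prefix.length) {
                int n = input.read(prefix, total, prefix.length - total);
                if (n < 0) {
                    break; // stream is shorter than the signature
                }
                total += n;
            }
            if (total == prefix.length && Arrays.equals(prefix, ZIP_MAGIC)) {
                return "application/zip";
            }
            return "application/octet-stream";
        } finally {
            input.reset(); // rewind so the next detector or parser sees byte 0 again
        }
    }
}
```

  The reset in the finally block is the important part: several detectors are run against the same stream one after another, so each of them has to leave the stream where it found it and must not close it.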
  
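  The most important class implementing this interface is CompositeDetector. It performs no concrete MIME type detection itself; instead it delegates to the other Detector implementations, and it is the class the rest of Tika calls.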



/**
 * Content type detector that combines multiple different detection mechanisms.
 */
public class CompositeDetector implements Detector {

    /**
     * Serial version UID
     */
    private static final long serialVersionUID = 5980683158436430252L;

    private final MediaTypeRegistry registry;

    private final List<Detector> detectors;

    public CompositeDetector(
            MediaTypeRegistry registry, List<Detector> detectors) {
        this.registry = registry;
        this.detectors = detectors;
    }

    public CompositeDetector(List<Detector> detectors) {
        this(new MediaTypeRegistry(), detectors);
    }

    public CompositeDetector(Detector... detectors) {
        this(Arrays.asList(detectors));
    }

    public MediaType detect(InputStream input, Metadata metadata)
            throws IOException {
        // Start from the most generic type; keep a result only if it is more specific
        MediaType type = MediaType.OCTET_STREAM;
        for (Detector detector : getDetectors()) {
            MediaType detected = detector.detect(input, metadata);
            if (registry.isSpecializationOf(detected, type)) {
                type = detected;
            }
        }
        return type;
    }

    /**
     * Returns the component detectors.
     */
    public List<Detector> getDetectors() {
        return Collections.unmodifiableList(detectors);
    }

}
  
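  The constructor CompositeDetector(MediaTypeRegistry registry, List<Detector> detectors) initializes the member variables MediaTypeRegistry registry and List<Detector> detectors.
  The registry member holds the MIME types registered with the system; the detectors member is the collection of Detector implementations.
  The method MediaType detect(InputStream input, Metadata metadata) iterates over the detectors, starting from application/octet-stream; a detector's result replaces the current type only when the registry reports it as a specialization of that type, so the most specific type wins.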
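
  The "most specific type wins" loop can be tried out in isolation with a toy model. In the sketch below, CompositeSketch, addSuperType and the string-based types are all hypothetical stand-ins; the hand-written hierarchy (a .docx type placed directly under application/zip) is an assumption for illustration and is simpler than the hierarchy Tika actually registers.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the "keep the most specific type" loop; types are plain strings. */
class CompositeSketch {

    static final String OCTET_STREAM = "application/octet-stream";

    /** Child type mapped to its direct parent type (hypothetical, hand-written hierarchy). */
    private final Map<String, String> parents = new HashMap<>();

    void addSuperType(String type, String parent) {
        parents.put(type, parent);
    }

    /** True if b is a strict ancestor of a; unregistered types sit directly under octet-stream. */
    boolean isSpecializationOf(String a, String b) {
        String current = a;
        while (!current.equals(OCTET_STREAM)) {
            current = parents.getOrDefault(current, OCTET_STREAM);
            if (current.equals(b)) {
                return true;
            }
        }
        return false;
    }

    /** Same shape as the loop in CompositeDetector.detect: only a more specific result replaces the current one. */
    String detect(List<String> detectorResults) {
        String type = OCTET_STREAM;
        for (String detected : detectorResults) {
            if (isSpecializationOf(detected, type)) {
                type = detected;
            }
        }
        return type;
    }
}
```

  With the .docx type registered under application/zip, the result is the .docx type whether the detector results arrive as [zip, docx] or [docx, zip]: a later, more generic answer never overwrites an earlier, more specific one.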
  CompositeDetector has a subclass, DefaultDetector, which supplies the values for CompositeDetector's member variables:



public class DefaultDetector extends CompositeDetector {

    /** Serial version UID */
    private static final long serialVersionUID = -8170114575326908027L;

    /**
     * Finds all statically loadable detectors and sort the list by name,
     * rather than discovery order. Detectors are used in the given order,
     * so put the Tika parsers last so that non-Tika (user supplied)
     * parsers can take precedence.
     *
     * @param loader service loader
     * @return ordered list of statically loadable detectors
     */
    private static List<Detector> getDefaultDetectors(
            MimeTypes types, ServiceLoader loader) {
        List<Detector> detectors =
                loader.loadStaticServiceProviders(Detector.class);
        Collections.sort(detectors, new Comparator<Detector>() {
            public int compare(Detector d1, Detector d2) {
                String n1 = d1.getClass().getName();
                String n2 = d2.getClass().getName();
                boolean t1 = n1.startsWith("org.apache.tika.");
                boolean t2 = n2.startsWith("org.apache.tika.");
                if (t1 == t2) {
                    return n1.compareTo(n2);
                } else if (t1) {
                    return 1;
                } else {
                    return -1;
                }
            }
        });
        // Finally the Tika MimeTypes as a fallback
        detectors.add(types);
        return detectors;
    }

    private transient final ServiceLoader loader;

    public DefaultDetector(MimeTypes types, ServiceLoader loader) {
        super(types.getMediaTypeRegistry(), getDefaultDetectors(types, loader));
        this.loader = loader;
    }

    public DefaultDetector(MimeTypes types, ClassLoader loader) {
        this(types, new ServiceLoader(loader));
    }

    public DefaultDetector(ClassLoader loader) {
        this(MimeTypes.getDefaultMimeTypes(), loader);
    }

    public DefaultDetector(MimeTypes types) {
        this(types, new ServiceLoader());
    }

    public DefaultDetector() {
        this(MimeTypes.getDefaultMimeTypes());
    }

    @Override
    public List<Detector> getDetectors() {
        if (loader != null) {
            // Dynamically registered detectors come first, then the static list
            List<Detector> detectors =
                    loader.loadDynamicServiceProviders(Detector.class);
            detectors.addAll(super.getDetectors());
            return detectors;
        } else {
            return super.getDetectors();
        }
    }

}
  

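The method List<Detector> getDefaultDetectors(MimeTypes types, ServiceLoader loader) loads the statically registered Detector implementations, while List<Detector> getDetectors() loads the dynamically registered ones and appends the parent class's detector list.
Note that the former additionally calls detectors.add(types), which adds the MimeTypes types object to the list as well; this works because MimeTypes itself implements the Detector interface, as mentioned in an earlier post.
So four detector classes are actually used: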
  org.apache.tika.parser.microsoft.POIFSContainerDetector
  org.apache.tika.parser.pkg.ZipContainerDetector
  org.gagravarr.tika.OggDetector
  org.apache.tika.mime.MimeTypes
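
The order in which these four run follows from the comparator in getDefaultDetectors. The sketch below replays that comparator on class-name strings so it can be run without Tika; DetectorOrderSketch and its order method are hypothetical names, and sorting strings rather than Detector instances is a simplification.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Replays the ordering rule of getDefaultDetectors on class names (illustration only). */
class DetectorOrderSketch {

    static List<String> order(List<String> discovered, String fallback) {
        List<String> sorted = new ArrayList<>(discovered);
        sorted.sort(new Comparator<String>() {
            @Override
            public int compare(String n1, String n2) {
                boolean t1 = n1.startsWith("org.apache.tika.");
                boolean t2 = n2.startsWith("org.apache.tika.");
                if (t1 == t2) {
                    return n1.compareTo(n2); // same group: alphabetical by class name
                }
                return t1 ? 1 : -1; // Tika's own classes sort after third-party ones
            }
        });
        sorted.add(fallback); // the MimeTypes fallback always goes last
        return sorted;
    }
}
```

Applied to the three discovered classes with org.apache.tika.mime.MimeTypes as the fallback, this yields OggDetector first (it is outside org.apache.tika), then POIFSContainerDetector, then ZipContainerDetector, then MimeTypes.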


So how do we call all this?


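import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.config.ServiceLoader;
import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.detect.Detector;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.mime.MimeTypes;

// DetectDemo is a placeholder class name added so the snippet compiles on its own
public class DetectDemo {
    public static void main(String[] args) throws IOException {
        ServiceLoader loader = new ServiceLoader();
        MimeTypes mimeTypes = MimeTypes.getDefaultMimeTypes();
        Detector detector = new DefaultDetector(mimeTypes, loader);
        File file = new File("[file path]");
        // BufferedInputStream supports mark/reset, which the Detector contract requires
        try (InputStream stream = new BufferedInputStream(new FileInputStream(file))) {
            MediaType type = detector.detect(stream, new Metadata());
            System.out.println("MIME type: " + type.toString());
        }
    }
}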
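  Still to be analyzed: how Tika loads the Parser implementations, and how it invokes the matching Parser for a document's MIME type. Those are relatively easy to work through by now, so they are left for the next post.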
