[Experience Sharing] Apache Tika Source Code Study (Part 4)
Posted on 2015-8-1 13:42:30
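  The previous post walked through the source of the concrete parser class HtmlParser and how it parses web documents, which also showed how Apache Tika handles encoding detection.
  (When parsing web pages, HtmlParser does not actually use the SAXParser object from the ParseContext class; it relies on a separate component, TagSoup.)
  This post moves on to the classes Tika uses to handle SAX events when parsing XML-format files. The best parts are saved for later.
  The SAX API, which JAXP includes, defines four event-handling interfaces: EntityResolver, DTDHandler, ContentHandler and ErrorHandler.
  It also provides a default handler class, DefaultHandler, which implements all four interfaces. This makes it easy to write a custom SAX event handler: extend DefaultHandler and override only the methods you need.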
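  As a small illustration of extending DefaultHandler, here is a sketch that uses only the JDK's own SAX parser. The class name ElementCounter and the helper method count are made up for this example and are not part of Tika or of the SAX API.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: counts how often each element name occurs. */
class ElementCounter extends DefaultHandler {
    // LinkedHashMap keeps the element names in first-seen order.
    private final Map<String, Integer> counts = new LinkedHashMap<>();

    // Only startElement is overridden; every other SAX event keeps
    // the do-nothing behaviour inherited from DefaultHandler.
    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        counts.merge(qName, 1, Integer::sum);
    }

    /** Parses the XML string with the JDK's default SAX parser and returns the counts. */
    static Map<String, Integer> count(String xml) throws Exception {
        ElementCounter handler = new ElementCounter();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.counts;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(count("<a><b/><b/><c/></a>")); // prints {a=1, b=2, c=1}
    }
}
```

  The handler never mentions EntityResolver, DTDHandler or ErrorHandler; DefaultHandler supplies those implementations.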
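  Apache Tika's event-handling classes use the decorator pattern, and the wrappers nest layer upon layer, which makes them dizzying to read. The analysis below covers only some of the classes; the other handlers work in a similar way and are not repeated here.
  First, the UML model of the key classes:
   [Figure DSC0000.png: UML diagram of the key classes]
  ContentHandlerDecorator extends the SAX default handler class DefaultHandler. As its name suggests, it uses the decorator pattern. Its source code follows: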



import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/**
 * Decorator base class for the {@link ContentHandler} interface. This class
 * simply delegates all SAX events calls to an underlying decorated handler
 * instance. Subclasses can provide extra decoration by overriding one or more
 * of the SAX event methods.
 */
public class ContentHandlerDecorator extends DefaultHandler {

    /**
     * Decorated SAX event handler.
     */
    private ContentHandler handler;

    /**
     * Creates a decorator for the given SAX event handler.
     *
     * @param handler SAX event handler to be decorated
     */
    public ContentHandlerDecorator(ContentHandler handler) {
        assert handler != null;
        this.handler = handler;
    }

    /**
     * Creates a decorator that by default forwards incoming SAX events to
     * a dummy content handler that simply ignores all the events. Subclasses
     * should use the {@link #setContentHandler(ContentHandler)} method to
     * switch to a more usable underlying content handler.
     */
    protected ContentHandlerDecorator() {
        this(new DefaultHandler());
    }

    /**
     * Sets the underlying content handler. All future SAX events will be
     * directed to this handler instead of the one that was previously used.
     *
     * @param handler content handler
     */
    protected void setContentHandler(ContentHandler handler) {
        assert handler != null;
        this.handler = handler;
    }

    @Override
    public void startPrefixMapping(String prefix, String uri)
            throws SAXException {
        try {
            handler.startPrefixMapping(prefix, uri);
        } catch (SAXException e) {
            handleException(e);
        }
    }

    @Override
    public void endPrefixMapping(String prefix) throws SAXException {
        try {
            handler.endPrefixMapping(prefix);
        } catch (SAXException e) {
            handleException(e);
        }
    }

    @Override
    public void processingInstruction(String target, String data)
            throws SAXException {
        try {
            handler.processingInstruction(target, data);
        } catch (SAXException e) {
            handleException(e);
        }
    }

    // The only event method that cannot throw SAXException, so no try/catch.
    @Override
    public void setDocumentLocator(Locator locator) {
        handler.setDocumentLocator(locator);
    }

    @Override
    public void startDocument() throws SAXException {
        try {
            handler.startDocument();
        } catch (SAXException e) {
            handleException(e);
        }
    }

    @Override
    public void endDocument() throws SAXException {
        try {
            handler.endDocument();
        } catch (SAXException e) {
            handleException(e);
        }
    }

    @Override
    public void startElement(
            String uri, String localName, String name, Attributes atts)
            throws SAXException {
        try {
            handler.startElement(uri, localName, name, atts);
        } catch (SAXException e) {
            handleException(e);
        }
    }

    @Override
    public void endElement(String uri, String localName, String name)
            throws SAXException {
        try {
            handler.endElement(uri, localName, name);
        } catch (SAXException e) {
            handleException(e);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        try {
            handler.characters(ch, start, length);
        } catch (SAXException e) {
            handleException(e);
        }
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length)
            throws SAXException {
        try {
            handler.ignorableWhitespace(ch, start, length);
        } catch (SAXException e) {
            handleException(e);
        }
    }

    @Override
    public void skippedEntity(String name) throws SAXException {
        try {
            handler.skippedEntity(name);
        } catch (SAXException e) {
            handleException(e);
        }
    }

    // Delegated as well, so a decorator prints whatever the wrapped handler prints.
    @Override
    public String toString() {
        return handler.toString();
    }

    /**
     * Handle any exceptions thrown by methods in this class. This method
     * provides a single place to implement custom exception handling. The
     * default behaviour is simply to re-throw the given exception, but
     * subclasses can also provide alternative ways of handling the situation.
     *
     * @param exception the exception that was thrown
     * @throws SAXException the exception (if any) thrown to the client
     */
    protected void handleException(SAXException exception) throws SAXException {
        throw exception;
    }
}
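  The decorator holds a reference to a ContentHandler object, and each of its event methods simply calls the method of the same name on that ContentHandler. Any SAXException thrown by the wrapped handler is routed through handleException, which rethrows it by default.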
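  To make the two extension points concrete (overriding an event method, and overriding handleException), here is a JDK-only sketch. ForwardingHandler is a deliberately cut-down stand-in for ContentHandlerDecorator that forwards only characters(); it and the other class names (UpperCaseHandler, LenientHandler, TextCollector, DecoratorDemo) are invented for this example and do not exist in Tika.

```java
import java.util.Locale;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Cut-down stand-in for ContentHandlerDecorator: forwards characters() only. */
class ForwardingHandler extends DefaultHandler {
    private final ContentHandler handler;

    ForwardingHandler(ContentHandler handler) {
        this.handler = handler;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        try {
            handler.characters(ch, start, length);
        } catch (SAXException e) {
            handleException(e);
        }
    }

    /** Same hook as in the listing above: rethrow by default. */
    protected void handleException(SAXException e) throws SAXException {
        throw e;
    }
}

/** Hypothetical decoration 1: change the event, then forward it via super. */
class UpperCaseHandler extends ForwardingHandler {
    UpperCaseHandler(ContentHandler handler) {
        super(handler);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        char[] upper = new String(ch, start, length).toUpperCase(Locale.ROOT).toCharArray();
        super.characters(upper, 0, upper.length);
    }
}

/** Hypothetical decoration 2: count failures of the wrapped handler instead of rethrowing. */
class LenientHandler extends ForwardingHandler {
    int failures = 0;

    LenientHandler(ContentHandler handler) {
        super(handler);
    }

    @Override
    protected void handleException(SAXException e) {
        failures++;
    }
}

/** Innermost handler: collects the text it receives. */
class TextCollector extends DefaultHandler {
    final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }
}

class DecoratorDemo {
    /** Sends one characters() event through a two-layer decorator chain. */
    static String shout(String s) throws SAXException {
        TextCollector sink = new TextCollector();
        ContentHandler chain = new UpperCaseHandler(new ForwardingHandler(sink));
        chain.characters(s.toCharArray(), 0, s.length());
        return sink.text.toString();
    }

    /** Wraps a handler that always fails and reports how many failures were absorbed. */
    static int countFailures() throws SAXException {
        ContentHandler failing = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) throws SAXException {
                throw new SAXException("boom");
            }
        };
        LenientHandler lenient = new LenientHandler(failing);
        lenient.characters("x".toCharArray(), 0, 1);
        return lenient.failures;
    }
}
```

  Because every layer is itself a ContentHandler, layers can be stacked in any order, which is how the nesting described earlier comes about.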
  Next, the source of a concrete decorator class, BodyContentHandler:



import java.io.OutputStream;
import java.io.Writer;

import org.apache.tika.sax.xpath.Matcher;
import org.apache.tika.sax.xpath.MatchingContentHandler;
import org.apache.tika.sax.xpath.XPathParser;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

/**
 * Content handler decorator that only passes everything inside
 * the XHTML <body/> tag to the underlying handler. Note that
 * the <body/> tag itself is not passed on.
 */
public class BodyContentHandler extends ContentHandlerDecorator {

    /**
     * XHTML XPath parser.
     */
    private static final XPathParser PARSER =
            new XPathParser("xhtml", XHTMLContentHandler.XHTML);

    /**
     * The XPath matcher used to select the XHTML body contents.
     */
    private static final Matcher MATCHER =
            PARSER.parse("/xhtml:html/xhtml:body/descendant::node()");

    /**
     * Creates a content handler that passes all XHTML body events to the
     * given underlying content handler.
     *
     * @param handler content handler
     */
    public BodyContentHandler(ContentHandler handler) {
        // Every other constructor ends up here.
        super(new MatchingContentHandler(handler, MATCHER));
    }

    /**
     * Creates a content handler that writes XHTML body character events to
     * the given writer.
     *
     * @param writer writer
     */
    public BodyContentHandler(Writer writer) {
        this(new WriteOutContentHandler(writer));
    }

    /**
     * Creates a content handler that writes XHTML body character events to
     * the given output stream using the default encoding.
     *
     * @param stream output stream
     */
    public BodyContentHandler(OutputStream stream) {
        this(new WriteOutContentHandler(stream));
    }

    /**
     * Creates a content handler that writes XHTML body character events to
     * an internal string buffer. The contents of the buffer can be retrieved
     * using the {@link #toString()} method.
     *
     * The internal string buffer is bounded at the given number of characters.
     * If this write limit is reached, then a {@link SAXException} is thrown.
     *
     * @since Apache Tika 0.7
     * @param writeLimit maximum number of characters to include in the string,
     *                   or -1 to disable the write limit
     */
    public BodyContentHandler(int writeLimit) {
        this(new WriteOutContentHandler(writeLimit));
    }

    /**
     * Creates a content handler that writes XHTML body character events to
     * an internal string buffer. The contents of the buffer can be retrieved
     * using the {@link #toString()} method.
     *
     * The internal string buffer is bounded at 100k characters. If this write
     * limit is reached, then a {@link SAXException} is thrown.
     */
    public BodyContentHandler() {
        this(new WriteOutContentHandler());
    }
}
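  In the end, every constructor initializes the decorated object by calling the superclass constructor: the Writer, OutputStream, int and no-argument constructors each build a WriteOutContentHandler and delegate to BodyContentHandler(ContentHandler), which wraps it in a MatchingContentHandler and passes that to super.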
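  The effect of this wrapping, text from inside the body element being collected and everything else being dropped, can be imitated with the JDK alone. The sketch below is not Tika's implementation: Tika matches the XPath expression shown above through MatchingContentHandler, whereas this hypothetical BodyTextSketch class only tracks nesting depth, ignores namespaces, and accepts a body element at any position in the document.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/**
 * Hypothetical JDK-only imitation of "body only" filtering.
 * Simplification: depth counting instead of Tika's XPath matcher.
 */
class BodyTextSketch extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    // 0 means outside <body>; 1 means directly inside it; 2 one level deeper, and so on.
    private int bodyDepth = 0;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (bodyDepth > 0 || "body".equals(qName)) {
            bodyDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (bodyDepth > 0) {
            bodyDepth--;
        }
    }

    // Character events are kept only while the parser is inside <body>.
    @Override
    public void characters(char[] ch, int start, int length) {
        if (bodyDepth > 0) {
            text.append(ch, start, length);
        }
    }

    /** Like the string-buffer variants above, the collected text is read via toString(). */
    @Override
    public String toString() {
        return text.toString();
    }

    /** Parses well-formed XHTML with the JDK's default SAX parser and returns the body text. */
    static String bodyText(String xhtml) throws Exception {
        BodyTextSketch handler = new BodyTextSketch();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        String doc = "<html><head><title>T</title></head>"
                + "<body><p>Hello</p> <b>Tika</b></body></html>";
        System.out.println(bodyText(doc)); // prints Hello Tika
    }
}
```

  The title text "T" never reaches the buffer because it arrives while bodyDepth is still 0. In Tika the same selection is done declaratively by the XPath matcher, and the collecting is left to WriteOutContentHandler.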