Apache Tika Source Code Study (4)
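The previous post analyzed the source of the concrete parser class HtmlParser and how it parses web documents, and showed how Apache Tika handles encoding detection. (When parsing web pages, HtmlParser does not actually use the SAXParser object from the ParseContext class; it uses a separate component, TagSoup, instead.)
This post continues with the event handler classes Tika uses for SAX parsing of XML files; the best parts are saved for later.
The SAX API that ships with JAXP defines four event handler interfaces: EntityResolver, DTDHandler, ContentHandler, and ErrorHandler.
It also provides a default handler class, DefaultHandler, which implements all four interfaces. This makes writing a custom SAX event handler easy: extend that class and override only the callbacks you need.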
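As a minimal sketch of extending DefaultHandler, the example below uses only the JDK's built-in SAX parser. The class names TextCollectingDemo and TextCollector and the helper method parse are hypothetical, made up for illustration; they are not part of Tika.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class TextCollectingDemo {

    /** Hypothetical handler: overrides only the two callbacks it cares about. */
    static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();
        int elementCount = 0;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            elementCount++; // one call per opening tag
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // May be called several times for one text node, so always append.
            text.append(ch, start, length);
        }
    }

    /** Parses the given XML string and returns the populated handler. */
    static TextCollector parse(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        SAXParser parser = factory.newSAXParser();
        TextCollector collector = new TextCollector();
        parser.parse(new InputSource(new StringReader(xml)), collector);
        return collector;
    }

    public static void main(String[] args) throws Exception {
        TextCollector c = parse("<doc><p>Hello</p><p>Tika</p></doc>");
        System.out.println(c.elementCount + " elements, text: " + c.text);
    }
}
```

All other callbacks (entity resolution, DTD events, error reporting) fall through to the inherited defaults, which is what makes DefaultHandler a convenient base class.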
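Apache Tika's event handler classes use the decorator pattern, with wrappers nested layer upon layer, which can be dizzying to follow. The analysis below covers only some of these classes; the other handler classes work in a similar way and are not discussed again.
First, let's look at the UML model of the key classes.
ContentHandlerDecorator extends the SAX default handler class DefaultHandler. As its name suggests, it uses the decorator pattern. Its source is shown below: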
/**
 * Decorator base class for the {@link ContentHandler} interface. This class
 * simply delegates all SAX events calls to an underlying decorated handler
 * instance. Subclasses can provide extra decoration by overriding one or more
 * of the SAX event methods.
 */
public class ContentHandlerDecorator extends DefaultHandler {

    /**
     * Decorated SAX event handler.
     */
    private ContentHandler handler;

    /**
     * Creates a decorator for the given SAX event handler.
     *
     * @param handler SAX event handler to be decorated
     */
    public ContentHandlerDecorator(ContentHandler handler) {
        assert handler != null;
        this.handler = handler;
    }

    /**
     * Creates a decorator that by default forwards incoming SAX events to
     * a dummy content handler that simply ignores all the events. Subclasses
     * should use the {@link #setContentHandler(ContentHandler)} method to
     * switch to a more usable underlying content handler.
     */
    protected ContentHandlerDecorator() {
        this(new DefaultHandler());
    }

    /**
     * Sets the underlying content handler. All future SAX events will be
     * directed to this handler instead of the one that was previously used.
     *
     * @param handler content handler
     */
    protected void setContentHandler(ContentHandler handler) {
        assert handler != null;
        this.handler = handler;
    }

    @Override
    public void startPrefixMapping(String prefix, String uri)
            throws SAXException {
        try {
            handler.startPrefixMapping(prefix, uri);
        } catch (SAXException e) {
            handleException(e);
        }
    }

    @Override
    public void endPrefixMapping(String prefix) throws SAXException {
        try {
            handler.endPrefixMapping(prefix);
        } catch (SAXException e) {
            handleException(e);
        }
    }

    @Override
    public void processingInstruction(String target, String data)
            throws SAXException {
        try {
            handler.processingInstruction(target, data);
        } catch (SAXException e) {
            handleException(e);
        }
    }

    @Override
    public void setDocumentLocator(Locator locator) {
        handler.setDocumentLocator(locator);
    }

    @Override
    public void startDocument() throws SAXException {
        try {
            handler.startDocument();
        } catch (SAXException e) {
            handleException(e);
        }
    }

    @Override
    public void endDocument() throws SAXException {
        try {
            handler.endDocument();
        } catch (SAXException e) {
            handleException(e);
        }
    }

    @Override
    public void startElement(
            String uri, String localName, String name, Attributes atts)
            throws SAXException {
        try {
            handler.startElement(uri, localName, name, atts);
        } catch (SAXException e) {
            handleException(e);
        }
    }

    @Override
    public void endElement(String uri, String localName, String name)
            throws SAXException {
        try {
            handler.endElement(uri, localName, name);
        } catch (SAXException e) {
            handleException(e);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        try {
            handler.characters(ch, start, length);
        } catch (SAXException e) {
            handleException(e);
        }
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length)
            throws SAXException {
        try {
            handler.ignorableWhitespace(ch, start, length);
        } catch (SAXException e) {
            handleException(e);
        }
    }

    @Override
    public void skippedEntity(String name) throws SAXException {
        try {
            handler.skippedEntity(name);
        } catch (SAXException e) {
            handleException(e);
        }
    }

    @Override
    public String toString() {
        return handler.toString();
    }

    /**
     * Handle any exceptions thrown by methods in this class. This method
     * provides a single place to implement custom exception handling. The
     * default behaviour is simply to re-throw the given exception, but
     * subclasses can also provide alternative ways of handling the situation.
     *
     * @param exception the exception that was thrown
     * @throws SAXException the exception (if any) thrown to the client
     */
    protected void handleException(SAXException exception) throws SAXException {
        throw exception;
    }

}
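This decorator class holds a reference to a ContentHandler object, and each of its SAX event methods simply calls the method of the same name on that ContentHandler. Any SAXException thrown by the wrapped handler is routed through handleException, which re-throws it by default.
Next, the source of the concrete decorator class BodyContentHandler: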
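To make the delegation concrete, here is a rough, JDK-only sketch of the same idea. MiniDecorator is a cut-down stand-in for ContentHandlerDecorator that forwards only characters(); it is not the Tika class. UpperCaseHandler, LenientHandler, TextSink, and FailingSink are likewise hypothetical names invented for this illustration.

```java
import java.io.StringReader;
import java.util.Locale;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class DecoratorDemo {

    /** Cut-down stand-in for ContentHandlerDecorator: forwards characters() only. */
    static class MiniDecorator extends DefaultHandler {
        private final ContentHandler handler;

        MiniDecorator(ContentHandler handler) {
            this.handler = handler;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            try {
                handler.characters(ch, start, length);
            } catch (SAXException e) {
                handleException(e);
            }
        }

        @Override
        public String toString() {
            return handler.toString();
        }

        /** Single exception hook; the default re-throws, as in the Tika class. */
        protected void handleException(SAXException e) throws SAXException {
            throw e;
        }
    }

    /** Hypothetical decoration 1: upper-case the text, then delegate. */
    static class UpperCaseHandler extends MiniDecorator {
        UpperCaseHandler(ContentHandler handler) {
            super(handler);
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            char[] upper = new String(ch, start, length).toUpperCase(Locale.ROOT).toCharArray();
            super.characters(upper, 0, upper.length);
        }
    }

    /** Hypothetical decoration 2: count failures instead of re-throwing them. */
    static class LenientHandler extends MiniDecorator {
        int swallowed = 0;

        LenientHandler(ContentHandler handler) {
            super(handler);
        }

        @Override
        protected void handleException(SAXException e) {
            swallowed++;
        }
    }

    /** Innermost handler: collects text and exposes it through toString(). */
    static class TextSink extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    /** Innermost handler that rejects every text event. */
    static class FailingSink extends DefaultHandler {
        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            throw new SAXException("sink refused the text");
        }
    }

    static void parse(String xml, DefaultHandler handler) throws Exception {
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
    }

    static String upperCase(String xml) throws Exception {
        UpperCaseHandler chain = new UpperCaseHandler(new TextSink());
        parse(xml, chain);
        return chain.toString(); // delegated down to TextSink.toString()
    }

    static int countSwallowed(String xml) throws Exception {
        LenientHandler chain = new LenientHandler(new FailingSink());
        parse(xml, chain); // completes even though the sink throws
        return chain.swallowed;
    }
}
```

The two subclasses show the two extension points the source offers: overriding an event method to change what is passed down, and overriding handleException to change what happens when the wrapped handler fails.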
/**
 * Content handler decorator that only passes everything inside
 * the XHTML <body/> tag to the underlying handler. Note that
 * the <body/> tag itself is not passed on.
 */
public class BodyContentHandler extends ContentHandlerDecorator {

    /**
     * XHTML XPath parser.
     */
    private static final XPathParser PARSER =
        new XPathParser("xhtml", XHTMLContentHandler.XHTML);

    /**
     * The XPath matcher used to select the XHTML body contents.
     */
    private static final Matcher MATCHER =
        PARSER.parse("/xhtml:html/xhtml:body/descendant::node()");

    /**
     * Creates a content handler that passes all XHTML body events to the
     * given underlying content handler.
     *
     * @param handler content handler
     */
    public BodyContentHandler(ContentHandler handler) {
        super(new MatchingContentHandler(handler, MATCHER));
    }

    /**
     * Creates a content handler that writes XHTML body character events to
     * the given writer.
     *
     * @param writer writer
     */
    public BodyContentHandler(Writer writer) {
        this(new WriteOutContentHandler(writer));
    }

    /**
     * Creates a content handler that writes XHTML body character events to
     * the given output stream using the default encoding.
     *
     * @param stream output stream
     */
    public BodyContentHandler(OutputStream stream) {
        this(new WriteOutContentHandler(stream));
    }

    /**
     * Creates a content handler that writes XHTML body character events to
     * an internal string buffer. The contents of the buffer can be retrieved
     * using the {@link #toString()} method.
     *
     * The internal string buffer is bounded at the given number of characters.
     * If this write limit is reached, then a {@link SAXException} is thrown.
     *
     * @since Apache Tika 0.7
     * @param writeLimit maximum number of characters to include in the string,
     *                   or -1 to disable the write limit
     */
    public BodyContentHandler(int writeLimit) {
        this(new WriteOutContentHandler(writeLimit));
    }

    /**
     * Creates a content handler that writes XHTML body character events to
     * an internal string buffer. The contents of the buffer can be retrieved
     * using the {@link #toString()} method.
     *
     * The internal string buffer is bounded at 100k characters. If this write
     * limit is reached, then a {@link SAXException} is thrown.
     */
    public BodyContentHandler() {
        this(new WriteOutContentHandler());
    }

}
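In the end, every constructor chains to the superclass constructor to initialize the decorated object, which here is a MatchingContentHandler wrapping the supplied handler.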
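The body-only filtering can be approximated without Tika on the classpath. The sketch below is not how BodyContentHandler is implemented: it replaces the XPath matcher with a simple depth counter, and the names BodyFilterDemo, BodyOnlyHandler, and extractBody are hypothetical. It only illustrates the observable behaviour described in the Javadoc, namely that events inside the XHTML body are forwarded while the body tag itself is not.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class BodyFilterDemo {

    static final String XHTML = "http://www.w3.org/1999/xhtml";

    /** Hypothetical filter: forwards only events nested inside /html/body. */
    static class BodyOnlyHandler extends DefaultHandler {
        private final ContentHandler handler;
        private int depth = 0;              // current element nesting depth
        private boolean rootIsHtml = false; // whether the root element is xhtml:html
        private int bodyDepth = -1;         // depth of the open <body>, -1 when outside

        BodyOnlyHandler(ContentHandler handler) {
            this.handler = handler;
        }

        private boolean insideBody() {
            return bodyDepth >= 0;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts)
                throws SAXException {
            depth++;
            if (insideBody()) {
                handler.startElement(uri, localName, qName, atts);
            } else if (depth == 1) {
                rootIsHtml = XHTML.equals(uri) && "html".equals(localName);
            } else if (depth == 2 && rootIsHtml
                    && XHTML.equals(uri) && "body".equals(localName)) {
                bodyDepth = depth; // <body> itself is not forwarded
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            if (insideBody() && depth == bodyDepth) {
                bodyDepth = -1; // </body> itself is not forwarded
            } else if (insideBody()) {
                handler.endElement(uri, localName, qName);
            }
            depth--;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            if (insideBody()) {
                handler.characters(ch, start, length);
            }
        }
    }

    /** Returns the text found inside the XHTML body of the given document. */
    static String extractBody(String xhtml) throws Exception {
        final StringBuilder text = new StringBuilder();
        ContentHandler sink = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed so uri and localName are populated
        factory.newSAXParser().parse(
                new InputSource(new StringReader(xhtml)), new BodyOnlyHandler(sink));
        return text.toString();
    }
}
```

With Tika itself, the usual pattern is to pass a new BodyContentHandler() as the ContentHandler argument of Parser.parse(InputStream, ContentHandler, Metadata, ParseContext) and read the extracted text back with toString().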