public interface EncodingDetector {

    /**
     * Detects the character encoding of the given text document, or
     * null if the encoding of the document can not be detected.
     *
     * If the document input stream is not available, then the first
     * argument may be null. Otherwise the detector may
     * read bytes from the start of the stream to help in encoding detection.
     * The given stream is guaranteed to support the
     * {@link InputStream#markSupported() mark feature} and the detector
     * is expected to {@link InputStream#mark(int) mark} the stream before
     * reading any bytes from it, and to {@link InputStream#reset() reset}
     * the stream before returning. The stream must not be closed by the
     * detector.
     *
     * The given input metadata is only read, not modified, by the detector.
     *
     * @param input text document input stream, or null
     * @param metadata input metadata for the document
     * @return detected character encoding, or null
     * @throws IOException if the document input stream could not be read
     */
    Charset detect(InputStream input, Metadata metadata) throws IOException;

}
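To make this contract concrete, here is a minimal sketch of a custom detector that only sniffs a UTF-8 byte-order mark. The class name BomSniffingDetector is invented for illustration, and the import paths assume Tika's usual packages (org.apache.tika.detect, org.apache.tika.metadata); only the mark-before-read, reset-before-return and null-handling rules come from the javadoc above.

import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

import org.apache.tika.detect.EncodingDetector;
import org.apache.tika.metadata.Metadata;

// Hypothetical example, not part of Tika itself
public class BomSniffingDetector implements EncodingDetector {

    @Override
    public Charset detect(InputStream input, Metadata metadata) throws IOException {
        if (input == null) {
            return null;                      // the stream may be absent
        }
        input.mark(3);                        // mark before reading any bytes
        byte[] bom = new byte[3];
        int n = input.read(bom);
        input.reset();                        // always reset before returning
        if (n == 3 && bom[0] == (byte) 0xEF
                && bom[1] == (byte) 0xBB && bom[2] == (byte) 0xBF) {
            return StandardCharsets.UTF_8;    // UTF-8 byte-order mark found
        }
        return null;                          // unknown; never close the stream here
    }
}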
The EncodingDetector interface has three implementations: HtmlEncodingDetector, UniversalEncodingDetector, and Icu4jEncodingDetector. Their names largely reveal what each one does or which component it builds on. Tika's default web-page encoding detection is flawed: when it parses an HTML file whose markup declares the wrong charset, the extracted text comes out garbled.
The key parts of the HtmlEncodingDetector source are as follows:
public class HtmlEncodingDetector implements EncodingDetector {

    // TIKA-357 - use bigger buffer for meta tag sniffing (was 4K)
    private static final int META_TAG_BUFFER_SIZE = 8192;

    // matches <meta ...> tags; their attributes are then scanned for a charset
    private static final Pattern HTTP_EQUIV_PATTERN = Pattern.compile(
            "(?is)<\\s*meta\\s+([^<>]+)");

    public Charset detect(InputStream input, Metadata metadata)
            throws IOException {
        // marks the stream, reads up to META_TAG_BUFFER_SIZE bytes,
        // decodes them as US-ASCII, scans the meta tags for a charset
        // declaration, then resets the stream before returning
        ...
    }
}
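For reference, the detector can also be driven directly. The sketch below is an assumed usage example rather than code taken from Tika: the class name HtmlCharsetSniff is made up, and the package org.apache.tika.parser.html is where HtmlEncodingDetector lives in Tika 1.x. The file stream is wrapped in a BufferedInputStream so the mark/reset requirement holds, and whatever charset the page declares in its meta tag is what the detector returns, which is exactly why a wrong declaration produces garbled output.

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.html.HtmlEncodingDetector;

public class HtmlCharsetSniff {

    public static void main(String[] args) throws IOException {
        // BufferedInputStream supports mark/reset, as the detector requires
        try (InputStream in = new BufferedInputStream(new FileInputStream(args[0]))) {
            Charset charset = new HtmlEncodingDetector().detect(in, new Metadata());
            // null means no usable charset declaration was found in the meta tags
            System.out.println("detected charset: " + charset);
        }
    }
}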
If our program instead uses the UniversalEncodingDetector class to detect a file's encoding, how is that written? The calling code is shown below: