Solr: Tika with an OCR engine

超酷小 · posted on 2016-12-16 09:17:06

　　I want to extract the text content of a JPG image with OCR, not just its metadata.
　　The following test class does this:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.parser.ocr.TesseractOCRParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class JpegParse {
    public static void main(final String[] args) throws IOException, SAXException, TikaException, InterruptedException {
        File file = new File("/path/to/menu.jpg");
        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();

        // OCR settings: the language must match a <language>.traineddata file in tessdata
        TesseractOCRConfig config = new TesseractOCRConfig();
        config.setLanguage("chi");
        config.setTesseractPath("/path/to/tesseract-ocr");

        // Hand the OCR configuration and the parser to Tika through the parse context
        ParseContext pcontext = new ParseContext();
        pcontext.set(TesseractOCRConfig.class, config);
        TesseractOCRParser jpegParser = new TesseractOCRParser();
        pcontext.set(TesseractOCRParser.class, jpegParser);

        // try-with-resources closes the stream even if parsing fails
        try (FileInputStream inputstream = new FileInputStream(file)) {
            jpegParser.parse(inputstream, handler, metadata, pcontext);
        }

        System.out.println("Metadata of the document:");
        for (String name : metadata.names()) {
            System.out.println(name + ": " + metadata.get(name));
        }
        System.out.println("Contents of the document:" + handler.toString());
    }
}

　　Note:
　　The directory passed to config.setTesseractPath("/path/to/tesseract-ocr") must be the parent directory that contains the tessdata directory.
　　The tesseract command must also be available in that directory, for example through a symlink:
　　# ln -s /usr/local/bin/tesseract /path/to/tesseract-ocr
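
　　The layout described in the note can be checked before running the parser. The following is only a minimal sketch using the plain JDK: the class name TesseractLayoutCheck and the method findProblems are hypothetical helpers, not part of Tika, and the sketch assumes a Unix-like system where the command is named tesseract and the language data lives in tessdata/<language>.traineddata.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Tika): checks the directory layout from the note.
class TesseractLayoutCheck {

    // Returns one message per problem found; an empty list means the layout looks usable.
    static List<String> findProblems(Path tesseractPath, String language) {
        List<String> problems = new ArrayList<>();

        // The tesseract command (or a symlink to it) must be inside the directory.
        // Files.exists follows symlinks, so a dangling link is reported as missing.
        Path binary = tesseractPath.resolve("tesseract");
        if (!Files.exists(binary)) {
            problems.add("missing tesseract command: " + binary);
        }

        // The directory must be the parent of tessdata.
        Path tessdata = tesseractPath.resolve("tessdata");
        if (!Files.isDirectory(tessdata)) {
            problems.add("missing tessdata directory: " + tessdata);
        }

        // Assumption: the language passed to setLanguage maps to <language>.traineddata.
        Path trained = tessdata.resolve(language + ".traineddata");
        if (!Files.isRegularFile(trained)) {
            problems.add("missing language data: " + trained);
        }
        return problems;
    }

    public static void main(String[] args) {
        Path dir = Paths.get(args.length > 0 ? args[0] : "/path/to/tesseract-ocr");
        String language = args.length > 1 ? args[1] : "chi";
        List<String> problems = findProblems(dir, language);
        if (problems.isEmpty()) {
            System.out.println("Layout looks usable for language " + language);
        } else {
            problems.forEach(System.out::println);
        }
    }
}
```

　　Running it with the same path and language that are passed to TesseractOCRConfig gives a quick hint about which part of the setup is missing when OCR returns no text.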
　　References
　　https://wiki.apache.org/tika/TikaOCR
　　http://www.kaiyuanba.cn/html/1/131/227/7891.htm
