I want to parse the content of a JPEG image, not just its metadata.
The following code is the test class:
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.parser.ocr.TesseractOCRParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class JpegParse {
    public static void main(final String[] args) throws IOException, SAXException, TikaException, InterruptedException {
        File file = new File("/path/to/menu.jpg");
        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        ParseContext pcontext = new ParseContext();

        // OCR settings: which language data to use and where Tesseract lives.
        TesseractOCRConfig config = new TesseractOCRConfig();
        // The value must match a <lang>.traineddata file in tessdata
        // (stock Tesseract names its Chinese data chi_sim and chi_tra).
        config.setLanguage("chi");
        config.setTesseractPath("/path/to/tesseract-ocr");
        pcontext.set(TesseractOCRConfig.class, config);

        TesseractOCRParser jpegParser = new TesseractOCRParser();
        pcontext.set(TesseractOCRParser.class, jpegParser);

        // try-with-resources closes the stream even if parsing fails.
        try (InputStream inputstream = new FileInputStream(file)) {
            jpegParser.parse(inputstream, handler, metadata, pcontext);
        }

        System.out.println("Metadata of the document:");
        String[] metadataNames = metadata.names();
        for (String name : metadataNames) {
            System.out.println(name + ": " + metadata.get(name));
        }
        // The handler holds the text recognized by OCR.
        System.out.println("Contents of the document:" + handler.toString());
    }
}
Note: the directory passed to config.setTesseractPath("/path/to/tesseract-ocr");
must be the parent directory that contains the tessdata directory.
The tesseract command must also be linked into this directory:
# ln -s /usr/local/bin/tesseract /path/to/tesseract-ocr
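If parsing returns no text, it may help to confirm the directory layout described in the note before blaming the parser. The following is only a rough sketch using plain java.nio, not part of Tika: the class name TesseractPathCheck and the method missingEntries are made up for illustration, it assumes a Unix-style setup where the command is simply called tesseract, and it assumes the language data follows Tesseract's usual <lang>.traineddata naming.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not a Tika class): checks the layout that
// setTesseractPath() is expected to point at.
class TesseractPathCheck {

    /**
     * Returns a short description of each expected entry that is missing
     * under tesseractDir. An empty list means the layout looks complete.
     */
    static List<String> missingEntries(Path tesseractDir, String language) {
        List<String> missing = new ArrayList<>();

        // The tesseract command, or a working symlink to it, should sit
        // directly in the directory. A dangling symlink counts as missing.
        if (!Files.exists(tesseractDir.resolve("tesseract"))) {
            missing.add("tesseract command");
        }

        // tessdata must be a child directory, and it should hold the
        // trained data file for the configured language.
        Path tessdata = tesseractDir.resolve("tessdata");
        if (!Files.isDirectory(tessdata)) {
            missing.add("tessdata directory");
        } else if (!Files.isRegularFile(tessdata.resolve(language + ".traineddata"))) {
            missing.add(language + ".traineddata");
        }
        return missing;
    }

    public static void main(String[] args) {
        // Defaults mirror the placeholder values used in the test class.
        Path dir = Paths.get(args.length > 0 ? args[0] : "/path/to/tesseract-ocr");
        String language = args.length > 1 ? args[1] : "chi";

        List<String> missing = missingEntries(dir, language);
        if (missing.isEmpty()) {
            System.out.println("Layout looks OK: " + dir);
        } else {
            System.out.println("Missing under " + dir + ": " + missing);
        }
    }
}
```

Run it with the same directory and language you pass to TesseractOCRConfig. If it reports a missing .traineddata file, the language code and the files in tessdata probably do not match.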
References
https://wiki.apache.org/tika/TikaOCR
http://www.kaiyuanba.cn/html/1/131/227/7891.htm