Problem parsing .doc and .docx file formats with the Apache Tika Parser object

Problem description:

I am trying to use org.apache.tika.parser.Parser and DefaultDetector() to detect and parse .doc and .docx files, but I am getting an Error (not an Exception) thrown from the Tika jars, and it doesn't come with any helpful stack trace for me to put here. I can confirm that it happens for .doc and .docx only; PDF, JPEG and text files are fine. Has anyone come across this problem with the .doc and .docx file formats? Is there a solution that you have adopted?

My code is as follows:

unzippedBytes = loadUnzippedByteCode(attachment.getContents()); /* Utility method written with the native Java zip library; returns a byte[] */

            /* All the objects below were declared beforehand, but not initialised until now */

            parseContextObj = new ParseContext();
            dObj = new DefaultDetector();
            detectedParser = new AutoDetectParser(dObj);
            parseContextObj.set(Parser.class, detectedParser); // register the parser declared above so embedded documents are parsed with it too
            OutputStream outputstream = new ByteArrayOutputStream();
            metadata = new Metadata();

            InputStream input = TikaInputStream.get(unzippedBytes, metadata);
            ContentHandler handler = new BodyContentHandler(outputstream);
            detectedParser.parse(input, handler, metadata, parseContextObj); // This is where it throws NoSuchMethodError - cannot understand why and also cannot get the stack trace - using Tika 1.10
            input.close();

I found the code above in another SO question and decided to use it for my work. The byte[] I am using comes from the very old Struts 1.0 FormFile interface (getFileData(), which returns byte[]). I used to parse with Bullhorn's irex parser, but decided to switch to Tika for numerous reasons. The byte[] works fine with irex, but I run into issues whenever I try to parse .docx and .doc contents.

The following is the stack trace, parts of which I have masked for privacy reasons:

2016-01-15 16:21:06,947 [http-apr-80-exec-3] [ERROR] XXXXX.XXXX.XXXXService - java.lang.NoSuchMethodError: org.apache.poi.util.POILogger.log(I[Ljava/lang/Object;)V
        at org.apache.poi.openxml4j.opc.PackageRelationshipCollection.parseRelationshipsPart(PackageRelationshipCollection.java:313)
        at org.apache.poi.openxml4j.opc.PackageRelationshipCollection.<init>(PackageRelationshipCollection.java:163)
        at org.apache.poi.openxml4j.opc.PackageRelationshipCollection.<init>(PackageRelationshipCollection.java:131)
        at org.apache.poi.openxml4j.opc.PackagePart.loadRelationships(PackagePart.java:561)
        at org.apache.poi.openxml4j.opc.PackagePart.<init>(PackagePart.java:109)
        at org.apache.poi.openxml4j.opc.PackagePart.<init>(PackagePart.java:80)
        at org.apache.poi.openxml4j.opc.PackagePart.<init>(PackagePart.java:125)
        at org.apache.poi.openxml4j.opc.ZipPackagePart.<init>(ZipPackagePart.java:78)
        at org.apache.poi.openxml4j.opc.ZipPackage.getPartsImpl(ZipPackage.java:245)
        at org.apache.poi.openxml4j.opc.OPCPackage.getParts(OPCPackage.java:684)
        at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:227)
        at org.apache.tika.parser.pkg.ZipContainerDetector.detectOPCBased(ZipContainerDetector.java:208)
        at org.apache.tika.parser.pkg.ZipContainerDetector.detectZipFormat(ZipContainerDetector.java:145)
        at org.apache.tika.parser.pkg.ZipContainerDetector.detect(ZipContainerDetector.java:88)
        at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)

I realised that my classpath has POI jar version 2.5.1, and according to the Maven Central repo that makes me a dinosaur (or so it seems). Is that possibly why? I am also getting the error after adding versions 3.13 and 2.60 of the POI artifacts and xmlbeans respectively (suggested by @venkyreddy in that answer).

UPDATE: I tried building a new project separately from my original work, with ONLY tika-app-1.10.jar on my classpath. I also inspected tika-app-1.10.jar and found that all the POI dependencies are actually in there, including xmlbeans and 'xml-schema'. With only tika-app-1.10.jar on my classpath, I get the following Error (not Exception):

java.lang.NoClassDefFoundError: org/apache/poi/POIXMLTypeLoader
        at org.openxmlformats.schemas.wordprocessingml.x2006.main.DocumentDocument$Factory.parse(Unknown Source)
        at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:158)
        at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:167)
        at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:119)
        at org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:59)
        at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:204)
        at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
        at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
        at xxx.xxx.xxx.xxx.xxxxxAttachmentWithTika(xxxService.java:792)

I browsed the package and couldn't find any POIXMLTypeLoader class. Is this a known issue? Could someone please respond?

Make sure there are no outdated POI jars on the classpath, and use the version of POI that matches the version of Tika you are trying to use.
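
One way to look for outdated or duplicated POI jars is to ask the class loader for every location that contains a given class file. The following is only a minimal sketch using the standard JDK API; the class name ClasspathDuplicates and the helper findAll are made up for illustration and are not part of Tika or POI. If it prints more than one location for POILogger, several POI versions are probably competing on the classpath.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

// Hypothetical diagnostic helper, not part of Tika or POI.
final class ClasspathDuplicates {

    /** Returns every location visible to the class loader that contains the given class file. */
    public static List<URL> findAll(String className) {
        String resource = className.replace('.', '/') + ".class";
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClassLoader.getSystemClassLoader();
        }
        try {
            return Collections.list(loader.getResources(resource));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // POILogger is the class named in the NoSuchMethodError from the question.
        for (URL location : findAll("org.apache.poi.util.POILogger")) {
            System.out.println(location);
        }
    }
}
```

In a web application this should be run from code inside the application itself (for example, logged once from the service class), since the container's class loader can see jars that a standalone main method cannot.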

The class POIXMLTypeLoader was added to POI after POI 3.13 was released, so it seems you are somehow mixing in a newer version. POI 3.14-beta1 is the first release that knows about this class. Make sure you are not including that version somehow.
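
To see which jar a POI class is actually loaded from, and whether POIXMLTypeLoader is available at all, something like the sketch below may help. It relies only on standard JDK calls (Class.forName and ProtectionDomain.getCodeSource); the class WhichJar and its locate method are hypothetical names chosen for this example.

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper, not part of Tika or POI.
final class WhichJar {

    /** Describes where the named class would be loaded from, without initializing it. */
    public static String locate(String className) {
        try {
            // initialize=false: only resolve the class, do not run its static initializers
            Class<?> type = Class.forName(className, false, WhichJar.class.getClassLoader());
            CodeSource source = type.getProtectionDomain().getCodeSource();
            // Classes that ship with the JDK itself report no code source
            return source == null
                    ? "JDK runtime (no code source)"
                    : String.valueOf(source.getLocation());
        } catch (ClassNotFoundException | LinkageError e) {
            return "not on classpath";
        }
    }

    public static void main(String[] args) {
        System.out.println("POILogger:        " + locate("org.apache.poi.util.POILogger"));
        System.out.println("POIXMLTypeLoader: " + locate("org.apache.poi.POIXMLTypeLoader"));
    }
}
```

If POILogger resolves to an old POI jar, or POIXMLTypeLoader is reported as missing while the ooxml-schemas classes expect it, the jars on the classpath most likely come from different POI releases.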