Problem parsing .doc and .docx file formats with the Apache Tika parser

Problem description:

When I use org.apache.tika.parser.Parser with DefaultDetector() to detect and parse .doc and .docx files, an Error (not an Exception) is thrown from the Tika jars, and it does not come with a helpful stack trace for me to put here. I can confirm that it happens for .doc and .docx only; PDF, JPEG, and text files are fine. Has anyone come across this problem with the .doc and .docx file formats? Is there a solution that you have adopted?

My code is as follows:

unzippedBytes = loadUnzippedByteCode(attachment.getContents()); /* This is utility method written using native Java Zip library - returns byte array byte[] */

            /* All the objects below were declared beforehand, but not initialised until now */

            parseContextObj = new ParseContext();
            dObj = new DefaultDetector();
            detectedParser = new AutoDetectParser(dObj);
            parseContextObj.set(Parser.class, detectedParser); // register the parser on the same context that is passed to parse()
            OutputStream outputstream = new ByteArrayOutputStream();
            metadata = new Metadata();

            InputStream input = TikaInputStream.get(unzippedBytes, metadata);
            ContentHandler handler = new BodyContentHandler(outputstream);
            detectedParser.parse(input, handler, metadata, parseContextObj); // This is where it throws NoSuchMethodError; I cannot understand why and cannot get the stack trace (using Tika 1.10)
            input.close();

I found the code above in another SO question and decided to use it for my work. The byte[] I pass in comes from the very old Struts 1.0 FormFile interface (getFileData(), which returns byte[]). I used to parse with Bullhorn's irex parser, but decided to switch to Tika for numerous reasons. The byte[] works fine with irex, but fails whenever I try to parse .docx and .doc contents.

The following is the stack trace, with certain parts masked for privacy reasons:

2016-01-15 16:21:06,947 [http-apr-80-exec-3] [ERROR] XXXXX.XXXX.XXXXService - java.lang.NoSuchMethodError: org.apache.poi.util.POILogger.log(I[Ljava/lang/Object;)V
        at org.apache.poi.openxml4j.opc.PackageRelationshipCollection.parseRelationshipsPart(PackageRelationshipCollection.java:313)
        at org.apache.poi.openxml4j.opc.PackageRelationshipCollection.<init>(PackageRelationshipCollection.java:163)
        at org.apache.poi.openxml4j.opc.PackageRelationshipCollection.<init>(PackageRelationshipCollection.java:131)
        at org.apache.poi.openxml4j.opc.PackagePart.loadRelationships(PackagePart.java:561)
        at org.apache.poi.openxml4j.opc.PackagePart.<init>(PackagePart.java:109)
        at org.apache.poi.openxml4j.opc.PackagePart.<init>(PackagePart.java:80)
        at org.apache.poi.openxml4j.opc.PackagePart.<init>(PackagePart.java:125)
        at org.apache.poi.openxml4j.opc.ZipPackagePart.<init>(ZipPackagePart.java:78)
        at org.apache.poi.openxml4j.opc.ZipPackage.getPartsImpl(ZipPackage.java:245)
        at org.apache.poi.openxml4j.opc.OPCPackage.getParts(OPCPackage.java:684)
        at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:227)
        at org.apache.tika.parser.pkg.ZipContainerDetector.detectOPCBased(ZipContainerDetector.java:208)
        at org.apache.tika.parser.pkg.ZipContainerDetector.detectZipFormat(ZipContainerDetector.java:145)
        at org.apache.tika.parser.pkg.ZipContainerDetector.detect(ZipContainerDetector.java:88)
        at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)

I realised that my classpath has the POI jar version 2.5.1, which according to the Maven Central repository makes me a dinosaur (or so it seems). Is that possibly why? I still get an error after adding versions 3.13 and 2.60 of the POI artifacts and xmlbeans respectively (as suggested by @venkyreddy in that answer).

UPDATE: I tried building a new project separately from my original work, with ONLY tika-app-1.10.jar on the classpath. I also inspected tika-app-1.10.jar and found that all the POI dependencies are actually bundled in it, including xmlbeans and 'xml-schema'. With only tika-app-1.10.jar on my classpath, I get the following Error (not Exception):

java.lang.NoClassDefFoundError: org/apache/poi/POIXMLTypeLoader
        at org.openxmlformats.schemas.wordprocessingml.x2006.main.DocumentDocument$Factory.parse(Unknown Source)
        at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:158)
        at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:167)
        at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:119)
        at org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:59)
        at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:204)
        at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
        at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
        at xxx.xxx.xxx.xxx.xxxxxAttachmentWithTika(xxxService.java:792)

I browsed the package and couldn't find any POIXMLTypeLoader class. Is this a known issue? Could someone please respond?

Make sure there are no outdated POI jars on the classpath, and use the version of POI that matches the version of Tika you are trying to use.
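
One way to spot stray or duplicate POI jars is to ask the class loader for every location that supplies a given class. The following is only a rough sketch using standard JDK calls; the class name ClasspathProbe and its helper methods are made up for illustration, and it assumes the probe runs inside the same web application (same class loader) as the failing code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Tika or POI.
final class ClasspathProbe {

    /** Converts a fully qualified class name to its classpath resource path. */
    static String toResourcePath(String className) {
        return className.replace('.', '/') + ".class";
    }

    /** Returns every location visible to the class loader that supplies the class. */
    static List<URL> findCopies(String className) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClassLoader.getSystemClassLoader();
        }
        try {
            return Collections.list(loader.getResources(toResourcePath(className)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The two classes named in the stack traces of the question.
        String[] names = {
            "org.apache.poi.util.POILogger",
            "org.apache.poi.POIXMLTypeLoader"
        };
        for (String name : names) {
            List<URL> copies = findCopies(name);
            System.out.println(name + ": " + copies.size() + " location(s)");
            for (URL url : copies) {
                System.out.println("  " + url);
            }
        }
    }
}
```

If a class is reported from more than one location, or from a jar you did not expect (for example an old poi-2.5.1 jar), that jar is a likely candidate for removal.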

The class POIXMLTypeLoader was added to POI after POI 3.13 was released, so it seems you are somehow mixing in newer versions. Only POI 3.14-beta1 knows about this class! Make sure you are not including that version somehow.
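
To check which jar a class actually comes from at runtime, something along these lines may help. It is a minimal sketch relying only on standard JDK reflection; PoiClassCheck and describe are hypothetical names chosen for this example, and the result only reflects the class loader that the check itself runs under.

```java
import java.security.CodeSource;

// Hypothetical helper, not part of Tika or POI.
final class PoiClassCheck {

    /** Reports where a class is loaded from, or that it cannot be loaded. */
    static String describe(String className) {
        try {
            // 'false' avoids running static initialisers; we only want the location.
            Class<?> cls = Class.forName(className, false, PoiClassCheck.class.getClassLoader());
            CodeSource source = cls.getProtectionDomain().getCodeSource();
            return source == null
                    ? "loaded from the JDK runtime"
                    : "loaded from " + source.getLocation();
        } catch (ClassNotFoundException | LinkageError e) {
            // NoClassDefFoundError and NoSuchMethodError are Errors, not Exceptions,
            // so catching Exception alone would miss them.
            return "not loadable (" + e + ")";
        }
    }

    public static void main(String[] args) {
        System.out.println("POILogger:        " + describe("org.apache.poi.util.POILogger"));
        System.out.println("POIXMLTypeLoader: " + describe("org.apache.poi.POIXMLTypeLoader"));
        System.out.println("XWPFDocument:     " + describe("org.apache.poi.xwpf.usermodel.XWPFDocument"));
    }
}
```

If the reported locations point at jars from different POI releases, that mix is consistent with the NoSuchMethodError and NoClassDefFoundError shown in the question.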