How to detect images in a document

Question:

How can I detect images in a document, say a doc, xls, ppt or pdf file?

I came across Apache Tika and am trying its command-line option: http://tika.apache.org/1.2/gettingstarted.html

But I'm not quite sure how it would detect images.

Any help is appreciated. Thanks.

Answer:

You've said you want a command-line solution and don't want to write any Java code, so this won't be the prettiest way to do it. If you're happy to write a little bit of Java and create a new program to call from Python, you can do it much more nicely!

The first thing to do is have the Tika App extract any embedded resources from your file. Use the --extract option for this, and have the extraction happen in a special temp directory that your app controls, e.g.

$ java -jar tika.jar --extract ../testWORD_embedded_pdf.doc
Extracting 'image1.emf' (application/x-emf)
Extracting '_1402837031.pdf' (application/pdf)

Grab the output of the extraction if you can, and parse it looking for images (but be aware that some images have an application/ prefix on their canonical mimetype!). You might need to run a second --detect step on a few of them. I'm not sure, so test how the parsers get on with the extraction.
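If you end up driving this from Python, the parsing step might look roughly like the sketch below. It assumes the output lines keep the `Extracting 'name' (type)` shape shown in the sample above. The helper names (`is_image`, `find_images`) and the set of non-`image/` types are my own; only application/x-emf comes from the sample, so extend the set after testing on your files.

```python
import re

# Line shape taken from the sample output above: Extracting 'name' (mime/type)
EXTRACT_LINE = re.compile(r"^Extracting '(?P<name>.+)' \((?P<mime>[^()]+)\)\s*$")

# Image types whose canonical mimetype does not start with "image/".
# Assumption: only application/x-emf is known from the sample output;
# add others once you have seen what Tika reports for your documents.
NON_PREFIXED_IMAGE_TYPES = {"application/x-emf"}


def is_image(mime):
    """Return True if the mimetype looks like an image."""
    mime = mime.strip().lower()
    return mime.startswith("image/") or mime in NON_PREFIXED_IMAGE_TYPES


def find_images(tika_output):
    """Return (filename, mimetype) pairs for image resources in --extract output."""
    images = []
    for line in tika_output.splitlines():
        match = EXTRACT_LINE.match(line.strip())
        if match and is_image(match.group("mime")):
            images.append((match.group("name"), match.group("mime")))
    return images


if __name__ == "__main__":
    sample = (
        "Extracting 'image1.emf' (application/x-emf)\n"
        "Extracting '_1402837031.pdf' (application/pdf)\n"
    )
    print(find_images(sample))
```

Lines that don't match the pattern are skipped, so other messages mixed into the output should not break the parse.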

Now, if there were any images, they'll be in your temp dir. Process them however you want. Finally, zap the temp dir when you're done with the file!
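Putting the whole flow together from Python, a rough sketch could look like this. It assumes `java` is on your PATH, that Tika App writes the extracted files into the current working directory (so the process is started inside the temp dir), and that the "Extracting ..." lines go to standard output. The function names and the `handle_file` callback are hypothetical.

```python
import os
import subprocess
import tempfile


def build_extract_command(tika_jar, document):
    """Build the same command as above, with absolute paths.

    Absolute paths are needed because the process runs inside the temp dir.
    """
    return ["java", "-jar", os.path.abspath(tika_jar),
            "--extract", os.path.abspath(document)]


def extract_embedded(tika_jar, document, handle_file):
    """Extract embedded resources into a temp dir and pass each path to handle_file.

    Returns whatever Tika printed on standard output, for later parsing.
    """
    # TemporaryDirectory deletes the directory when the block exits,
    # even if handle_file raises, which takes care of the "zap" step.
    with tempfile.TemporaryDirectory(prefix="tika-extract-") as workdir:
        # Assumption: extracted files land in the working directory.
        result = subprocess.run(
            build_extract_command(tika_jar, document),
            cwd=workdir, capture_output=True, text=True, check=True,
        )
        # Walk the tree in case some resources end up in subdirectories.
        for root, _dirs, files in os.walk(workdir):
            for name in sorted(files):
                handle_file(os.path.join(root, name))
        return result.stdout


if __name__ == "__main__":
    output = extract_embedded("tika.jar", "../testWORD_embedded_pdf.doc", print)
    print(output)
```

Anything you want to keep has to be copied out inside `handle_file`, since the files are gone once the function returns.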
