Check whether a PDF file is valid (Python)

Question:

I receive a file via an HTTP upload and need to be sure it is a PDF file. The programming language is Python, but that should not matter.

I thought of the following solutions:

1. Check whether the first bytes of the uploaded data are "%PDF". This is not a very strong check, but it prevents users from uploading other files by accident.

2. Try libmagic (the "file" command in the shell uses it). This performs essentially the same check as option 1.

3. Take a library and try to read the page count from the file. If the library can read a page count, the file should be a valid PDF. Problem: I don't know of a Python library that can do this.
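Option 1 can be sketched in a few lines of standard-library Python. The function name `starts_with_pdf_header` is made up for illustration, and the check only looks at the leading bytes, so it says nothing about whether the rest of the file is well formed.

```python
def starts_with_pdf_header(data):
    """Return True if the uploaded bytes begin with the "%PDF" signature.

    This only inspects the first four bytes; a truncated or corrupted
    PDF would still pass.
    """
    return data.startswith(b"%PDF")


# Hypothetical usage with the raw bytes of an upload:
# if not starts_with_pdf_header(upload_bytes):
#     reject the upload
```

This is a cheap first filter that could run before any heavier parsing.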

So, does anybody have a suggestion for a library or another trick?

Thanks

Answer:

The two most commonly used PDF libraries for Python are:

pyPdf
ReportLab

Both are pure Python, so they should be easy to install and should work across platforms.

With pyPdf it would probably be as simple as:

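from pyPdf import PdfFileReader

# Open in binary mode; open() replaces the Python 2-only file() builtin.
# Constructing the reader parses the file and should raise if it is not a readable PDF.
doc = PdfFileReader(open("upload.pdf", "rb"))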

This should be enough, since constructing the reader fails if the file cannot be parsed. If you want to do further checking, doc now has getDocumentInfo() and getNumPages() methods (also exposed as the documentInfo and numPages properties).
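To turn this into a yes/no check for an upload, the parse can be wrapped in a try/except. The following is only a sketch under some assumptions: the helper name `parses_as_pdf` is hypothetical, the reader class is passed in as a parameter so the sketch is not tied to one library version, and the reader is assumed to offer a pyPdf-style `getNumPages()` method (newer forks of the library may name this differently).

```python
import io


def parses_as_pdf(data, reader_cls):
    """Return the page count if reader_cls can parse data, otherwise None.

    reader_cls is assumed to accept a binary stream and to provide a
    pyPdf-style getNumPages() method.
    """
    try:
        reader = reader_cls(io.BytesIO(data))
        return reader.getNumPages()
    except Exception:
        # Malformed input can surface as several different exception
        # types, so this sketch deliberately catches broadly.
        return None
```

With pyPdf installed, this might be called as `parses_as_pdf(upload_bytes, PdfFileReader)`, where `upload_bytes` stands for the raw bytes of the upload; a result of `None` means the file was rejected.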

As Carl answered, pdftotext is also a good solution, and it would probably be faster on very large documents (especially ones with many cross-references). However, it might be a little slower on small PDFs because of the system overhead of spawning a new process.
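A rough sketch of the pdftotext approach, assuming the `pdftotext` command-line tool is installed and on the PATH. The helper name `pdftotext_accepts` and the 30-second timeout are illustrative choices, not part of the original answer. The extracted text is discarded; only the exit status is used.

```python
import subprocess


def pdftotext_accepts(path, executable="pdftotext", timeout=30):
    """Return True if pdftotext exits successfully on the given file.

    Returns False if the tool is missing, times out, or reports an error.
    """
    try:
        result = subprocess.run(
            [executable, path, "-"],  # "-" sends the extracted text to stdout
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0
```

Because a missing executable is also reported as False, a caller that needs to tell "invalid PDF" apart from "tool not installed" would have to check for the tool separately.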
