Check whether a PDF file is valid (Python)
I receive a file via HTTP upload and need to be sure it is a PDF file. The programming language is Python, but that should not matter.
I thought of the following solutions:
1. Check whether the first bytes of the file are "%PDF". This is not a thorough check, but it prevents users from uploading other files by accident.
2. Try libmagic (the "file" command in the shell uses it). This does exactly the same check as 1.
3. Take a library and try to read the page count from the file. If the library can read a page count, the file should be a valid PDF. Problem: I don't know of a Python library that can do this.
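For reference, the first option (checking the leading bytes) could be sketched roughly as below. The helper name `looks_like_pdf` is made up for illustration, and the sketch assumes the upload is available as a seekable binary stream. It only confirms the header, so it says nothing about whether the rest of the file is well formed.

```python
import io

# A PDF file starts with a header such as "%PDF-1.4".
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(stream):
    """Return True if the binary stream starts with the PDF header bytes.

    Hypothetical helper: assumes `stream` is seekable and opened in binary mode.
    """
    start = stream.tell()
    try:
        head = stream.read(len(PDF_MAGIC))
    finally:
        # Restore the position so the upload can still be saved or parsed later.
        stream.seek(start)
    return head == PDF_MAGIC


if __name__ == "__main__":
    print(looks_like_pdf(io.BytesIO(b"%PDF-1.4\n...")))
    print(looks_like_pdf(io.BytesIO(b"GIF89a")))
```

Because the stream position is restored, the same file object can be handed to a real parser afterwards.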
So does anybody have a suggestion for a library or another trick?
Thanks
The two most commonly used PDF libraries for Python are:
- pyPdf
- ReportLab
Both are pure Python, so they should be easy to install and work across platforms.
With pyPdf it would probably be as simple as doing:
from pyPdf import PdfFileReader
# open() works on Python 2 and 3; the file() builtin exists only on Python 2
doc = PdfFileReader(open("upload.pdf", "rb"))
This should be enough, since the constructor has to parse the file, but doc will now have getDocumentInfo() and getNumPages() methods (also exposed as the documentInfo and numPages properties) if you want to do further checking.
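To turn "the parser did not complain" into a yes/no answer, the call can be wrapped in a try/except. This is only a sketch under some assumptions: `is_valid_pdf` and `strict_header_parser` are hypothetical names, the parser is passed in as a callable so the example runs without pyPdf installed, and because the exact exception types a parser raises are not stated here, it catches Exception broadly.

```python
import io


def is_valid_pdf(stream, parse):
    """Return True if parse(stream) completes without raising.

    `parse` is any callable that raises on a malformed file,
    for example pyPdf's PdfFileReader (an assumption of this sketch).
    """
    try:
        parse(stream)
    except Exception:
        # Broad on purpose: the parser's exception types are not known here.
        return False
    return True


def strict_header_parser(stream):
    """Stand-in parser used only so this sketch runs without pyPdf."""
    if stream.read(5) != b"%PDF-":
        raise ValueError("missing PDF header")


if __name__ == "__main__":
    print(is_valid_pdf(io.BytesIO(b"%PDF-1.4\n"), strict_header_parser))
    print(is_valid_pdf(io.BytesIO(b"not a pdf"), strict_header_parser))
```

With pyPdf installed, the call would presumably look like is_valid_pdf(open("upload.pdf", "rb"), PdfFileReader).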
As Carl answered, pdftotext is also a good solution, and would probably be faster on very large documents (especially ones with many cross-references). However, it might be a little slower on small PDFs due to the system overhead of forking a new process.
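How pdftotext was meant to be invoked is not spelled out here, so the following is just one plausible sketch: it runs the tool on the file, discards the extracted text, and treats a zero exit status as "the file could be opened". The helper name `pdftotext_accepts` and the choice to return None when the tool is missing are assumptions made for illustration.

```python
import shutil
import subprocess


def pdftotext_accepts(path, executable="pdftotext"):
    """Return True/False based on pdftotext's exit status.

    Returns None if the executable is not on PATH, so that
    "tool missing" is not confused with "file rejected".
    """
    if shutil.which(executable) is None:
        return None
    result = subprocess.run(
        [executable, path, "-"],  # "-" sends the extracted text to stdout
        stdout=subprocess.DEVNULL,  # the text itself is not needed
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0
```

Each call spawns a new process, which is the per-file overhead mentioned above.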