How can I remove all images/drawings from a PDF file and keep only the text, in Java?
I have a PDF file that is the output of an OCR processor. The OCR processor recognizes the image and adds the text to the PDF, but in the end it places a low-quality image in place of the original one (I have no idea why anyone would do that, but they do).
So I would like to take this PDF, remove the image stream and leave the text alone, so that I can import it (using iText's page importing feature) into a PDF I'm creating myself with the real image.
And before someone asks: I have already tried another tool (JPedal) to extract the text coordinates, but when I draw the text on my PDF it isn't at the same position as in the original.
I'd rather have this done in Java, but if another tool can do it better, just let me know. Image removal alone would be enough; I can live with a PDF that still has the drawings in it.
I used Apache PDFBox in a similar situation.
To be a little more specific, try something like this:
import org.apache.pdfbox.exceptions.COSVisitorException;
import org.apache.pdfbox.exceptions.CryptographyException;
import org.apache.pdfbox.exceptions.InvalidPasswordException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentCatalog;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;

import java.io.IOException;

// Written against the PDFBox 1.x API (the org.apache.pdfbox.exceptions package is gone in 2.x).
public class Main {
    public static void main(String[] argv)
            throws COSVisitorException, InvalidPasswordException, CryptographyException, IOException {
        PDDocument document = PDDocument.load("input.pdf");
        try {
            // Try an empty password if the file is encrypted.
            if (document.isEncrypted()) {
                document.decrypt("");
            }
            PDDocumentCatalog catalog = document.getDocumentCatalog();
            for (Object pageObj : catalog.getAllPages()) {
                PDPage page = (PDPage) pageObj;
                // findResources() also considers resources inherited from parent page nodes.
                PDResources resources = page.findResources();
                if (resources != null) {
                    // Drop the image entries from this page's resources.
                    resources.getImages().clear();
                }
            }
            document.save("strippedOfImages.pdf");
        } finally {
            // Release the file handle even if something above fails.
            document.close();
        }
    }
}
It's supposed to remove all types of images (PNG, JPEG, ...).
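One thing that may be worth checking in the result: in a PDF, an image listed in the page resources is actually painted by a `/Name Do` operation in the page content stream, so the content stream can still refer to a name after its resource entry is gone. The following is a minimal plain-Java sketch of filtering those operations out of an already decoded content stream. It is not PDFBox or iText API; the class name `ImageDrawStripper`, the method `stripImageDraws`, and the resource names `Im1` and `Fm0` are made up for illustration, and the regex does not handle inline images (`BI ... EI`) or a `/Name Do` sequence that happens to sit inside a string literal.

```java
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only: works on decoded content stream text.
final class ImageDrawStripper {

    // A PDF name operand followed by the Do operator, e.g. "/Im1 Do".
    private static final Pattern DRAW_XOBJECT =
            Pattern.compile("/([^\\s/\\[\\]()<>{}%]+)\\s+Do(?=\\s|$)");

    // Removes "/Name Do" for every Name in imageNames; all other operations are kept as-is.
    static String stripImageDraws(String content, Set<String> imageNames) {
        Matcher m = DRAW_XOBJECT.matcher(content);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = imageNames.contains(m.group(1)) ? "" : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumed sample page content: one image, one text run, one form XObject.
        String content = "q 612 0 0 792 0 0 cm /Im1 Do Q\n"
                + "BT /F1 12 Tf 72 700 Td (Hello) Tj ET\n"
                + "/Fm0 Do";
        // Only Im1 is treated as an image here; the text and /Fm0 stay untouched.
        System.out.println(stripImageDraws(content, Set.of("Im1")));
    }
}
```

In a real document the set of names would have to come from the entries in the page's XObject resources whose subtype is Image, and the filtered text would have to be written back as the page's content stream; both steps depend on the PDF library in use and are left out here.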