How can I remove all images/drawings from a PDF file and leave only the text, in Java?

Question:

I have a PDF file that is the output of an OCR processor. The processor recognizes the image and adds the text to the PDF, but then embeds a low-quality image in place of the original one (I have no idea why anyone would do that, but they do).

So I would like to take this PDF, remove the image stream, and leave the text alone, so that I can import it (using iText's page-importing feature) into a PDF I am creating myself with the real image.

Before someone asks: I have already tried another tool (JPedal) to extract the text coordinates, but when I draw the text on my PDF it does not end up in the same position as in the original.

I would rather do this in Java, but if another tool can do it better, just let me know. Removing only the images would be enough; I can live with a PDF that still has the drawings in it.

Answer:

I used Apache PDFBox in a similar situation.

To be a little more specific, try something like this:

// Written against the PDFBox 1.x API (org.apache.pdfbox.exceptions.* exists only there).
import org.apache.pdfbox.exceptions.COSVisitorException;
import org.apache.pdfbox.exceptions.CryptographyException;
import org.apache.pdfbox.exceptions.InvalidPasswordException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentCatalog;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import java.io.IOException;

public class Main {
    public static void main(String[] argv) throws COSVisitorException, InvalidPasswordException, CryptographyException, IOException {
        PDDocument document = PDDocument.load("input.pdf");
        try {
            // Documents encrypted with an empty user password can be opened this way.
            if (document.isEncrypted()) {
                document.decrypt("");
            }

            // Walk every page and drop the images registered in its resources.
            PDDocumentCatalog catalog = document.getDocumentCatalog();
            for (Object pageObj : catalog.getAllPages()) {
                PDPage page = (PDPage) pageObj;
                PDResources resources = page.findResources();
                if (resources != null) {
                    resources.getImages().clear();
                }
            }

            document.save("strippedOfImages.pdf");
        } finally {
            // Release the file handle even if processing fails.
            document.close();
        }
    }
}

This is supposed to remove all types of images (PNG, JPEG, ...). It should work like this:

Example: http://s3.postimage.org/28f6boykk/before.jpg
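
One caveat worth keeping in mind: a page's content stream paints each image XObject with a `Do` operator that names an entry in the page resources, so clearing the resources alone can leave paint operators that point at nothing. Below is a minimal, library-free sketch of filtering those operators out. It assumes the content stream has already been split into plain string tokens and that the image names are known. `ImageOperatorFilter`, `stripImageDraws`, and the sample tokens are hypothetical names used for illustration only; they are not part of PDFBox or iText. Inline images (`BI` ... `EI`) are not handled here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;

public class ImageOperatorFilter {

    /**
     * Returns a copy of the token list with every "/Name Do" pair removed
     * when Name is listed in imageNames. All other tokens keep their order.
     * Assumption: tokens are simplified strings, not real PDF parser objects.
     */
    static List<String> stripImageDraws(List<String> tokens, Set<String> imageNames) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (token.equals("Do") && !kept.isEmpty()) {
                String operand = kept.get(kept.size() - 1);
                // The operand of Do is a name such as /Im0; strip the slash to compare.
                if (operand.startsWith("/") && imageNames.contains(operand.substring(1))) {
                    kept.remove(kept.size() - 1); // drop the operand
                    continue;                     // drop the Do operator itself
                }
            }
            kept.add(token);
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical page: one full-page image followed by a line of text.
        List<String> page = Arrays.asList(
                "q", "612", "0", "0", "792", "0", "0", "cm", "/Im0", "Do", "Q",
                "BT", "/F1", "12", "Tf", "72", "700", "Td", "(Hello)", "Tj", "ET");
        Set<String> images = Collections.singleton("Im0");

        System.out.println(String.join(" ", stripImageDraws(page, images)));
    }
}
```

The text operators between `BT` and `ET` are left untouched, which is why the text stays at its original position. In a real document the tokens would come from a PDF library's content stream parser, and the filtered list would be written back as the page's new content stream.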
