LibreOffice将PDF转换为Word作为文本框而不是普通文档
我有4个文件:
1.original.doc
2.original-to-pdf.pdf
3.pdf-to-word.doc
4.expected.doc
首先我转换 original.pdf
到original-to-pdf.pdf
然后我尝试转换回Word使用以下命令:
soffice --infilter="writer_pdf_import" --convert-to docx a.pdf
文件创建成功,但所有内容转换为文本框不作为正常的文件。然后我尝试了几个PDF到Word的转换器,如ilovepdf.com和我得到的expected.doc
你可以通过上方的链接下载我的文件来查看不同的内容,也可以查看下面的图片
自定义查询结果:
ilovepdf 输出:
我尝试了几个过滤器,包括pdf到odt,然后odt到word,但所有命令下面没有给我预期的结果
soffice --infilter="writer_pdf_import" --convert-to docx a.pdf
soffice --infilter="writer_pdf_import" --convert-to docx:"Microsoft Word 2007/2010/2013 XML" a.pdf
soffice --infilter="writer_pdf_import" --convert-to doc:"MS 2003 XML" a.pdf
soffice --infilter="writer_pdf_import" --convert-to doc a.pdf
soffice --infilter="writer_pdf_import" --convert-to odf:"writer8" a.pdf
soffice --infilter="writer8" --convert-to doc a.odf
soffice --infilter="writer_pdf_import" --convert-to doc:"MS Word 95" a.pdf
soffice --infilter="writer_pdf_import" --convert-to doc:"MS Word 97" a.pdf
soffice --infilter="writer_pdf_import" --convert-to doc:"StarOffice XML (Writer)" a.pdf
soffice --infilter="writer_pdf_import" --convert-to doc:"MS Word 2003 XML" a.pdf
soffice --infilter="writer_pdf_import" --convert-to docx:"MS Word 2003 XML" a.pdf
soffice --infilter="writer_pdf_import" --convert-to doc:"MS Word 2007 XML" a.pdf
soffice --infilter="writer_pdf_import" --convert-to doc:"MS Word 2003 XML" a.pdf
soffice --infilter="writer_pdf_import" --convert-to docx:"MS Word 2007 XML Template" a.pdf
soffice --infilter="writer_pdf_import" --convert-to docx:"MS Word 2007 XML" a.pdf
soffice --infilter="Microsoft Word 2007/2010/2013 XML" --convert-to doc a.pdf
我知道一些高级软件 abbyy cloud
或者 adobe cloud
, 但我不认为像ilovepdf这样的网站会使用付费服务来提供免费服务。我的问题是,我是否遗漏了LibreOffice依赖中的一些东西,以便能够将PDF转换为正常的word文档?
Your problem lies with the software used to create the PDF; output in the form of textboxes in a PDF is a characteristic of certain low-end PDF-creation software. There is nothing Word can do about that during the import process; you would need to clean it up afterwards.
A Word macro you could use for the clean-up is:
Sub EraseTextBoxes()
Dim RngDoc As Range, RngShp As Range, i As Long
With ActiveDocument
For i = .Shapes.Count To 1 Step -1
With .Shapes(i)
If .Type = msoTextBox Then
Set RngShp = .TextFrame.TextRange
RngShp.End = RngShp.End - 1
Set RngDoc = .Anchor
RngDoc.Collapse wdCollapseEnd
RngDoc.FormattedText = RngShp.FormattedText
.Delete
End If
End With
Next
End With
End Sub
Do note that whether the macro positions the output correctly depends on where the textboxes are anchored; if the anchor positions are unrelated to the textbox locations, you'll end up with a dog's breakfast. You'll probably still also end up with each line as its own paragraph. To clean up such content, see http://www.msofficeforums.com/word/29880-cleaning-up-text-pasted-websites-e-mails.html