Using LocationTextExtractionStrategy in iTextSharp to get text coordinates
My goal is to extract data from a PDF, possibly laid out as a table, into an Excel file.
Using LocationTextExtractionStrategy with iTextSharp, we can get a page's content as a plain-text string, read left to right.
How can I make the call
PdfTextExtractor.GetTextFromPage(reader, i, new LocationTextExtractionStrategy())
retain each piece of text's coordinates in the resulting string? For example, if the first line of text in the PDF is right-aligned, the resulting string should contain enough padding spaces to keep the content properly aligned.
Please give some suggestions on how I might achieve this.
It's very important to understand that PDFs have no support for tables. Anything that looks like a table is really just a bunch of text placed at specific locations over a background of lines. Keep this in mind as you work on this.
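To make that concrete, here is a minimal sketch of how text that only carries page coordinates could be turned back into aligned plain text. It uses no PDF library at all. The names PositionedText and GridLayout, the fixed character width, and the row tolerance are all assumptions for illustration; real pages use proportional fonts, so the alignment is approximate at best.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical type: one piece of text and where its baseline starts,
// in PDF points. Y grows upward on the page.
public class PositionedText
{
    public string Text;
    public float X;
    public float Y;

    public PositionedText(string text, float x, float y)
    {
        Text = text;
        X = x;
        Y = y;
    }
}

public static class GridLayout
{
    // Groups chunks into rows by Y, then pads each chunk with spaces so
    // that it starts near the column implied by its X coordinate.
    public static List<string> ToLines(
        IEnumerable<PositionedText> chunks, float charWidth, float rowTolerance)
    {
        var lines = new List<string>();
        var row = new List<PositionedText>();

        // Highest Y first, so the output reads top to bottom.
        foreach (var chunk in chunks.OrderByDescending(c => c.Y))
        {
            // A jump in Y larger than the tolerance starts a new row.
            if (row.Count > 0 && Math.Abs(row[0].Y - chunk.Y) > rowTolerance)
            {
                lines.Add(RenderRow(row, charWidth));
                row.Clear();
            }
            row.Add(chunk);
        }
        if (row.Count > 0)
        {
            lines.Add(RenderRow(row, charWidth));
        }
        return lines;
    }

    private static string RenderRow(List<PositionedText> row, float charWidth)
    {
        var sb = new StringBuilder();
        foreach (var chunk in row.OrderBy(c => c.X))
        {
            // Map the X coordinate to a character column.
            int column = Math.Max(0, (int)Math.Round(chunk.X / charWidth));

            // If the previous chunk already reaches this column,
            // keep at least one space as a separator.
            if (sb.Length > 0 && column <= sb.Length)
            {
                column = sb.Length + 1;
            }
            sb.Append(' ', column - sb.Length);
            sb.Append(chunk.Text);
        }
        return sb.ToString();
    }
}

public static class Program
{
    public static void Main()
    {
        // Made-up coordinates standing in for a two-column table.
        var chunks = new List<PositionedText>
        {
            new PositionedText("Name", 0f, 700f),
            new PositionedText("Qty", 60f, 700f),
            new PositionedText("Apple", 0f, 688f),
            new PositionedText("3", 60f, 688f)
        };

        foreach (string line in GridLayout.ToLines(chunks, 6f, 2f))
        {
            Console.WriteLine(line);
        }
    }
}
```

For an Excel export the padding step is probably unnecessary: the row grouping plus the sorted X positions already suggest which cell each chunk belongs to.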
That said, you need to subclass a text extraction strategy such as LocationTextExtractionStrategy (or implement ITextExtractionStrategy yourself) and pass that into GetTextFromPage(). See this post for a simple example of that. Then see this post for a more complex example of subclassing. The latter isn't completely relevant to your goal, but it does show some more complex things that you can do.
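As a rough illustration of the subclassing idea (not the code from the linked posts), the sketch below assumes iTextSharp 5.x and records where each piece of text starts while the base class still builds its usual string. PositionedText and CoordinateCapturingStrategy are hypothetical names, and the PDF path is taken from the first command-line argument.

```csharp
using System;
using System.Collections.Generic;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical helper type: a piece of text and the start of its baseline.
public class PositionedText
{
    public string Text;
    public float X;
    public float Y;
}

// Records coordinates for every text render operation on the page.
public class CoordinateCapturingStrategy : LocationTextExtractionStrategy
{
    public readonly List<PositionedText> Chunks = new List<PositionedText>();

    public override void RenderText(TextRenderInfo renderInfo)
    {
        // Let the base class collect the text as it normally would.
        base.RenderText(renderInfo);

        // The baseline start point gives the position in page coordinates.
        Vector start = renderInfo.GetBaseline().GetStartPoint();
        Chunks.Add(new PositionedText
        {
            Text = renderInfo.GetText(),
            X = start[Vector.I1],
            Y = start[Vector.I2]
        });
    }
}

public static class Program
{
    public static void Main(string[] args)
    {
        var reader = new PdfReader(args[0]);
        try
        {
            for (int i = 1; i <= reader.NumberOfPages; i++)
            {
                // Use a fresh strategy per page so chunks do not mix.
                var strategy = new CoordinateCapturingStrategy();
                PdfTextExtractor.GetTextFromPage(reader, i, strategy);

                foreach (PositionedText chunk in strategy.Chunks)
                {
                    Console.WriteLine("{0}\t{1:F1}\t{2:F1}\t{3}",
                        i, chunk.X, chunk.Y, chunk.Text);
                }
            }
        }
        finally
        {
            reader.Close();
        }
    }
}
```

Note that one visual word or cell may arrive as several chunks, depending on how the PDF was produced, so some merging of neighbouring chunks is likely needed before the coordinates map cleanly onto table cells.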