Using LocationTextExtractionStrategy in iTextSharp to get text coordinates

Question:

My goal is to extract data from a PDF, possibly laid out as a table, into an Excel file.

Using LocationTextExtractionStrategy with iTextSharp, we can get the page content as a plain-text string, read from left to right.

How, with

PdfTextExtractor.GetTextFromPage(reader, i, new LocationTextExtractionStrategy())

can I make the text retain its coordinates in the resulting string? For example, if the first line of text in the PDF is right-aligned, the resulting string should contain the spaces needed to keep the content correctly aligned.

Please suggest how I might achieve this.

Answer:

It's very important to understand that PDFs have no support for tables. Anything that looks like a table is really just a bunch of text placed at specific locations over a background of lines. Keep this in mind as you work on this.
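To make the "text at specific locations" idea concrete, here is a minimal sketch of how positioned text could be turned back into aligned plain text. It does not use iTextSharp at all: `PositionedText`, `GridLayout`, `charWidth`, and `rowTolerance` are hypothetical names and tuning values chosen for illustration, and the assumption that a fixed character width is a good enough approximation will not hold for every PDF.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical container: one piece of text plus the PDF coordinates where it starts.
public sealed class PositionedText
{
    public string Text { get; }
    public float X { get; }
    public float Y { get; }

    public PositionedText(string text, float x, float y)
    {
        Text = text;
        X = x;
        Y = y;
    }
}

public static class GridLayout
{
    // Renders chunks as plain text, padding with spaces so that each chunk
    // starts near the character column X / charWidth.
    // charWidth and rowTolerance are assumed values (in PDF units) to tune per document.
    public static string Render(IEnumerable<PositionedText> chunks,
                                float charWidth = 6f,
                                float rowTolerance = 2f)
    {
        // Group chunks into rows. PDF y grows upward, so sort descending
        // to walk the page from top to bottom.
        var rows = new List<List<PositionedText>>();
        foreach (var chunk in chunks.OrderByDescending(c => c.Y))
        {
            var last = rows.Count > 0 ? rows[rows.Count - 1] : null;
            if (last != null && Math.Abs(last[0].Y - chunk.Y) <= rowTolerance)
                last.Add(chunk);
            else
                rows.Add(new List<PositionedText> { chunk });
        }

        var lines = new List<string>();
        foreach (var row in rows)
        {
            var line = new StringBuilder();
            foreach (var chunk in row.OrderBy(c => c.X))
            {
                int column = (int)Math.Round(chunk.X / charWidth);
                // Keep at least one space between neighbouring chunks.
                if (line.Length > 0 && column <= line.Length)
                    column = line.Length + 1;
                line.Append(' ', Math.Max(0, column - line.Length));
                line.Append(chunk.Text);
            }
            lines.Add(line.ToString());
        }
        return string.Join("\n", lines);
    }
}
```

Calling `GridLayout.Render` on chunks that share a y coordinate puts them on one line, and chunks with the same x coordinate end up in the same character column, which is usually enough to split the result into spreadsheet cells afterwards.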

That said, you need to subclass a TextExtractionStrategy (for example LocationTextExtractionStrategy) and pass an instance of it into GetTextFromPage(). See this post for a simple example of that. Then see this post for a more complex example of subclassing. The latter isn't completely relevant to your goal, but it does show some more complex things that you can do.
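As a rough illustration of the subclassing approach, the sketch below extends LocationTextExtractionStrategy and records where each chunk of text starts. It assumes iTextSharp 5.x, where `RenderText` can be overridden; `LocatedChunk` and `CoordinateRecordingStrategy` are hypothetical names, and the program has not been run against any particular PDF.

```csharp
using System;
using System.Collections.Generic;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical record of one rendered text chunk and where its baseline starts.
public sealed class LocatedChunk
{
    public string Text;
    public float X;
    public float Y;
}

// Assumption: iTextSharp 5.x, where LocationTextExtractionStrategy.RenderText is virtual.
public class CoordinateRecordingStrategy : LocationTextExtractionStrategy
{
    public List<LocatedChunk> Chunks { get; } = new List<LocatedChunk>();

    public override void RenderText(TextRenderInfo renderInfo)
    {
        // Let the base class keep building its usual left-to-right text.
        base.RenderText(renderInfo);

        // Start of the baseline, in the page's coordinate system.
        Vector start = renderInfo.GetBaseline().GetStartPoint();
        Chunks.Add(new LocatedChunk
        {
            Text = renderInfo.GetText(),
            X = start[Vector.I1],
            Y = start[Vector.I2]
        });
    }
}

public static class Program
{
    public static void Main(string[] args)
    {
        var reader = new PdfReader(args[0]); // path to the PDF file
        try
        {
            for (int i = 1; i <= reader.NumberOfPages; i++)
            {
                var strategy = new CoordinateRecordingStrategy();
                PdfTextExtractor.GetTextFromPage(reader, i, strategy);

                foreach (var chunk in strategy.Chunks)
                    Console.WriteLine("page {0}: ({1:F1}, {2:F1}) {3}",
                                      i, chunk.X, chunk.Y, chunk.Text);
            }
        }
        finally
        {
            reader.Close();
        }
    }
}
```

A fresh strategy instance is created for every page so that coordinates from different pages do not mix. The recorded x and y values are what you would then use to decide which row and column each piece of text belongs to.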