Using LocationTextExtractionStrategy in iTextSharp to get text coordinates

Question:

My goal is to extract data from a PDF, possibly laid out as a table, into an Excel file.

Using LocationTextExtractionStrategy with iTextSharp, we can get the page content as a plain-text string, read from left to right.

How can I call

PdfTextExtractor.GetTextFromPage(reader, i, new LocationTextExtractionStrategy())

so that the text retains its coordinates in the resulting string? For example, if the first line of text in the PDF is right-aligned, the resulting string must contain the spaces needed to keep the content correctly aligned.

Please suggest how I might achieve this.

Answer:

It's very important to understand that PDFs have no support for tables. Anything that looks like a table is really just a bunch of text placed at specific locations over a background of lines. Keep this in mind as you work on this.

That said, you need to subclass a text extraction strategy (in iTextSharp, for example, LocationTextExtractionStrategy, which implements ITextExtractionStrategy) and pass that into GetTextFromPage(). See this post for a simple example of that. Then see this post for a more complex example of subclassing. The latter isn't completely relevant to your goal, but it does show some of the more complex things you can do.
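As a rough illustration of the subclassing idea, here is a minimal sketch, assuming iTextSharp 5.x, where LocationTextExtractionStrategy.RenderText is virtual. The names PositionedText, CoordinateTextExtractionStrategy and GetPaddedText are hypothetical, not part of the library, and the charWidth and lineTolerance values are guesses that would need tuning for a real document. The subclass records where each run of text starts, then rebuilds each line with padding spaces so that horizontal positions are roughly preserved.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical helper type: one run of text plus the start of its baseline.
public class PositionedText
{
    public string Text;
    public float X; // left end of the baseline, in PDF user-space units
    public float Y; // baseline height, measured from the bottom of the page
}

// Hypothetical subclass: keeps the default behaviour and also records positions.
public class CoordinateTextExtractionStrategy : LocationTextExtractionStrategy
{
    public readonly List<PositionedText> Chunks = new List<PositionedText>();

    public override void RenderText(TextRenderInfo renderInfo)
    {
        base.RenderText(renderInfo); // let the base class collect its own text

        Vector start = renderInfo.GetBaseline().GetStartPoint();
        Chunks.Add(new PositionedText
        {
            Text = renderInfo.GetText(),
            X = start[Vector.I1],
            Y = start[Vector.I2]
        });
    }

    // Assumption: one output character covers about charWidth units, and runs
    // whose baselines fall into the same lineTolerance-sized band share a line.
    public string GetPaddedText(float charWidth = 6f, float lineTolerance = 2f)
    {
        var result = new StringBuilder();
        var lines = Chunks
            .GroupBy(c => (int)Math.Round(c.Y / lineTolerance))
            .OrderByDescending(g => g.Key); // PDF y grows upward, so top line first

        foreach (var line in lines)
        {
            var sb = new StringBuilder();
            foreach (var chunk in line.OrderBy(c => c.X))
            {
                // Pad with spaces up to the column this run starts in.
                int column = (int)(chunk.X / charWidth);
                if (column > sb.Length) sb.Append(' ', column - sb.Length);
                sb.Append(chunk.Text);
            }
            result.AppendLine(sb.ToString());
        }
        return result.ToString();
    }
}

public static class Program
{
    public static void Main(string[] args)
    {
        var reader = new PdfReader(args[0]); // path to a PDF, supplied by the caller
        try
        {
            for (int i = 1; i <= reader.NumberOfPages; i++)
            {
                // Use a fresh strategy per page so positions do not mix across pages.
                var strategy = new CoordinateTextExtractionStrategy();
                PdfTextExtractor.GetTextFromPage(reader, i, strategy);
                Console.WriteLine(strategy.GetPaddedText());
            }
        }
        finally
        {
            reader.Close();
        }
    }
}
```

The banding by rounded Y value is deliberately crude: two runs on the same visual line can land in different bands if their baselines straddle a boundary. For an Excel export, the recorded X values in Chunks could be used to assign each run to a column directly, instead of padding with spaces.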
