iTextSharp: reading from a specific location
I have a problem using iTextSharp to read data from a PDF file. What I want to achieve is to read only a specific part of a PDF page (I only want to retrieve the address information, which is located at a constant position). I have seen iTextSharp used to read all pages, as in the following:
    StringBuilder text = new StringBuilder();
    if (File.Exists(fileName))
    {
        PdfReader pdfReader = new PdfReader(fileName);
        for (int page = 1; page <= pdfReader.NumberOfPages; page++)
        {
            ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
            string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
            currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
            text.Append(currentText);
        }
        pdfReader.Close();
    }
    return text.ToString();
But how can I restrict it to a specific location? I am open to using anything, even OCR, since some files might be images in the future (but that is not necessary at this time). This project is only for me, so there is no commercial use.
Thanks!
You are using a SimpleTextExtractionStrategy instead of a LocationTextExtractionStrategy. Please read the official documentation and the accompanying examples (Java / C#). If rect is a rectangle based on the coordinates of your address, you need:
    RenderFilter[] filter = { new RegionTextRenderFilter(rect) };
    ITextExtractionStrategy strategy;
    StringBuilder sb = new StringBuilder();
    for (int i = 1; i <= reader.NumberOfPages; i++) {
        strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
        sb.AppendLine(PdfTextExtractor.GetTextFromPage(reader, i, strategy));
    }
Now you'll get all the text snippets that intersect with rect (so part of the text may be outside rect; iText doesn't cut text snippets into pieces).
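For reference, here is how the snippet above might look as a complete method. This is only a sketch, assuming iTextSharp 5.x; the class name AddressExtractor, the method name ExtractRegion, and the coordinates in the usage note are made up for illustration, so adapt them to your own files.

```csharp
using System.Text;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical helper class; the name is not part of iTextSharp.
public static class AddressExtractor
{
    // Returns the text of every snippet that intersects rect, one page per line.
    public static string ExtractRegion(string fileName, Rectangle rect)
    {
        StringBuilder sb = new StringBuilder();
        PdfReader reader = new PdfReader(fileName);
        try
        {
            RenderFilter[] filter = { new RegionTextRenderFilter(rect) };
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // A fresh strategy per page, so text from earlier pages is not carried over.
                ITextExtractionStrategy strategy =
                    new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
                sb.AppendLine(PdfTextExtractor.GetTextFromPage(reader, page, strategy));
            }
        }
        finally
        {
            // Release the file even if extraction throws.
            reader.Close();
        }
        return sb.ToString();
    }
}
```

You could then call it as AddressExtractor.ExtractRegion(fileName, new Rectangle(50f, 650f, 300f, 750f)), where the four values are the lower-left x, lower-left y, upper-right x and upper-right y in user units. Those numbers are placeholders; you have to find the real ones for your address block.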
Note that you can get the MediaBox of a page using:
    Rectangle mediabox = reader.GetPageSize(pagenum);
The coordinates of the lower-left corner are x = mediabox.Left and y = mediabox.Bottom; the coordinates of the upper-right corner are x = mediabox.Right and y = mediabox.Top.
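If you don't know the exact coordinates yet, one way to start is to derive a region from the MediaBox and narrow it down from there. The sketch below builds a rectangle covering the top-left quarter of a page; it assumes iTextSharp 5.x, and the names PageRegions and TopLeftQuarter are hypothetical, chosen only for this example.

```csharp
using iTextSharp.text;
using iTextSharp.text.pdf;

// Hypothetical helper class; the name is not part of iTextSharp.
public static class PageRegions
{
    // Builds a rectangle covering the top-left quarter of the given page.
    public static Rectangle TopLeftQuarter(PdfReader reader, int pagenum)
    {
        Rectangle mediabox = reader.GetPageSize(pagenum);
        // The MediaBox does not have to start at (0, 0), so work from its corners.
        float midX = (mediabox.Left + mediabox.Right) / 2f;
        float midY = (mediabox.Bottom + mediabox.Top) / 2f;
        // Constructor order: lower-left x, lower-left y, upper-right x, upper-right y.
        return new Rectangle(mediabox.Left, midY, midX, mediabox.Top);
    }
}
```

Because y increases from bottom to top, the upper half of the page is the range from midY to mediabox.Top, not from 0 to midY.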
The values of x increase from left to right; the values of y increase from bottom to top. The unit of the measurement system in PDF is called "user unit". By default one user unit coincides with one point (this can change, but you won't find many PDFs with a different UserUnit value). In normal circumstances, 72 user units = 1 inch.
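If you measured the address block on paper or in a viewer that counts from the top-left corner, you need to convert to user units and flip the y axis. The sketch below shows that arithmetic in plain C#; it assumes the default of 72 user units per inch, and the names PdfUnits, InchesToUserUnits, MillimetersToUserUnits and FromTopLeft are made up for this example.

```csharp
using System;

// Hypothetical helper class; plain arithmetic, no iTextSharp calls.
public static class PdfUnits
{
    // Assumes the default UserUnit, where one user unit equals one point.
    public const float PointsPerInch = 72f;

    public static float InchesToUserUnits(float inches)
    {
        return inches * PointsPerInch;
    }

    public static float MillimetersToUserUnits(float mm)
    {
        return mm / 25.4f * PointsPerInch;
    }

    // Converts a box measured from the top-left corner of the page into
    // PDF coordinates. Returns { llx, lly, urx, ury } in user units.
    public static float[] FromTopLeft(float pageLeft, float pageTop,
                                      float x, float yFromTop,
                                      float width, float height)
    {
        float llx = pageLeft + x;
        float ury = pageTop - yFromTop; // y grows upwards in PDF
        return new[] { llx, ury - height, llx + width, ury };
    }
}
```

For example, on a page whose MediaBox has Left = 0 and Top = 842, a block that starts 1 inch from the left and 2 inches from the top, and is 3 inches wide and 1 inch tall, is FromTopLeft(0f, 842f, 72f, 144f, 216f, 72f), which gives llx = 72, lly = 626, urx = 288, ury = 698.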