Can I read PDF content using Telerik Document Processing?
I am working on a project where Telerik's Document Processing libraries are available to me, and I was hoping to use them to read a PDF file and search for specific text that I can use for other processing. But while the code to do so seems straightforward, I am not getting the expected results. This is the proof of concept I threw together:
var fs = new FileStream("..\\some.pdf", FileMode.Open);
RadFixedDocument doc = new PdfFormatProvider(fs).Import();
var pageCt = 0;
var elementCt = 0;
foreach (var page in doc.Pages) {
    pageCt += 1;
    Console.WriteLine($"Page {pageCt}, (Has content: {page.HasContent}, {page.Content.Count})");
    foreach (var contentEl in page.Content) {
        elementCt += 1;
        Console.WriteLine($"Element {elementCt}");
        if (contentEl is TextFragment) {
            string text = (contentEl as TextFragment).Text;
            Console.WriteLine(text);
            // if (text.Contains("{{CustomTag}}")) {
            //     Console.WriteLine(text);
            // } else {
            //     Console.Write(".");
            // }
        }
        else {
            Console.WriteLine($"Content Type: {contentEl.GetType().ToString()}");
        }
    }
}
I have tested this on a number of documents, but while it seems to pick out the proper number of pages, each page reports HasContent as false and the Content collection is empty.
Am I wrong in thinking I should be able to step through the PDF content elements this way?
OK. This is a pretty strange deal, but with a little help from a colleague, we managed to get this working. It turns out the difference is in how you supply the FileStream.
So instead of
RadFixedDocument doc = new PdfFormatProvider(fs).Import();
we used
RadFixedDocument doc = new PdfFormatProvider().Import(fs);
And with everything else the same, it works.
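For anyone who wants the whole thing in one place, here is a minimal sketch of the text search from the question with the stream passed to Import(). The class and method names (PdfTagSearch, FindFragments, PrintMatches) are made up for illustration, and the namespaces assume the Telerik.Windows.Documents.Fixed assemblies. I believe newer releases also offer an Import overload that takes a timeout, so check which one your version prefers.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Telerik.Windows.Documents.Fixed.FormatProviders.Pdf;
using Telerik.Windows.Documents.Fixed.Model;
using Telerik.Windows.Documents.Fixed.Model.Text;

// Hypothetical helper, not part of the Telerik API.
public static class PdfTagSearch
{
    // Returns (page number, fragment text) for every text fragment that contains the tag.
    public static List<(int Page, string Text)> FindFragments(Stream pdf, string tag)
    {
        // Pass the stream to Import(), not to the provider's constructor.
        RadFixedDocument doc = new PdfFormatProvider().Import(pdf);

        var hits = new List<(int Page, string Text)>();
        int pageNo = 0;
        foreach (RadFixedPage page in doc.Pages)
        {
            pageNo++;
            foreach (var element in page.Content)
            {
                // Only text fragments carry searchable text; paths and images are skipped.
                if (element is TextFragment fragment
                    && fragment.Text != null
                    && fragment.Text.Contains(tag))
                {
                    hits.Add((pageNo, fragment.Text));
                }
            }
        }
        return hits;
    }

    // Opens a file and prints every matching fragment with its page number.
    public static void PrintMatches(string path, string tag)
    {
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            foreach (var hit in FindFragments(fs, tag))
            {
                Console.WriteLine($"Page {hit.Page}: {hit.Text}");
            }
        }
    }
}
```

Calling PdfTagSearch.PrintMatches("..\\some.pdf", "{{CustomTag}}") would mirror the original proof of concept. One caveat: a PDF can split a visible run of text across several fragments, so a per-fragment Contains check may miss a tag that straddles two of them.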