Extracting editable fields from a PDF in Objective-C
I've been researching how to work with PDFs in my iOS app for a while now. I've figured out a few pieces of the puzzle, like scanning for operators and displaying the PDF in a UIWebView. However, what I really need to do is identify the editable fields within a PDF document.
Ideally I would like to interact with the fields directly, but that sounds very difficult and is not an obvious first step. I am already interfacing with a Windows service that can manipulate PDFs in this way, so I could settle for identifying the editable fields, gathering the field data from the user in a form view, and POSTing that data back to the server. The problem is that I can't see how to identify the fields. I'm working with government-issued PDFs such as I-9s and W-4s, so I have no control over how the PDFs are created or how the fields are named. That is why I need to extract them dynamically. Any help and/or references would be appreciated.
I'm using [this reference](https://developer.apple.com/library/mac/#documentation/graphicsimaging/conceptual/drawingwithquartz2d/dq_pdf_scan/dq_pdf_scan.html "PDF Document Parsing") from Apple's Quartz 2D Programming Guide to trigger operator callbacks when scanning a PDF, but that isn't helping me find the editable fields.
I'm also simply loading a UIWebView with the PDF data to display it to the user.
[_webView loadData:decodedData MIMEType:@"application/pdf" textEncodingName:@"utf-8" baseURL:nil];
Update:
I built a PDF helper class (shown below) to traverse all possible object types in the catalog. Originally I was not handling nested dictionaries within arrays, so I was not seeing the form fields. Once I fixed that, I realized there were parent references that I had to account for, because following them causes circular recursive calls and an infinite loop. The code below logs a wealth of information from the document catalog. Now I just need to parse it to isolate the form fields I need.
PDFHelper.h
#import <Foundation/Foundation.h>
#import <CoreGraphics/CoreGraphics.h>
id selfClass;
@interface PDFHelper : NSObject
@property (nonatomic, strong) NSData *pdfData;
@property (nonatomic, strong) NSMutableDictionary *pdfDict;
@property (nonatomic) int catalogLevel;
-(NSArray *) copyPDFArray:(CGPDFArrayRef)arr referencingDictionary:(CGPDFDictionaryRef)dict referencingKey:(const char *)key;
-(NSArray *) getFormFields;
-(CGPDFDictionaryRef) getDocumentCatalog;
@end
PDFHelper.m
#import "PDFHelper.h"
#import "FileHelpers.h"
#import "Log.h"
@implementation PDFHelper
@synthesize pdfData = _pdfData;
@synthesize pdfDict = _pdfDict;
@synthesize catalogLevel = _catalogLevel;
-(id)init
{
self = [super init];
if(self)
{
selfClass = self;
_pdfDict = [[NSMutableDictionary alloc] init];
_catalogLevel = 1;
}
return self;
}
-(NSArray *) getFormFields
{
CGPDFDictionaryRef acroForm = NULL;
if (CGPDFDictionaryGetDictionary([self getPdfDocDictionary], "AcroForm", &acroForm))
CGPDFDictionaryApplyFunction(acroForm, getDictionaryObjects, acroForm);
return [_pdfDict objectForKey:@"XFA"];
}
-(CGPDFDictionaryRef) getDocumentCatalog
{
CGPDFDictionaryRef docCatalog = [self getPdfDocDictionary];
CGPDFDictionaryApplyFunction(docCatalog, getDictionaryObjects, docCatalog);
return docCatalog;
}
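-(CGPDFDictionaryRef) getPdfDocDictionary
{
// Write the PDF data to a file so Core Graphics can open it by URL.
NSURL *pdf = [[NSURL alloc] initFileURLWithPath:[FileHelpers pathInLibraryDirectory:@"file.pdf"]];
[_pdfData writeToFile:[pdf path] atomically:YES];
// Note: the document is never released. The catalog is owned by the document, so releasing it here
// would invalidate the returned dictionary; a fuller version would keep the document in an ivar.
CGPDFDocumentRef pdfDocument = CGPDFDocumentCreateWithURL((__bridge CFURLRef)pdf);
CGPDFDictionaryRef returnDict = CGPDFDocumentGetCatalog(pdfDocument);
return returnDict;
}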
void getDictionaryObjects (const char *key, CGPDFObjectRef object, void *info) {
NSString *logString = [[NSString alloc] initWithString:[NSString stringWithFormat:@"key: %s", key]];
for (int i = 0; i < [selfClass catalogLevel]; i++)
logString = [NSString stringWithFormat:@"-%@", logString];
[Log LogDebug:logString];
CGPDFDictionaryRef contentDict = (CGPDFDictionaryRef)info;
CGPDFObjectType type = CGPDFObjectGetType(object);
switch (type) {
case kCGPDFObjectTypeNull: {
[Log LogDebug:[NSString stringWithFormat:@"*****pdf null value"]];
break;
}
case kCGPDFObjectTypeBoolean: {
CGPDFBoolean objectBoolean;
if (CGPDFObjectGetValue(object, kCGPDFObjectTypeBoolean, &objectBoolean)) {
NSString *logString = [[NSString alloc] initWithString:[NSString stringWithFormat:@"pdf boolean value: %@", [NSNumber numberWithBool:objectBoolean]]];
for (int i = 0; i < [selfClass catalogLevel]; i++)
logString = [NSString stringWithFormat:@"-%@", logString];
[Log LogDebug:logString];
[[selfClass pdfDict] setObject:[NSNumber numberWithBool:objectBoolean]
forKey:[NSString stringWithCString:key encoding:NSUTF8StringEncoding]];
}
break;
}
case kCGPDFObjectTypeInteger: {
CGPDFInteger objectInteger;
if (CGPDFObjectGetValue(object, kCGPDFObjectTypeInteger, &objectInteger)) {
NSString *logString = [[NSString alloc] initWithString:[NSString stringWithFormat:@"pdf integer value: %ld", (long int)objectInteger]];
for (int i = 0; i < [selfClass catalogLevel]; i++)
logString = [NSString stringWithFormat:@"-%@", logString];
[Log LogDebug:logString];
[[selfClass pdfDict] setObject:[NSNumber numberWithInt:objectInteger]
forKey:[NSString stringWithCString:key encoding:NSUTF8StringEncoding]];
}
break;
}
case kCGPDFObjectTypeReal: {
CGPDFReal objectReal;
if (CGPDFObjectGetValue(object, kCGPDFObjectTypeReal, &objectReal)) {
// CGPDFReal is a floating-point type, so log and store it as a double to keep the fractional part.
NSString *logString = [NSString stringWithFormat:@"pdf real value: %f", (double)objectReal];
for (int i = 0; i < [selfClass catalogLevel]; i++)
logString = [NSString stringWithFormat:@"-%@", logString];
[Log LogDebug:logString];
[[selfClass pdfDict] setObject:[NSNumber numberWithDouble:objectReal]
forKey:[NSString stringWithCString:key encoding:NSUTF8StringEncoding]];
}
break;
}
case kCGPDFObjectTypeName: {
const char *name;
if (CGPDFDictionaryGetName(contentDict, key, &name))
{
NSString *dictName = [[NSString alloc] initWithCString:name encoding:NSUTF8StringEncoding];
if (dictName)
{
NSString *logString = [[NSString alloc] initWithString:[NSString stringWithFormat:@"pdf name value: %@", dictName]];
for (int i = 0; i < [selfClass catalogLevel]; i++)
logString = [NSString stringWithFormat:@"-%@", logString];
[Log LogDebug:logString];
[[selfClass pdfDict] setObject:dictName
forKey:[NSString stringWithCString:key encoding:NSUTF8StringEncoding]];
}
}
break;
}
case kCGPDFObjectTypeString: {
CGPDFStringRef objectString;
if (CGPDFObjectGetValue(object, kCGPDFObjectTypeString, &objectString)) {
// CGPDFStringCopyTextString returns a +1 reference; copy once and hand ownership to ARC.
NSString *textString = (__bridge_transfer NSString *)CGPDFStringCopyTextString(objectString);
if (textString) {
NSString *logString = [NSString stringWithFormat:@"pdf string value: %@", textString];
for (int i = 0; i < [selfClass catalogLevel]; i++)
logString = [NSString stringWithFormat:@"-%@", logString];
[Log LogDebug:logString];
[[selfClass pdfDict] setObject:textString
forKey:[NSString stringWithCString:key encoding:NSUTF8StringEncoding]];
}
}
break;
}
case kCGPDFObjectTypeArray: {
CGPDFArrayRef objectArray;
if (CGPDFObjectGetValue(object, kCGPDFObjectTypeArray, &objectArray)) {
NSArray *myArray=[selfClass copyPDFArray:objectArray referencingDictionary:contentDict referencingKey:key];
[[selfClass pdfDict] setObject:myArray
forKey:[NSString stringWithCString:key encoding:NSUTF8StringEncoding]];
}
break;
}
case kCGPDFObjectTypeDictionary: {
CGPDFDictionaryRef objectDictionary;
if (CGPDFObjectGetValue(object, kCGPDFObjectTypeDictionary, &objectDictionary)) {
NSString *logString = @"Found dictionary";
for (int i = 0; i < [selfClass catalogLevel]; i++)
logString = [NSString stringWithFormat:@"-%@", logString];
//[Log LogDebug:logString];
NSString *keyCheck = [[NSString alloc] initWithUTF8String:key];
if (![keyCheck isEqualToString:@"Parent"] && ![keyCheck isEqualToString:@"P"])
{
[selfClass setCatalogLevel:[selfClass catalogLevel] + 1];
CGPDFDictionaryApplyFunction(objectDictionary, getDictionaryObjects, objectDictionary);
[selfClass setCatalogLevel:[selfClass catalogLevel] - 1];
}
}
break;
}
case kCGPDFObjectTypeStream: {
CGPDFStreamRef objectStream;
if (CGPDFObjectGetValue(object, kCGPDFObjectTypeStream, &objectStream)) {
CGPDFDictionaryRef dict = CGPDFStreamGetDictionary( objectStream );
CGPDFDataFormat fmt = CGPDFDataFormatRaw;
// CGPDFStreamCopyData returns a +1 reference (or NULL); transfer it to ARC so it gets released.
NSData *data = (__bridge_transfer NSData *)CGPDFStreamCopyData(objectStream, &fmt);
[data writeToFile:[FileHelpers pathInDocumentDirectory:@"data.dat"] atomically:YES];
// Binary streams are not valid UTF-8, in which case dataString is nil.
NSString *dataString = data ? [[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding] : nil;
NSString *logString = [NSString stringWithFormat:@"pdf stream length: %lu - %@", (unsigned long)[data length], dataString];
for (int i = 0; i < [selfClass catalogLevel]; i++)
logString = [NSString stringWithFormat:@"-%@", logString];
[Log LogDebug:logString];
NSString *keyCheck = [[NSString alloc] initWithUTF8String:key];
// Skip parent references to avoid infinite recursion.
if( dict && ![keyCheck isEqualToString:@"Parent"] && ![keyCheck isEqualToString:@"P"])
{
[selfClass setCatalogLevel:[selfClass catalogLevel] + 1];
CGPDFDictionaryApplyFunction(dict, getDictionaryObjects, dict);
[selfClass setCatalogLevel:[selfClass catalogLevel] - 1];
}
}
break;
}
}
}
- (NSArray *)copyPDFArray:(CGPDFArrayRef)arr referencingDictionary:(CGPDFDictionaryRef)dict referencingKey:(const char *)key
{
int i = 0;
NSMutableArray *temp = [[NSMutableArray alloc] init];
NSString *logString = [[NSString alloc] initWithString:[NSString stringWithFormat:@"pdf array count: %zu", CGPDFArrayGetCount(arr)]];
for (int i = 0; i < [selfClass catalogLevel]; i++)
logString = [NSString stringWithFormat:@"-%@", logString];
[Log LogDebug:logString];
for(i=0; i<CGPDFArrayGetCount(arr); i++){
CGPDFObjectRef object;
CGPDFArrayGetObject(arr, i, &object);
CGPDFObjectType type = CGPDFObjectGetType(object);
switch(type){
case kCGPDFObjectTypeNull: {
NSString *logString = [[NSString alloc] initWithString:[NSString stringWithFormat:@"pdf array null(%d)", i]];
for (int i = 0; i < [selfClass catalogLevel]; i++)
logString = [NSString stringWithFormat:@"-%@", logString];
[Log LogDebug:logString];
break;
}
case kCGPDFObjectTypeBoolean: {
CGPDFBoolean objectBool;
if (CGPDFObjectGetValue(object, kCGPDFObjectTypeBoolean, &objectBool)) {
NSString *logString = [[NSString alloc] initWithString:[NSString stringWithFormat:@"pdf array boolean value(%d): %@", i, [NSNumber numberWithBool:objectBool]]];
for (int i = 0; i < [selfClass catalogLevel]; i++)
logString = [NSString stringWithFormat:@"-%@", logString];
[Log LogDebug:logString];
[temp addObject:[NSNumber numberWithBool:objectBool]];
}
break;
}
case kCGPDFObjectTypeInteger: {
CGPDFInteger objectInteger;
if (CGPDFObjectGetValue(object, kCGPDFObjectTypeInteger, &objectInteger)) {
NSString *logString = [[NSString alloc] initWithString:[NSString stringWithFormat:@"pdf array integer value(%d): %ld", i, (long int)objectInteger]];
for (int i = 0; i < [selfClass catalogLevel]; i++)
logString = [NSString stringWithFormat:@"-%@", logString];
[Log LogDebug:logString];
[temp addObject:[NSNumber numberWithInt:objectInteger]];
}
break;
}
case kCGPDFObjectTypeReal:
{
CGPDFReal objectReal;
if (CGPDFObjectGetValue(object, kCGPDFObjectTypeReal, &objectReal))
{
// Keep the fractional part: log with %f and store as a double.
NSString *logString = [NSString stringWithFormat:@"pdf array real(%d): %f", i, (double)objectReal];
for (int i = 0; i < [selfClass catalogLevel]; i++)
logString = [NSString stringWithFormat:@"-%@", logString];
[Log LogDebug:logString];
[temp addObject:[NSNumber numberWithDouble:objectReal]];
}
break;
}
case kCGPDFObjectTypeName:
{
const char *name;
// Read the name from the array element itself; looking up the referencing key in the dictionary
// fails here because that entry is the array, not a name.
if (CGPDFObjectGetValue(object, kCGPDFObjectTypeName, &name))
{
NSString *dictName = [[NSString alloc] initWithCString:name encoding:NSUTF8StringEncoding];
if (dictName)
{
NSString *logString = [NSString stringWithFormat:@"pdf array name value(%d): %@", i, dictName];
for (int i = 0; i < [selfClass catalogLevel]; i++)
logString = [NSString stringWithFormat:@"-%@", logString];
[Log LogDebug:logString];
[temp addObject:dictName];
}
}
break;
}
case kCGPDFObjectTypeString:
{
CGPDFStringRef objectString;
if (CGPDFObjectGetValue(object, kCGPDFObjectTypeString, &objectString))
{
// CGPDFStringCopyTextString returns a +1 reference; hand ownership to ARC so it is not leaked.
NSString *tempStr = (__bridge_transfer NSString *)CGPDFStringCopyTextString(objectString);
if (tempStr)
{
NSString *logString = [NSString stringWithFormat:@"pdf array string(%d): %@", i, tempStr];
for (int i = 0; i < [selfClass catalogLevel]; i++)
logString = [NSString stringWithFormat:@"-%@", logString];
[Log LogDebug:logString];
[temp addObject:tempStr];
}
}
break;
}
case kCGPDFObjectTypeArray :
{
CGPDFArrayRef objectArray;
if (CGPDFObjectGetValue(object, kCGPDFObjectTypeArray, &objectArray))
{
NSArray *tempArr = [selfClass copyPDFArray:objectArray referencingDictionary:dict referencingKey:key];
[temp addObject:tempArr];
}
break;
}
case kCGPDFObjectTypeDictionary :
{
CGPDFDictionaryRef objectDict;
NSString *keyCheck = [[NSString alloc] initWithUTF8String:key];
if (CGPDFObjectGetValue(object, kCGPDFObjectTypeDictionary, &objectDict) && ![keyCheck isEqualToString:@"Parent"] && ![keyCheck isEqualToString:@"P"])
{
[selfClass setCatalogLevel:[selfClass catalogLevel] + 1];
CGPDFDictionaryApplyFunction( objectDict, getDictionaryObjects, objectDict);
[selfClass setCatalogLevel:[selfClass catalogLevel] - 1];
}
break;
}
case kCGPDFObjectTypeStream :
{
CGPDFStreamRef objectStream;
if (CGPDFObjectGetValue(object, kCGPDFObjectTypeStream, &objectStream))
{
CGPDFDictionaryRef streamDict = CGPDFStreamGetDictionary( objectStream );
CGPDFDataFormat fmt = CGPDFDataFormatRaw;
// CGPDFStreamCopyData returns a +1 reference (or NULL); transfer it to ARC so it gets released.
NSData *data = (__bridge_transfer NSData *)CGPDFStreamCopyData(objectStream, &fmt);
NSString *dataString = data ? [[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding] : nil;
NSString *logString = [NSString stringWithFormat:@"pdf array stream length: (%d): %lu - %@", i, (unsigned long)[data length], dataString];
for (int i = 0; i < [selfClass catalogLevel]; i++)
logString = [NSString stringWithFormat:@"-%@", logString];
[Log LogDebug:logString];
NSString *keyCheck = [[NSString alloc] initWithUTF8String:key];
// Skip parent references to avoid infinite recursion.
if( streamDict && ![keyCheck isEqualToString:@"Parent"] && ![keyCheck isEqualToString:@"P"])
{
[selfClass setCatalogLevel:[selfClass catalogLevel] + 1];
CGPDFDictionaryApplyFunction( streamDict, getDictionaryObjects, streamDict );
[selfClass setCatalogLevel:[selfClass catalogLevel] - 1];
}
}
break;
}
}
}
return temp;
}
@end
By "editable fields", do you mean the kind of form elements that can be filled in using Acrobat or Adobe Reader?
Those fields are not part of the actual page description. If you look at the PDF specification, you'll find a description of "Interactive Forms" in section 12.7, which explains that a document's field dictionaries are stored under an entry called "AcroForm" in the document catalog.
As far as I know, iOS does give you access to the document catalog, so you would have to find the "AcroForm" entry in that catalog dictionary and then descend into the field dictionary structure to collect the information you want. All fields of the complete document are stored there hierarchically.
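As a rough illustration of that descent, here is a minimal sketch that assumes ARC and an already opened `CGPDFDocumentRef`. The function names `CollectFields` and `FormFieldsInDocument` are hypothetical, as is the choice to treat any kid without its own `/T` entry as a widget rather than a child field. It only follows `/Kids` downwards and never `/Parent`, so it cannot loop. It covers AcroForm fields only; anything defined purely through XFA would not show up.

```objc
#import <Foundation/Foundation.h>
#import <CoreGraphics/CoreGraphics.h>

// Hypothetical helper: walks a /Fields or /Kids array and appends one
// entry per terminal field to `out`. `prefix` is the parent's fully
// qualified name (nil at the top level); `inheritedType` is the parent's /FT.
static void CollectFields(CGPDFArrayRef fields, NSString *prefix,
                          const char *inheritedType, NSMutableArray *out)
{
    size_t count = CGPDFArrayGetCount(fields);
    for (size_t i = 0; i < count; i++) {
        CGPDFDictionaryRef field = NULL;
        if (!CGPDFArrayGetDictionary(fields, i, &field)) continue;

        // /T is the partial field name. Assumption: a kid without /T is
        // a widget annotation of its parent, not a field of its own.
        CGPDFStringRef t = NULL;
        if (!CGPDFDictionaryGetString(field, "T", &t)) continue;
        NSString *partial = CFBridgingRelease(CGPDFStringCopyTextString(t));
        if (!partial) continue;
        NSString *name = prefix
            ? [NSString stringWithFormat:@"%@.%@", prefix, partial]
            : partial;

        // /FT is the field type (Tx, Btn, Ch, Sig); it can be inherited.
        const char *ownType = NULL;
        const char *type = CGPDFDictionaryGetName(field, "FT", &ownType)
            ? ownType : inheritedType;

        // Descend into child fields, if any.
        NSUInteger before = out.count;
        CGPDFArrayRef kids = NULL;
        if (CGPDFDictionaryGetArray(field, "Kids", &kids)) {
            CollectFields(kids, name, type, out);
        }

        // Nothing added below this node, so it is a terminal field.
        if (out.count == before) {
            NSString *typeString =
                type ? [NSString stringWithUTF8String:type] : nil;
            [out addObject:@{ @"name": name, @"type": typeString ?: @"" }];
        }
    }
}

// Hypothetical entry point: catalog -> /AcroForm -> /Fields.
NSArray *FormFieldsInDocument(CGPDFDocumentRef document)
{
    NSMutableArray *result = [NSMutableArray array];
    CGPDFDictionaryRef catalog = CGPDFDocumentGetCatalog(document);
    CGPDFDictionaryRef acroForm = NULL;
    CGPDFArrayRef fields = NULL;
    if (catalog &&
        CGPDFDictionaryGetDictionary(catalog, "AcroForm", &acroForm) &&
        CGPDFDictionaryGetArray(acroForm, "Fields", &fields)) {
        CollectFields(fields, nil, NULL, result);
    }
    return result;
}
```

Each entry in the returned array is a dictionary with the dotted, fully qualified field name and its type, which should be enough to build a form view and POST the values back to a server. The document has to stay alive while the walk runs, because the catalog and everything reached from it are owned by the document.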