How to use XMLReader to parse multiple identically named attributes in XML elements/subelements

Problem description:

I'm using XMLReader and PHP to process a moderately sized XML file (6 MB), breaking up the attribute data and inserting it into my own database. The problem is that each element has a variable number of subelements with identically named attributes.

Here's an example (this is open data about the government, courtesy of govtrack.us):

<?xml version="1.0" ?>
<people>
    <person id='400001' lastname='Abercrombie' firstname='Neil' birthday='1938-06-26' gender='M' pvsid='26827' osid='N00007665' bioguideid='A000014' metavidid='Neil_Abercrombie' youtubeid='hawaiirep1' name='Rep. Neil Abercrombie [D, HI-1]' title='Rep.' state='HI' district='1' >
        <role type='rep' startdate='1985-01-03' enddate='1986-10-18' party='Democrat' state='HI' district='1' />
        <role type='rep' startdate='1991-01-03' enddate='1992-10-09' party='Democrat' state='HI' district='1' />
        <role type='rep' startdate='1993-01-05' enddate='1994-12-01' party='Democrat' state='HI' district='1' />
        <role type='rep' startdate='1995-01-04' enddate='1996-10-04' party='Democrat' state='HI' district='1' />
        <role type='rep' startdate='1997-01-07' enddate='1998-12-19' party='Democrat' state='HI' district='1' />
        <role type='rep' startdate='1999-01-06' enddate='2000-12-15' party='Democrat' state='HI' district='1' />
        <role type='rep' startdate='2001-01-03' enddate='2002-11-22' party='Democrat' state='HI' district='1' />
        <role type='rep' startdate='2003-01-07' enddate='2004-12-09' party='Democrat' state='HI' district='1' url='http://www.house.gov/abercrombie' />
        <role type='rep' startdate='2005-01-04' enddate='2006-12-08' party='Democrat' state='HI' district='1' url='http://www.house.gov/abercrombie' />
        <role type='rep' startdate='2007-01-04' enddate='2009-01-03' party='Democrat' state='HI' district='1' url='http://www.house.gov/abercrombie' />
        <role type='rep' startdate='2009-01-06' enddate='2010-03-01' party='Democrat' state='HI' district='1' url='http://www.house.gov/abercrombie' />
    </person>

I don't need to perform any fancy logic on the attributes. At the beginning of my script, I check whether I've already processed this particular record (based on the 'id' attribute), and then I grab pretty much every attribute and parse it into my db. But there are two problems:

1) When I use this:

$p->getAttribute('id')

to get the 'id', it gives it to me twice, separated by as many line breaks as there are subelements in the element (I think the comment on this page speaks to that, but I'm not sure what to do about it).

2) How do I access the attributes of each subelement sequentially? This:

$p->getAttribute('startdate')

gives me every 'startdate' value, separated by multiple line breaks. I just need to grab the id of the element and then cycle through each of the 'role' subelements.

Any ideas?

For reference, here's the super-simple controller I have so far:

$f = base_url().'data/people.xml';
$p = new XMLReader;
$p->open($f);
while($p->read())
{
    if($this->_notImported('govtrack',$p->getAttribute('id')))
    {
            // here I just grab the attributes, put them into arrays to insert, like so:
            $insert = array('indiv_name' => $full_name,
                                    'indiv_first' => ($p->getAttribute(‘firstname’)),
                                    'indiv_last' => ($p->getAttribute(‘lastname’)),
                                    'indiv_middle' => ($p->getAttribute(‘middlename’)),
                                    'indiv_other' => ($p->getAttribute(‘namemod’)),
                                    'indiv_full_name' => $full_name,
                                    'indiv_title' => ($p->getAttribute(‘title’)),
                                    'indiv_dob' => ($p->getAttribute(‘birthday’)),
                                    'indiv_gender' => ($p->getAttribute(‘gender’)),
                                    'indiv_religion' => ($p->getAttribute(‘religion’)),
                                    'indiv_url' => ($url)
                                    );

For the element itself, this is not that difficult, but I don't know how to cycle through each of the 'role' subelements and grab their attributes separately.

Answer:

Your first problem is that you are not checking for the appropriate nodeType, which is in fact what the comment you linked describes: without the check, your code matches both the opening tag (ELEMENT) and the closing tag (END_ELEMENT).

Your second issue is also related to the missing nodeType check. Once you fix that, you just have to check the node's name to find out whether it's a <role> or a <person>.
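To see why the nodeType check matters, here is a minimal sketch that records every node the reader stops on. The two-role XML string is a hypothetical, cut-down sample shaped like the govtrack.us data, and the variable names ($seen, $personOpen, $personClose) are illustrative only.

```php
<?php
// Hypothetical, cut-down sample shaped like the govtrack.us data.
$xml = <<<XML
<people>
    <person id='400001' lastname='Abercrombie'>
        <role type='rep' startdate='1985-01-03' />
        <role type='rep' startdate='1991-01-03' />
    </person>
</people>
XML;

$reader = new XMLReader();
$reader->XML($xml);

// Record every stop of read() as "nodeType:name".
$seen = [];
while ($reader->read()) {
    $seen[] = $reader->nodeType . ':' . $reader->name;
}
$reader->close();

// Count how often the reader stops on <person> as an opening tag
// and as a closing tag.
$personOpen  = count(array_keys($seen, XMLReader::ELEMENT . ':person', true));
$personClose = count(array_keys($seen, XMLReader::END_ELEMENT . ':person', true));

// The full list also includes the whitespace nodes between the tags.
echo implode("\n", $seen), "\n";
```

The reader stops on <person> twice, once for the opening tag and once for the closing tag, and it also stops on the whitespace between tags. A loop that never looks at nodeType therefore runs its body on all of those stops, not just on the opening tags.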

Since I'm assuming you're reading a large XML file, you probably also want to know when you're moving on to the next person tag (via the END_ELEMENT nodeType). See my example below:

while($p->read()) {
    // check for nodeType here (opening tag only)
    if ($p->nodeType == XMLReader::ELEMENT) {
        if ($p->name == 'person') {
            if ($this->_notImported('govtrack',$p->getAttribute('id'))) {
                // $insert['indiv_*'] stuff here
            } else {
                $insert = null; // skip record because it's already imported
            }
        } else if ($p->name == 'role') {
            // role stuff here
            $startdate = $p->getAttribute('startdate');
        }

    // check for closing </person> tag here
    } else if ($p->nodeType == XMLReader::END_ELEMENT && $p->name == 'person') {
        if (isset($insert)) {
            // db insert here
        }
    }
}
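
For anyone who wants to try the pattern without a database, the following is a self-contained sketch of the same loop. It is only an illustration under stated assumptions: the XML string is a made-up sample (the second person and the id 999999 are invented), and the $people array stands in for the db insert and the _notImported() check in the controller.

```php
<?php
// Made-up sample data shaped like the govtrack.us file; the second
// person and the id 999999 are invented for illustration.
$xml = <<<XML
<people>
    <person id='400001' lastname='Abercrombie' firstname='Neil'>
        <role type='rep' startdate='1985-01-03' enddate='1986-10-18' />
        <role type='rep' startdate='1991-01-03' enddate='1992-10-09' />
    </person>
    <person id='999999' lastname='Example' firstname='Pat'>
        <role type='sen' startdate='2001-01-03' enddate='2002-11-22' />
    </person>
</people>
XML;

$reader = new XMLReader();
$reader->XML($xml);

$people  = [];   // finished records; stands in for the database insert
$current = null; // the <person> being assembled, or null outside of one

while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT) {
        if ($reader->name === 'person') {
            // Opening <person> tag: start a new record.
            $current = [
                'id'    => $reader->getAttribute('id'),
                'last'  => $reader->getAttribute('lastname'),
                'roles' => [],
            ];
        } elseif ($reader->name === 'role' && $current !== null) {
            // Each <role> is visited exactly once, in document order.
            $current['roles'][] = [
                'type'      => $reader->getAttribute('type'),
                'startdate' => $reader->getAttribute('startdate'),
                'enddate'   => $reader->getAttribute('enddate'),
            ];
        }
    } elseif ($reader->nodeType === XMLReader::END_ELEMENT && $reader->name === 'person') {
        // Closing </person> tag: the record is complete.
        $people[] = $current;
        $current  = null;
    }
}
$reader->close();

// Print a one-line summary per person.
foreach ($people as $person) {
    echo $person['id'], ': ', count($person['roles']), " role(s)\n";
}
```

Collecting the roles on the current person and flushing the record at the closing tag keeps only one person in memory at a time, which is the main reason to use XMLReader on a larger file.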

By the way, the curly quotes (‘ ’) in your getAttribute() calls must be replaced with straight quotes (') if you want this to work.