Extracting DOM elements from HTML using PHP Simple HTML DOM Parser

Question:

I'm trying to extract the links to the articles, including their text, from this site using PHP Simple HTML DOM Parser.

I want to extract all h2 tags for the articles on the main page, and I'm trying to do it this way:

    $html = file_get_html('http://www.winbeta.org');
    $articles = $html->getElementsByTagName('article');
    $a = null;

    foreach ($articles->find('h2') as $header) {
                $a[] = $header;
    }

    print_r($a);

According to the manual, it should first get all the content inside the article tags, then extract the h2 from each article and save it in an array. But instead it gives me:

EDIT

Answer:

There are several problems:

1. getElementsByTagName apparently returns a single node, not an array, so it will not work when the page has more than one article tag. Use find instead, which does return an array;
2. Once you make that switch, you cannot call find on the result of find, because that result is an array rather than a node. Call it on each individual matched article tag, or better, pass a combined selector to find;
3. Main issue: you must retrieve the text content of the node explicitly with ->plaintext, otherwise you get the object representation of the node, with all its attributes and internals;
4. Some of the text contains HTML entities (the encoded form of characters like ’). These can be decoded with html_entity_decode.

So this code should work:

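    $html = file_get_html('http://www.winbeta.org'); // same page as in the question

    $a = array();
    foreach ($html->find('article h2') as $h2) { // any h2 within an article
        $a[] = html_entity_decode($h2->plaintext); // plain text, entities decoded
    }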
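
To see what html_entity_decode contributes in that loop, here is a minimal sketch that runs without the parser. The sample strings are hypothetical stand-ins for $h2->plaintext values, and the explicit flags and encoding are one reasonable choice rather than a requirement:

```php
<?php
// Hypothetical sample strings standing in for $h2->plaintext values.
$rawTitles = [
    'Microsoft&rsquo;s new build',
    'Tips &amp; tricks',
    'What&#8217;s next',
];

// Decode named and numeric entities into UTF-8 characters.
$titles = array_map(
    function ($text) {
        return html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    },
    $rawTitles
);

print_r($titles);
```

Without the decoding step, the array would keep the raw entity text exactly as it appears in the page source.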

Using array_map, you could also do it like this:

    $a = array_map(function ($h2) { return html_entity_decode($h2->plaintext); },
                   $html->find('article h2'));

If you need to retrieve other tags within articles as well, to store their texts in different arrays, then you could do as follows:

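    $a = array(); // texts of the h2 tags
    $b = array(); // texts of the h3 tags
    foreach ($html->find('article') as $article) {
        // find is called on each individual article node, not on the array of articles
        foreach ($article->find('h2') as $h2) {
            $a[] = html_entity_decode($h2->plaintext);
        }
        foreach ($article->find('h3') as $h3) {
            $b[] = html_entity_decode($h3->plaintext);
        }
    }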
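
The same nested-loop pattern can be tried without downloading the page or installing the library. The sketch below is not Simple HTML DOM Parser: it swaps in PHP's built-in DOMDocument class and a hypothetical inline HTML sample, only to illustrate querying each article node separately:

```php
<?php
// Hypothetical sample markup; the real page structure may differ.
$sample = '<html><body>'
    . '<article><h2>First &amp; foremost</h2><h3>Sub one</h3></article>'
    . '<article><h2>Second</h2></article>'
    . '</body></html>';

// The libxml HTML parser reports HTML5 tags such as <article> as errors;
// collect those internally instead of emitting warnings.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($sample);
libxml_clear_errors();

$a = array(); // texts of the h2 tags
$b = array(); // texts of the h3 tags
foreach ($doc->getElementsByTagName('article') as $article) {
    // Query each article element on its own, as in the loop above.
    foreach ($article->getElementsByTagName('h2') as $h2) {
        $a[] = trim($h2->textContent);
    }
    foreach ($article->getElementsByTagName('h3') as $h3) {
        $b[] = trim($h3->textContent);
    }
}

print_r($a);
print_r($b);
```

Here textContent already holds decoded text, so no html_entity_decode call is needed in this variant.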
