Extracting DOM elements from HTML with PHP Simple HTML DOM Parser
I'm trying to extract the links to the articles, including their text, from this site using PHP Simple HTML DOM Parser.
I want to extract all h2 tags for the articles on the main page, and I'm trying to do it this way:
$html = file_get_html('http://www.winbeta.org');
$articles = $html->getElementsByTagName('article');
$a = null;
foreach ($articles->find('h2') as $header) {
    $a[] = $header;
}
print_r($a);
According to the manual, it should first get all the content inside the article tags, then extract the h2 from each article and save it in an array. But instead it gives me:
There are several problems:
- getElementsByTagName apparently returns a single node, not an array, so it would not work if you have more than one article tag on the page. Instead, use find, which does return an array;
- But once you make that switch, you cannot call find on the result of find (that result is an array, not a node), so you should call it on each individual matched article tag or, better, pass a combined selector as the argument to find;
- Main issue: you must retrieve the text content of the node explicitly with ->plaintext, otherwise you get the object representation of the node, with all its attributes and internals;
- Some of the text contains HTML entities, such as the one for ’. These can be decoded with html_entity_decode.
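As a quick illustration of the last point, here is a minimal standalone sketch of html_entity_decode. The sample string is made up for the example, not taken from the site, and the flags and encoding are passed explicitly on the assumption that you want the result not to depend on the PHP version's defaults:

```php
<?php
// Hypothetical heading text as it might appear in the page source.
$raw = 'Microsoft&rsquo;s new build &amp; more';

// ENT_QUOTES | ENT_HTML5 decodes both quote styles and HTML5 named entities;
// 'UTF-8' sets the encoding of the decoded string.
$decoded = html_entity_decode($raw, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $decoded, PHP_EOL; // prints: Microsoft’s new build & more
```

If the decoded text still shows entity codes, the encoding argument is the first thing worth checking.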
So this code should work:
$a = array();
foreach ($html->find('article h2') as $h2) { // any h2 within article
    $a[] = html_entity_decode($h2->plaintext);
}
Using array_map, you could also do it like this:
$a = array_map(function ($h2) {
    return html_entity_decode($h2->plaintext);
}, $html->find('article h2'));
If you need to retrieve other tags within articles as well, to store their texts in different arrays, then you could do as follows:
$a = array();
$b = array();
foreach ($html->find('article') as $article) {
    foreach ($article->find('h2') as $h2) {
        $a[] = html_entity_decode($h2->plaintext);
    }
    foreach ($article->find('h3') as $h3) {
        $b[] = html_entity_decode($h3->plaintext);
    }
}
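Since the question also asks for the links and not only the text, here is a rough sketch of collecting both. Note that it does not use Simple HTML DOM: it swaps in PHP's built-in DOMDocument and DOMXPath so that it runs without any extra library. The markup is a hypothetical stand-in for the real page, and it assumes each article heading wraps an a element, which may not match the actual site:

```php
<?php
// Hypothetical markup standing in for the real page.
$markup = <<<HTML
<html><body>
<article><h2><a href="/news/one">First &amp; foremost</a></h2></article>
<article><h2><a href="/news/two">Second story</a></h2></article>
<div><h2><a href="/other">Not an article</a></h2></div>
</body></html>
HTML;

$doc = new DOMDocument();
// libxml warns about tags it does not know (such as article);
// collect those warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($markup);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$links = array();
// Any link with an href inside an h2 inside an article.
foreach ($xpath->query('//article//h2//a[@href]') as $anchor) {
    $links[] = array(
        'text' => trim($anchor->textContent), // entities are already decoded here
        'href' => $anchor->getAttribute('href'),
    );
}

print_r($links);
```

With this approach html_entity_decode is not needed, because textContent returns the text with entities already resolved.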