How to find text outside hyperlinks using PHP Simple HTML DOM Parser

Question:

I want to parse HTML into a DOM tree and find all the text that is NOT inside <a> tags. I googled it and found "PHP Simple HTML DOM Parser", which seems able to parse HTML into a DOM tree. I would like to find the text NOT inside <a> tags, but I can only find the elements that are inside <a> tags. (P.S. It doesn't support the CSS3 :not selector yet.) Does anyone have experience with this? Thank you.


I hope I'm not misunderstanding the question, but can't you use the built-in DOM functions for PHP to find the text inside the <a> tags?

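$doc = new DOMDocument();
// Fetch and parse the remote page (the URL is a placeholder)
$doc->loadHTMLFile("http://blahblah.com/blah.html");
// Collect every <a> element and print its text
$elem_list = $doc->getElementsByTagName("a");
foreach ($elem_list as $elem) {
    echo $elem->textContent;
}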

In that case I would remove all <a> tags and their contents (for example with regular expressions) and then load the resulting HTML into your DOM parser of choice.

Update: Even better, immediately parse the HTML and use the built-in functions to remove the <a> tags, or loop through all tags and just skip the <a> tags. Regex with HTML should be avoided.
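For what it's worth, here is a minimal sketch of the "skip the <a> tags" idea using PHP's built-in DOMDocument and DOMXPath instead of Simple HTML DOM. The helper name textOutsideLinks and the sample markup are made up for illustration. It assumes the input is a non-empty HTML string, and it will also return text from elements such as <script> or <style> if they appear in the body.

```php
<?php
// Hypothetical helper: returns the trimmed text nodes that have no <a> ancestor.
function textOutsideLinks(string $html): array
{
    $doc = new DOMDocument();
    // Suppress the warnings libxml raises on imperfect real-world HTML.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $texts = [];
    // Select text nodes under <body> that are not nested inside any <a> element.
    foreach ($xpath->query('//body//text()[not(ancestor::a)]') as $node) {
        $value = trim($node->nodeValue);
        if ($value !== '') {
            $texts[] = $value;
        }
    }
    return $texts;
}

// Sample input, made up for illustration.
$sample = '<p>Hello <a href="#">skip me</a> world <b>bold <a href="#">also skipped</a></b></p>';
print_r(textOutsideLinks($sample));
```

The XPath predicate not(ancestor::a) does the filtering that the missing CSS3 :not selector would otherwise do, so nothing has to be removed from the document first.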

I have used this class many times. It's an excellent solution for parsing HTML/DOM in PHP.

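// Assumes simple_html_dom.php is available on the include path
require_once 'simple_html_dom.php';

$html = new simple_html_dom();
// Load your HTML as a string
$html->load('........ HTML ..........');
$text = '';
// Append the contents of every <a> element
foreach ($html->find('a') as $a) {
    $text .= $a->innertext;
}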

The variable $text now contains all the text inside the <a> tags. Hope this helps.