How to capture links with optional whitespace in PHP? [duplicate]
This question already has an answer here: How do you parse and process HTML/XML in PHP? (30 answers)
Using file_get_contents, I get the HTML code of a URL:
$html = file_get_contents($url);
Now I would like to capture the href links.
The HTML code is:
<li class="four-column mosaicElement">
<a href="https://example.com" title="Lorem ipsum">
...
</a>
</li>
<li class="four-column mosaicElement">
<a href="https://example.org" title="Lorem ipsum">
...
</a>
</li>
So I'm using this:
preg_match_all('/class=\"four-column mosaicElement\"><a href=\"(.+?)\" title=\"(.+?)"/m', $html, $urls, PREG_SET_ORDER, 0);
foreach ($urls as $key => $url) {
echo $url[1];
}
This finds no matches, presumably because of the whitespace between the <li> and <a> tags. How do I solve this problem?
Here, we can also use an expression with a positive lookahead that tolerates optional whitespace around the URL inside the href attribute:
(?=class="four-column mosaicElement")[\s\S]*?href="\s*(https?[^\s]+)\s*"
The desired URLs are in the first capture group:
(https?[^\s]+)
TEST
$re = '/(?=class="four-column mosaicElement")[\s\S]*?href="\s*(https?[^\s]+)\s*"/m';
$str = '<li class="four-column mosaicElement">
<a href="https://example.com" title="Lorem ipsum">
...
</a>
</li>
<li class="four-column mosaicElement">
<a href="https://example.org" title="Lorem ipsum">
<li class="four-column mosaicElement">
<a href=" https://example.org " title="Lorem ipsum">
<li class="four-column mosaicElement">
<a href=" https://example.org " title="Lorem ipsum">
...
</a>
</li>
';
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
foreach ($matches as $key => $url) {
    echo $url[1] . "\n";
}
Output
https://example.com
https://example.org
https://example.org
https://example.org
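Since only the first capture group is needed, the same idea can be sketched without PREG_SET_ORDER, so that the URLs come back as one flat array. This is a variant under assumptions, not the answer's exact pattern: the character class is narrowed to [^\s"] so the capture cannot run past the closing quote, and the sample string is shortened for illustration.

```php
<?php
// Variant of the pattern above; [^\s"] (instead of [^\s]) is an assumption
// added here so the captured URL can never include the closing quote.
$re = '/(?=class="four-column mosaicElement")[\s\S]*?href="\s*(https?[^\s"]+)\s*"/';

// Shortened sample input, with and without padding inside href.
$str = '<li class="four-column mosaicElement">
<a href=" https://example.com " title="Lorem ipsum">...</a>
</li>
<li class="four-column mosaicElement">
<a href="https://example.org" title="Lorem ipsum">...</a>
</li>';

// With the default PREG_PATTERN_ORDER, $matches[1] holds every capture
// of group 1, so no inner loop over match sets is needed.
preg_match_all($re, $str, $matches);
$urls = $matches[1];

echo implode(PHP_EOL, $urls), PHP_EOL;
```

The surrounding whitespace is consumed by the two \s* parts outside the group, so the captured values need no further trimming.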
I was able to get your code working just by modifying the regex pattern to the following:
class="four-column mosaicElement">\s*<a href="(.+?)" title="(.+?)"
                                  ^^^
Note carefully that I allow for any amount of whitespace between the class attribute of the outer tag (<li>) and the inner anchor.
Here is your updated script:
$html = "<li class=\"four-column mosaicElement\">
<a href=\"https://example.com\" title=\"Lorem ipsum\">
</a>
</li>
<li class=\"four-column mosaicElement\">
<a href=\"https://example.org\" title=\"Lorem ipsum\">
</a>
</li>";
preg_match_all('/class="four-column mosaicElement">\s*<a href="(.+?)" title="(.+?)"/m', $html, $urls, PREG_SET_ORDER, 0);
foreach ($urls as $key => $url) {
    echo $url[1] . "\n";
}
This prints:
https://example.com
https://example.org
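If both the href and the title are wanted, the pattern can be wrapped in a small helper. This is only a sketch: the function name extractMosaicLinks and the returned array shape are hypothetical, and the m flag is dropped because the pattern contains no ^ or $ anchors for it to affect.

```php
<?php
// Hypothetical helper (name and return shape are illustrative assumptions).
function extractMosaicLinks(string $html): array
{
    // \s* tolerates any whitespace, including newlines, between the
    // closing ">" of the <li> tag and the opening "<a".
    $pattern = '/class="four-column mosaicElement">\s*<a href="(.+?)" title="(.+?)"/';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $links = [];
    foreach ($matches as $match) {
        // Group 1 is the href value, group 2 is the title value.
        $links[] = ['href' => $match[1], 'title' => $match[2]];
    }
    return $links;
}

$html = '<li class="four-column mosaicElement">
<a href="https://example.com" title="Lorem ipsum">
</a>
</li>';

foreach (extractMosaicLinks($html) as $link) {
    echo $link['href'], ' (', $link['title'], ')', PHP_EOL;
}
```

Like the pattern it wraps, this still depends on the exact attribute order and quoting of the markup, so it suits a known, stable page rather than arbitrary HTML.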
Another option is to use DOMXPath with an xpath expression that finds all list items with both class names and then gets the anchors:
//li[contains(@class, 'four-column') and contains(@class, 'mosaicElement')]/a
For example:
$string = <<<DATA
<li class="four-column mosaicElement">
<a href="https://example.com" title="Lorem ipsum">
</a>
</li>
<li class="four-column mosaicElement">
<a href="https://example.org" title="Lorem ipsum">
</a>
</li>
DATA;
$dom = new DOMDocument();
$dom->loadHTML($string);
$xpath = new DOMXPath($dom);
foreach($xpath->query("//li[contains(@class, 'four-column') and contains(@class, 'mosaicElement')]/a") as $v) {
    echo $v->getAttribute("href") . PHP_EOL;
}
Result
https://example.com
https://example.org
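One caveat with contains(@class, ...) is that it tests substrings, so 'four-column' would also match a class such as 'four-column-wide'. A possible refinement, sketched below, pads the class attribute with spaces and matches whole tokens. The helper name findAnchorHrefs is hypothetical, and the sketch assumes the class names contain no quote characters. It also silences libxml warnings, which pages fetched with file_get_contents often trigger, and trims padding around each href.

```php
<?php
// Hypothetical helper; the name and the whole-token matching are
// illustrative assumptions, not part of the original answer.
function findAnchorHrefs(string $html, array $classes): array
{
    $dom = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // One whole-token test per class name (assumes names without quotes).
    $tests = [];
    foreach ($classes as $class) {
        $tests[] = "contains(concat(' ', normalize-space(@class), ' '), ' $class ')";
    }
    $query = '//li[' . implode(' and ', $tests) . ']/a[@href]';

    $xpath = new DOMXPath($dom);
    $hrefs = [];
    foreach ($xpath->query($query) as $anchor) {
        // trim() drops optional whitespace around the URL.
        $hrefs[] = trim($anchor->getAttribute('href'));
    }
    return $hrefs;
}

$html = '<li class="four-column mosaicElement">
<a href=" https://example.com " title="Lorem ipsum"></a>
</li>';

foreach (findAnchorHrefs($html, ['four-column', 'mosaicElement']) as $href) {
    echo $href, PHP_EOL;
}
```

Because the parser handles whitespace between tags and attribute order on its own, this approach does not need the optional-space handling that the regex versions rely on.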