Scraping specific text from a web page using XPath
I've searched and tried several approaches, but I can't work out why my code fails to find most of the information on the web page.
Page to scrape: https://m.safeguardproperties.com/
Info needed: Version number for PhotoDirect for Apple (currently 4.4.0)
XPath to the text needed (I think): /html/body/div[1]/div[2]/div[1]/div[4]/div[3]/a
Attempts:
<?php
$file = "https://m.safeguardproperties.com/";
$doc = new DOMDocument();
$doc->loadHTMLFile($file);
$xpath = new DOMXpath($doc);
$elements = $xpath->query("/html/body/div[1]/div[2]/div[1]/div[4]/div[3]/a");
echo "<PRE>";
if (!is_null($elements)) {
    foreach ($elements as $element) {
        var_dump($element);
        echo "<br/>[" . $element->nodeName . "]";
        $nodes = $element->childNodes;
        foreach ($nodes as $node) {
            echo $node->nodeValue . "\n";
        }
    }
}
echo "</PRE>";
?>
Second attempt:
<?php
$file = "https://m.safeguardproperties.com/";
$doc = new DOMDocument();
$doc->loadHTMLFile($file);
echo '<pre>';
// trying to find all links in document to see if I can see the correct one
$links = [];
$arr = $doc->getElementsByTagName("a");
foreach ($arr as $item) {
    $href = $item->getAttribute("href");
    $text = trim(preg_replace("/[\r\n]+/", " ", $item->nodeValue));
    $links[] = [
        'href' => $href,
        'text' => $text
    ];
}
var_dump($links);
echo '</pre>';
?>
For that particular website, the version numbers are loaded client-side from JSON data, so you won't find them in the base document. The data comes from:
http://m.safeguardproperties.com/js/photodirect.json
This was located by comparing the original document source to the finished DOM and inspecting the network activity in the developer console.
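The same comparison can be approximated from PHP by checking whether the text you want appears in any text node of the HTML the server actually sends. This is only a minimal sketch: `staticHtmlContains` is a hypothetical helper, and the two markup strings are made-up stand-ins for the served source and the finished DOM, not copies of the real page.

```php
<?php
// Hypothetical helper: true when $needle appears in any text node of $html.
function staticHtmlContains(string $html, string $needle): bool
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//text()') as $textNode) {
        if (strpos($textNode->nodeValue, $needle) !== false) {
            return true;
        }
    }
    return false;
}

// Assumed markup: the link is served empty and filled in later by a script.
$served   = '<html><body><div><a id="ios-version" href="#"></a></div></body></html>';
$rendered = '<html><body><div><a id="ios-version" href="#">4.4.0</a></div></body></html>';

var_dump(staticHtmlContains($served, '4.4.0'));   // bool(false)
var_dump(staticHtmlContains($rendered, '4.4.0')); // bool(true)
```

If the check fails on the served source but the text is visible in the browser, the value is almost certainly injected by JavaScript, and no XPath against the base document will find it.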
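<?php
// Fetch the JSON that the page's JavaScript loads, then read the iOS version.
$url = 'https://m.safeguardproperties.com/js/photodirect.json';
$json = file_get_contents($url);      // returns false on failure
$object = json_decode($json);
echo $object->ios->version; // 4.4.0 at the time of writing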
Please be considerate of other people's websites and cache the result of your GET request instead of fetching it on every page load.
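One way to cache the request is a small file cache with a time-to-live. The sketch below is an assumption about how you might do it, not part of the site or of any library: `cachedGet`, the one-hour default TTL, and the temp-file location are all hypothetical. The fetcher is injectable so the example runs without network access, and the stub payload only mirrors the `ios->version` shape used above.

```php
<?php
// Hypothetical helper: returns the body for $url, reusing $cacheFile while
// it is younger than $ttl seconds. $fetch defaults to file_get_contents.
function cachedGet(string $url, string $cacheFile, int $ttl = 3600, ?callable $fetch = null): string
{
    clearstatcache(true, $cacheFile); // avoid stale file metadata
    if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $ttl) {
        $cached = file_get_contents($cacheFile);
        if ($cached !== false) {
            return $cached;
        }
    }

    $fetch = $fetch ?? 'file_get_contents';
    $body = $fetch($url);
    if ($body === false) {
        throw new RuntimeException("Could not fetch $url");
    }

    file_put_contents($cacheFile, $body, LOCK_EX);
    return $body;
}

// Stub fetcher standing in for the real request (assumed payload shape).
$calls = 0;
$stub = function (string $url) use (&$calls) {
    $calls++;
    return '{"ios":{"version":"4.4.0"}}';
};

$cacheFile = tempnam(sys_get_temp_dir(), 'pd_');
unlink($cacheFile); // start with no cache file

$url = 'https://m.safeguardproperties.com/js/photodirect.json';
$first  = cachedGet($url, $cacheFile, 3600, $stub);
$second = cachedGet($url, $cacheFile, 3600, $stub); // served from the cache

echo json_decode($second)->ios->version, "\n"; // 4.4.0
echo $calls, "\n";                              // 1
```

In real use you would omit the fourth argument so that `file_get_contents` performs the request, and pick a TTL that matches how often the version is likely to change.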