PHP Simple HTML DOM Parser returns false on a valid URL

Problem description:

I'm trying the following:

$url = 'https://www.tripadvisor.es/Hotels-g187514-Madrid-Hotels.html';

$ta_html = file_get_html($url);
var_dump($ta_html);

This returns false, whereas the same code works and correctly gets the HTML for:

$url = 'https://www.tripadvisor.es/Hotels-g294316-Lima_Lima_Region-Hotels.html#ACCOM_OVERVIEW';

My first thought was that there was a redirect, but I checked the headers with curl: the status is 200 OK and the headers look the same in both cases. What could be happening? How can it be solved?

This seems to be a duplicate of another question, "Simple HTML DOM returning false", which is also unanswered.


It looks like the HTML DOM parser is failing because the HTML is larger than the library's maximum file size. When you call file_get_html(), it checks the size of the fetched content against its MAX_FILE_SIZE constant. So before calling any HTML DOM parser methods, increase the maximum file size used by the library:

define('MAX_FILE_SIZE', 1200000); // or larger if needed, default is 600000

Also, as you found out, you can work around the file size check by doing this:

$html = new simple_html_dom();
$html->load($str);
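
In practice the define() probably has to run before the library file is loaded, since a PHP constant cannot be changed once it is set. Here is a minimal sketch of that behaviour. DEMO_MAX_FILE_SIZE is a hypothetical stand-in for the library's MAX_FILE_SIZE, and the guarded define() on the second statement is an assumption about how the library sets its default. A version that defines it unconditionally would instead trigger a notice or warning, and the first value would still win.

```php
<?php
// The caller raises the limit first, before the library is included.
define('DEMO_MAX_FILE_SIZE', 1200000);

// Stand-in for the library setting its default later on.
// It is skipped because the constant already exists.
defined('DEMO_MAX_FILE_SIZE') || define('DEMO_MAX_FILE_SIZE', 600000);

// The caller's value is the one that stays in effect.
echo DEMO_MAX_FILE_SIZE, PHP_EOL;
```

If the order is reversed, the library's default is already fixed and the later define() has no effect.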

Use file_get_contents() instead; it works for me.

$url = "https://www.tripadvisor.es/Hotels-g187514-Madrid-Hotels.html";
file_put_contents("hello.html", file_get_contents($url));

$html = file_get_html("hello.html");

So I found a workaround by doing this:

$base = $url;
$curl = curl_init();
// Note: this disables TLS certificate verification; avoid it outside of testing.
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($curl, CURLOPT_HEADER, false);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_URL, $base);
curl_setopt($curl, CURLOPT_REFERER, $base);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
$str = curl_exec($curl);
curl_close($curl);

$html = new simple_html_dom();
$html->load($str);

Truth be told, I don't know exactly why this works or what the original problem was, and I would appreciate it if anyone could point that out.

It looks like this is happening because of this check in the file_get_html() function in simple_html_dom.php:

if (empty($contents) || strlen($contents) > MAX_FILE_SIZE)
{
    return false;
}

It might be that the length of the content is greater than MAX_FILE_SIZE, in which case file_get_html() returns false even though the request itself succeeded.
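
To confirm which branch of that check is being hit, something like the following could help. It is only a sketch: explain_load_failure() is a hypothetical helper, not part of the library, and its $limit parameter stands in for MAX_FILE_SIZE (600000 is the default mentioned above).

```php
<?php
// Hypothetical helper (not part of the library): mirrors the quoted check
// and reports which condition would make file_get_html() return false.
// $limit stands in for the library's MAX_FILE_SIZE constant.
function explain_load_failure($contents, $limit = 600000)
{
    // file_get_contents() returns false when the fetch itself fails.
    if ($contents === false) {
        return 'fetch failed';
    }
    // Same emptiness test as the library's check.
    if (empty($contents)) {
        return 'empty response';
    }
    $size = strlen($contents);
    if ($size > $limit) {
        return "too large: $size bytes, limit $limit";
    }
    return 'ok';
}

// Usage (needs network access):
// echo explain_load_failure(file_get_contents($url)), PHP_EOL;
```

If the Madrid page reports "too large" while the Lima page reports "ok", that would support the size explanation.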