Regular expression to select specific HTML elements [Curl / PHP]

Problem description:

I am trying to scrape some specific data and output it on my site.

What I want to extract:

I'm using cURL in PHP, and this is the regular expression I'm trying to use, but it gives me the error "Fatal error: Allowed memory size of ram bytes exhausted", which I take to mean it is pulling in a lot of data.

code:

preg_match_all('!<th scope="(\b[a-zA-Z]+\b)">(\b[a-zA-Z]+\b)<\/th><td><a href="\/wiki\/(\b[a-zA-Z]+\b)" title="(\b[a-zA-Z]+\b)">(\b[a-zA-Z]+\b)<\/a>!',$result,$cap_matches);
$cap_name = array_values(array_unique($cap_matches[0]));
echo $cap_name[0];

I've tried reducing the regular expression to match only the "a ..." tag, but then I get a lot of results back. I just want to grab the capital.

Do not parse HTML with regex. Use a proper HTML parser instead, like DOMDocument.

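$domd = new DOMDocument();
// loadHTML() is an instance method; @ silences warnings about malformed real-world HTML.
@$domd->loadHTML($result);
unset($result); // free the raw HTML string as soon as it has been parsed
$xp = new DOMXPath($domd);
// Find the <th> whose text is exactly "Capital", then the first link in its sibling <td>.
$capital = $xp->query('//th[text()="Capital"]/following-sibling::td/a')->item(0)->getAttribute("title");
unset($domd, $xp);
var_dump($capital);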
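
As a self-contained illustration of the DOMDocument approach, here is a minimal sketch run against a made-up snippet. The sample markup, the extractCapital() helper name, and the choice to return null when no matching row exists are all assumptions for illustration; the real page's structure may differ.

```php
<?php
// Hypothetical sample markup standing in for the fetched page (an assumption).
$html = '<table><tr><th scope="row">Capital</th>'
      . '<td><a href="/wiki/Paris" title="Paris">Paris</a></td></tr></table>';

// Hypothetical helper: returns the title of the first link next to the
// "Capital" header cell, or null if no such row is found.
function extractCapital(string $html): ?string
{
    $dom = new DOMDocument();
    // Collect libxml parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query('//th[text()="Capital"]/following-sibling::td/a');
    if ($nodes === false || $nodes->length === 0) {
        return null; // no matching row
    }
    $link = $nodes->item(0);
    return $link instanceof DOMElement ? $link->getAttribute('title') : null;
}

var_dump(extractCapital($html));
```

Checking the node list before calling item(0) avoids a fatal "call to a member function on null" error when the page layout changes and the query matches nothing.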

As for avoiding OOM errors, try wrapping your most memory-hungry operations in smaller functions, letting PHP clean everything up on function exit, or unset() your big variables as soon as they are no longer needed. (I wouldn't normally use unset() in the code above, but since you were specifically complaining about OOM errors, I did.) Another obvious solution is to increase the memory limit, e.g.:

// ini_set() returns the old value on success and false on failure.
if (false === ini_set("memory_limit", "1G")) {
    throw new \RuntimeException('error, unable to change memory limit!');
}

This should set the memory limit to 1 gigabyte, up from the default of 128 megabytes.
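
The earlier suggestion to wrap memory-hungry work in a small function could look roughly like the sketch below. fetchPage() is a hypothetical stand-in for the cURL request (it just builds a large string so the example runs offline), and capitalFromPage() is likewise an invented name. The idea is that the big HTML string and the DOM tree are local variables, so they are released when the function returns and only the short result string survives.

```php
<?php
// Hypothetical stand-in for the cURL fetch: builds a large HTML string offline.
function fetchPage(): string
{
    return '<table>' . str_repeat('<tr><td>filler</td></tr>', 20000)
         . '<tr><th>Capital</th><td><a title="Paris">Paris</a></td></tr></table>';
}

// All large values ($html, $dom, $xp) are locals of this function.
function capitalFromPage(): ?string
{
    $html = fetchPage();
    $dom = new DOMDocument();
    @$dom->loadHTML($html); // @ silences warnings about imperfect markup
    unset($html);           // the raw string is no longer needed once parsed

    $xp = new DOMXPath($dom);
    $node = $xp->query('//th[text()="Capital"]/following-sibling::td/a')->item(0);

    // Only this short string leaves the function; the locals are freed on return.
    return $node instanceof DOMElement ? $node->getAttribute('title') : null;
}

$capital = capitalFromPage();
var_dump($capital);
// Peak usage shows how much memory the heavy step needed at its worst.
echo 'peak bytes: ', memory_get_peak_usage(), PHP_EOL;
```

Comparing memory_get_peak_usage() against the configured memory_limit is one way to judge whether raising the limit is actually necessary.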