Rogue element when parsing HTML with DOMDocument

Problem description:

Let's assume my $html looks like this:

<!DOCTYPE html>
<html>
<head>
    <script type="text/javascript">document.createElement("video");document.createElement("audio");document.createElement("track");</script>
    <script type="text/javascript" src="/gui/default/tinymcecontent.js"></script>
    <script type="text/javascript" src="/includes/js/video-js/video.min.js"></script>
    <link rel="stylesheet" href="/includes/js/video-js/video-js.css" />
    <script type="text/javascript">document.createElement("video");document.createElement("audio");document.createElement("track");</script>
    <script type"text/javascript" src="/includes/js/video-js/video.js"></script/>
    <link rel="stylesheet" href="/includes/js/video-js/video-js.css" />
</head>
<body style="font-family: arial;font-size: 12px;">
    <p> </p>
    <table width="100%">        
    </table>
</body>
</html>

When I try to parse only the elements inside the body tag with these commands:

$dom = new DOMDocument();

libxml_use_internal_errors(true);
$dom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));
libxml_use_internal_errors(false);

$full_dom = $dom->getElementsByTagName('body')->item(0);

The result of

$dom->saveHTML($full_dom)

is

<body>
<p>\/&gt;<link rel=\"stylesheet\" href=\"\/includes\/js\/video-js\/video-js.css\"><\/p>
<p>\u00a0<\/p>
<table width=\"100%\"><\/table>
<\/body>

Element

<p>\/&gt;<link rel=\"stylesheet\" href=\"\/includes\/js\/video-js\/video-js.css\"><\/p>

comes from where? Everything else is fine; only this element gets transferred from the head tag into the body.

It comes from this line:

<script type"text/javascript" src="/includes/js/video-js/video.js"></script/>

It is malformed: the = after type is missing and the closing tag is written as </script/>. It should be:

<script type="text/javascript" src="/includes/js/video-js/video.js"></script>

Check the errors after $dom->loadHTML() to see what happened:

foreach (libxml_get_errors() as $error) {
    print_r($error);
}
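
For a more readable report than print_r(), something like the following may help. It is only a sketch: the $html sample is a shortened, hypothetical stand-in for the page in the question, and the exact messages (or whether any are reported at all) depend on the libxml2 version PHP is built against. The errors are read while internal error handling is still switched on, then cleared.

```php
<?php
// Hypothetical, shortened input that keeps the malformed script tag from the question.
$html = <<<'HTML'
<!DOCTYPE html>
<html>
<head>
    <script type"text/javascript" src="/includes/js/video-js/video.js"></script/>
    <link rel="stylesheet" href="/includes/js/video-js/video-js.css" />
</head>
<body>
    <p> </p>
</body>
</html>
HTML;

// Collect parser errors internally instead of emitting PHP warnings,
// remembering the previous setting so it can be restored afterwards.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Grab the collected errors, then empty the buffer and restore the old setting.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Each entry is a LibXMLError object with line, column and message properties.
foreach ($errors as $error) {
    printf("line %d, column %d: %s\n", $error->line, $error->column, trim($error->message));
}
```

The reported line numbers refer to the string passed to loadHTML(), so they should point at the broken script tag in the input.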