Using DOMXPath to clean up deprecated HTML code (converting nested <div> tags to <p> tags)

Category: Technical Q&A • 2022-03-09 23:28:56
Question:

I'm trying to read Rich Text stored in an old MS Access database into a new PHP web app. The sanitised data will be displayed to users using CKEditor, which is quite strict on parsing standards compliant HTML code. However, the data stored in MS Access is often ill-formatted or uses deprecated HTML code.

Below is an example piece of data I am trying to sanitise:

<div align="right">Previous claim $ &nbsp;&nbsp;935.00<div align="right">&nbsp;&nbsp;This claim $1,572.50</div></div>

This data is meant to be two lines of right-justified text. However, MS Access has used the deprecated align attribute to style the <div> tags instead of a style attribute, and has nested them when, in this scenario, they should be sequential.

To turn this example data into two lines of text that are both right-justified and that CKEditor will read and display as intended (i.e. text appears as right justified), I am trying to replace the <div> tags with <p> tags, and inject an inline style attribute with right text-align to replace the deprecated align attribute.

I am using PHP's DOMXPath to clean up the data, with the following code:

$dom = new DOMDocument();
$dom->loadHTML($dataForCleaning, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$xpath = new DOMXPath($dom);

foreach ($xpath->query('//div[@align]') as $node) {
    $alignment = $node->getAttribute('align');

    $newNode = $dom->createElement('p');
    $newNode->setAttribute("style", "text-align:".$alignment);
    $node->parentNode->insertBefore($newNode, $node);

    foreach ($node->childNodes as $child) {
        $newNode->appendChild($child);
    }

    $node->parentNode->removeChild($node);
}

I am using insertBefore rather than appendChild to keep the sequence of elements the same, but this is what's causing the issues in this nested data example.

For non-nested <div> tags as the input data, the sanitised output HTML is correct. However, in this nested <div> example, the output ends up being:

<p style="text-align:right">Previous claim $ &nbsp;&nbsp;935.00</p>

Note that the second line of text (This claim...) has been removed, as it was within a nested <div> that is a child of the parent <div>.

I don't mind if the resultant <p> tags remain nested, as CKEditor ends up cleaning these up, but I do need to make sure I'm not losing data like this current code does.

Thanks in advance for any help and guidance. -Mark

Answer

There are a couple of things I've changed. First, rather than appending the existing child node, I clone it and append the copy (in $newNode->appendChild($child->cloneNode(true));). Second, because you are moving the enclosed node, I think the XPath result no longer points to the moved node. So instead, while copying the child nodes I check for the same pattern of a <div align="right"> node, and if I find one I create a new node in the new format and add that instead:

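foreach ($xpath->query('//div[@align]') as $node) {
    $alignment = $node->getAttribute('align');

    // Build the replacement <p> and place it just before the old <div>.
    $newNode = $dom->createElement('p');
    $newNode->setAttribute("style", "text-align:".$alignment);

    $node->parentNode->insertBefore($newNode, $node);
    foreach ($node->childNodes as $child) {
        // getAttribute() returns "" when the attribute is missing,
        // so this check is safe for a <div> without align.
        if ( $child instanceof DOMElement && $child->localName == "div"
                && $child->getAttribute("align") == "right" )    {
            // Nested aligned <div>: rebuild it as a <p> holding the same text.
            // A text node is used so characters such as & are escaped safely.
            $subNode = $dom->createElement('p');
            $subNode->appendChild($dom->createTextNode($child->nodeValue));
            $subNode->setAttribute("style", "text-align:".$child->getAttribute("align"));
            $newNode->appendChild($subNode);
        }
        else    {
            // Anything else: append a deep copy so the original child list is left untouched.
            $newNode->appendChild($child->cloneNode(true));
        }
    }

    $node->parentNode->removeChild($node);
}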

which for the sample you give will output...

<p style="text-align:right">
    Previous claim $ &nbsp;&nbsp;935.00
    <p style="text-align:right">&nbsp;&nbsp;This claim $1,572.50</p>
</p>
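
As an alternative, here is a minimal sketch that keeps the structure of the code in the question but handles any depth of nesting. It assumes the data loss comes from looping over childNodes, which is a live list, while appendChild moves nodes out of it, so the loop ends before it reaches the nested <div>. The function name convertAlignedDivs is made up for illustration and is not part of the original post; treat this as one possible approach rather than a definitive fix.

```php
<?php
// Hypothetical helper (not from the original post): converts every
// <div align="..."> into <p style="text-align:...">, keeping nested content.
function convertAlignedDivs(string $html): string
{
    $dom = new DOMDocument();
    $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

    $xpath = new DOMXPath($dom);

    foreach ($xpath->query('//div[@align]') as $node) {
        $newNode = $dom->createElement('p');
        $newNode->setAttribute('style', 'text-align:' . $node->getAttribute('align'));
        $node->parentNode->insertBefore($newNode, $node);

        // childNodes is a live list, so moving a child shifts the rest.
        // Taking firstChild until none remain moves every child exactly once,
        // including a nested <div>, which a later iteration then converts.
        while ($node->firstChild !== null) {
            $newNode->appendChild($node->firstChild);
        }

        $node->parentNode->removeChild($node);
    }

    return $dom->saveHTML();
}

$sample = '<div align="right">Previous claim $ &nbsp;&nbsp;935.00'
        . '<div align="right">&nbsp;&nbsp;This claim $1,572.50</div></div>';

echo convertAlignedDivs($sample);
```

For the sample in the question this should leave two <p> elements, the second nested inside the first, with no text dropped. The question says nested <p> tags are acceptable because CKEditor cleans them up.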
