php - Detecting HTML in a string and wrapping it in code tags

Problem description:

I'm having trouble handling HTML in text content. I'm looking for a method that detects the tags and wraps each run of consecutive elements inside code tags.

Don't wrap me<p>Hello</p><div class="text">wrap me please!</div><span class="title">wrap me either!</span> Don't wrap me <h1>End</h1>.

//expected result

Don't wrap me<code><p>Hello</p><div class="text">wrap me please!</div><span class="title">wrap me either!</span></code>Don't wrap me <code><h1>End</h1></code>.

Is this possible?

DOMDocument is hard to use in this specific case, since it automatically wraps text nodes in <p> tags (and adds doctype, head, html). One way is to build a pattern that works like a lexer, using the (?(DEFINE)...) feature and named subpatterns:

$html = <<<EOD
Don't wrap me<p>Hello</p><div class="text">wrap me please!</div><span class="title">wrap me either!</span> Don't wrap me <h1>End</h1>
EOD;

$pattern = <<<'EOD'
~
(?(DEFINE)
    (?<self>    < [^\W_]++ [^>]* > )
    (?<comment> <!-- (?>[^-]++|-(?!->))* -->)
    (?<cdata>   \Q<![CDATA[\E (?>[^]]++|](?!]>))* ]]> )
    (?<text>    [^<]++ )
    (?<tag>
        < ([^\W_]++) [^>]* >
        (?> \g<text> | \g<tag> | \g<self> | \g<comment> | \g<cdata> )*
        </ \g{-1} >
    )
)
# main pattern
(?: \g<tag> | \g<self> | \g<comment> | \g<cdata> )+
~x
EOD;

$html = preg_replace($pattern, '<code>$0</code>', $html);

echo htmlspecialchars($html);

The (?(DEFINE)...) feature lets you put a definition section inside a regex pattern. The definition section and the named subpatterns inside it don't match anything by themselves; they are only there to be used later in the main pattern.

(?<abcd> ...) defines a subpattern you can reuse later with \g<abcd>. In the above pattern, subpatterns defined in this way are:

  • self: describes a standalone tag that has no matching closing tag (self-closing or void tags such as <br>)
  • comment: for html comments
  • cdata: for cdata
  • text: for text (all that is not a tag, a comment, or cdata)
  • tag: for html tags that are not self-closed
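
To see the definition mechanism on its own, here is a minimal sketch with a much smaller pattern. The subpattern names year, month and day and the sample text are made up for the illustration; they are not part of the pattern above.

```php
<?php
// Hypothetical example: "year", "month" and "day" are illustrative names only.
$datePattern = <<<'EOD'
~
(?(DEFINE)
    (?<year>  \d{4} )
    (?<month> 0[1-9] | 1[0-2] )
    (?<day>   0[1-9] | [12]\d | 3[01] )
)
# main pattern: the definitions are only reached through subroutine calls
\b \g<year> - \g<month> - \g<day> \b
~x
EOD;

$text = 'Released on 2014-03-27, patched on 2014-13-01.';
preg_match_all($datePattern, $text, $matches);

// Only the first date is valid: month 13 is rejected by the "month" definition.
echo implode(', ', $matches[0]), "\n"; // 2014-03-27
```

The DEFINE block consumes nothing: the match only starts at the main pattern, which calls each definition with \g<name>.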

self:
[^\W_] is a trick to get \w without the underscore. [^\W_]++ represents the tag name and is also used in the tag subpattern.
[^>]* matches anything that is not a >, zero or more times.
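
A quick way to check the difference between \w and [^\W_] (the sample string is arbitrary):

```php
<?php
// \w accepts letters, digits and the underscore; [^\W_] excludes the underscore.
$sample = 'h1_title';

preg_match('~^\w++~', $sample, $withUnderscore);
preg_match('~^[^\W_]++~', $sample, $withoutUnderscore);

echo $withUnderscore[0], "\n";    // h1_title
echo $withoutUnderscore[0], "\n"; // h1
```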

comment:
(?>[^-]++|-(?!->))* describes all the possible content inside an html comment:

(?>          # open an atomic group
    [^-]++   # all that is not a literal -, one or more times (possessive)
  |          # OR
    -        # a literal -
    (?!->)   # not followed by -> (negative lookahead)
)*           # close and repeat the group zero or more times 
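
As a sketch, the comment subpattern can be tried in isolation (written here without the x modifier; the sample text is invented for the test):

```php
<?php
// Same construct as the "comment" subpattern, used as a standalone pattern.
$commentPattern = '~<!--(?>[^-]++|-(?!->))*-->~';

$text = 'a<!-- first - one --><b>x</b><!-- second -->z';
preg_match_all($commentPattern, $text, $m);

// A single "-" inside a comment is accepted; the group stops only before "-->".
foreach ($m[0] as $comment) {
    echo $comment, "\n";
}
// <!-- first - one -->
// <!-- second -->
```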

cdata:
All characters between \Q..\E are treated as literal characters, so special characters like [ don't need to be escaped. (This is only a trick to make the pattern more readable.)
The content allowed in CDATA is described in the same way as the content of html comments.
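
The same kind of check for the cdata subpattern, again as a standalone sketch with an invented sample string:

```php
<?php
// \Q...\E makes "<![CDATA[" literal, so neither "[" needs a backslash.
$cdataPattern = '~\Q<![CDATA[\E(?>[^]]++|](?!]>))*]]>~';

$text = 'x<![CDATA[ a[0] < b ]]>y';
preg_match($cdataPattern, $text, $m);

// A lone "]" is allowed inside the section; only "]]>" ends it.
echo $m[0], "\n"; // <![CDATA[ a[0] < b ]]>
```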

text:
[^<]++ matches all characters up to an opening angle bracket or the end of the string.

tag:
This is the most interesting subpattern. Lines 1 and 3 are the opening and the closing tag. Note that, in line 1, the tag name is captured with a capturing group. In line 3, \g{-1} refers to the content matched by the last defined capturing group ("-1" means "one on the left").
Line 2 describes the possible content between an opening and a closing tag. You can see that this description uses not only the subpatterns defined before but also the current subpattern itself, to allow nested tags.
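
Here is a reduced sketch of that idea: only text and nested tags are allowed as content (the self, comment and cdata alternatives are left out for brevity), and the sample strings are invented.

```php
<?php
// Reduced version of the "tag" subpattern: recursion plus a relative backreference.
$tagPattern = <<<'EOD'
~
(?<tag>
    < ([^\W_]++) [^>]* >        # opening tag, name captured
    (?> [^<]++ | \g<tag> )*     # text or a nested tag (recursive call)
    </ \g{-1} >                 # closing tag must repeat the captured name
)
~x
EOD;

// Properly nested tags are matched as one unit.
preg_match($tagPattern, 'x<div><p>one</p><p>two</p></div>y', $m);
echo $m[0], "\n"; // <div><p>one</p><p>two</p></div>

// Badly nested tags do not match at all.
$mismatched = preg_match($tagPattern, '<b><i>x</b></i>');
echo $mismatched, "\n"; // 0
```

Each recursive call captures its own tag name, and the outer name is restored when the call returns, which is why the closing </div> is still checked against div after the inner <p> elements.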

Once all items have been set and the definition section closed, you can easily write the main pattern.

I'm in a trouble with treating HTML in text content.

Then just escape that text:

echo htmlspecialchars($your_text_that_may_contain_html_code);
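
For example, with a made-up input string:

```php
<?php
// Every tag is turned into visible text, so nothing needs to be detected.
$text = 'Plain text <p>Hello</p> & more';
$escaped = htmlspecialchars($text);

echo $escaped, "\n"; // Plain text &lt;p&gt;Hello&lt;/p&gt; &amp; more
```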

Parsing HTML with regex is a well-known big no-no!

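This will find tags along with their closing tags, and everything in between. Use it with the case-insensitive flag, since the tag name is written as [A-Z]; the tag name has to be captured in a group so that \1 has something to refer to:

<([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1>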

You might be able to capture those elements and replace each match with itself wrapped in code tags. It may not work in every case, but you might find it sufficient for your needs if the html is fairly static.
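
A rough sketch of that replacement, under a few assumptions that are mine rather than part of the regex as given: the tag name is captured in a group so that \1 works, the i and s flags are set, and the whole thing sits in a repeated non-capturing group so that consecutive elements end up in one code tag.

```php
<?php
// Assumed adaptations: capture group for \1, "i" and "s" flags, and (?:...)+
// so that a run of consecutive elements is wrapped once.
$pattern = '~(?:<([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1>)+~is';

$html = 'Don\'t wrap me<p>Hello</p><div class="text">wrap me please!</div>'
      . ' Don\'t wrap me <h1>End</h1>.';

$result = preg_replace($pattern, '<code>$0</code>', $html);
echo htmlspecialchars($result), "\n";
```

Because .*? stops at the first closing tag with the same name, nested elements of the same type (a div inside a div, for instance) are cut short, which is the kind of case where this approach breaks down.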