ASP.NET: strip HTML tags from scraped page content, but keep the p and img tags

This post was last edited by jxnkwly on 2014-08-18 12:13:56

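As the title says: I fetched a web page with ASP.NET and have already extracted the article body from it, but the body is full of HTML tags. How can I strip those tags with a regular expression while keeping the img and p tags?
I have already removed script and style.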



 //script

 htmlCode = Regex.Replace(htmlCode, @"<script[^>]*>[\s\S]*?<\/[^>]*script>", "", RegexOptions.IgnoreCase);

 //style 

 htmlCode = Regex.Replace(htmlCode, @"<style[^>]*>[\s\S]*?<\/[^>]*style>", "", RegexOptions.IgnoreCase);

I am using



htmlCode = Regex.Replace(htmlCode, @"<(?!(img|br|p)\s+)[^<>]*?>", "", RegexOptions.IgnoreCase);

this regex to strip the tags, but I found that closing tags are removed as well: closing tags such as </p> are gone.
Experts, help!!
------Solution--------------------
Try a different approach: just extract the img and p tags.



(<img[^>]*?>\s*?</p>)

------Solution--------------------



string pattern = @"<(?!img|p|/p).*?>";   // strip all tags, leaving only img and p

str = Regex.Replace(html, pattern, string.Empty, RegexOptions.IgnoreCase);
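
A note on why the pattern in the question drops </p>: its lookahead `(?!(img|br|p)\s+)` only protects a tag whose name comes right after `<` and is followed by whitespace. `</p>` starts with `/`, so it is not protected, and a bare `<p>` with no attributes is not protected either. The `<(?!img|p|/p)` pattern above fixes the closing tag, but because `p` is matched as a prefix it would also seem to keep tags such as `<pre>` or `<param>`. Below is a minimal sketch of a stricter variant, assuming the goal is to keep exactly img and p (opening and closing forms). The class name `TagStripper` and method name `KeepOnlyImgAndP` are made up for illustration and are not from the thread.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagStripper
{
    // Matches any tag whose name is not exactly "img" or "p".
    //   /?   optional slash, so closing tags like </p> are protected too
    //   \b   word boundary, so <pre>, <param>, <picture> are still removed
    // Assumption: attribute values do not contain '<' or '>'.
    private static readonly Regex UnwantedTags = new Regex(
        @"<(?!/?(?:img|p)\b)[^<>]*>",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: removes every tag except <p>, </p> and <img>.
    public static string KeepOnlyImgAndP(string html)
    {
        return UnwantedTags.Replace(html ?? string.Empty, string.Empty);
    }
}
```

Usage would look like `htmlCode = TagStripper.KeepOnlyImgAndP(htmlCode);`. To keep br as well, as the pattern in the question tried to, add it to the group: `(?:img|p|br)`. For messy real-world markup an HTML parser is usually more reliable than a regex, so treat this as a sketch rather than a complete solution.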
