Remove <style>...</style> tags using html5lib or bleach
I've been using the excellent bleach library for removing bad HTML.
I've got a load of HTML documents which have been pasted in from Microsoft Word, and contain things like:
<STYLE> st1:*{behavior:url(#ieooui) } </STYLE>
Using bleach (with the style tag implicitly disallowed) leaves me with:
st1:*{behavior:url(#ieooui) }
Which isn't helpful. Bleach seems to offer only two options:
- escape tags;
- strip tags (but not their contents).
I'm looking for a third option - remove the tags and their contents.
Is there any way to use bleach or html5lib to completely remove the style tag and its contents? The html5lib documentation isn't much help.
It turned out lxml was a better tool for this task:
# Note: from lxml 5.2 onward this module ships separately as the
# lxml_html_clean package.
from lxml.html.clean import Cleaner


def clean_word_text(text):
    # The only thing I need Cleaner for is to clear out the contents of
    # <style>...</style> tags
    cleaner = Cleaner(style=True)
    return cleaner.clean_html(text)
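
If lxml isn't available, the same "remove the tag and its contents" behaviour can be approximated with only the standard library. This is a rough sketch built on html.parser, not something bleach or html5lib provides; the names StyleStripper and strip_style_elements are made up for illustration. It assumes reasonably well-formed input, and it drops comments and doctype declarations for brevity. It does no sanitizing beyond removing style elements.

```python
from html.parser import HTMLParser


class StyleStripper(HTMLParser):
    """Re-emit HTML as parsed, skipping <style> elements and their contents."""

    def __init__(self):
        # Keep entity and character references as separate events so they
        # can be written back out unchanged.
        super().__init__(convert_charrefs=False)
        self.parts = []
        self.in_style = False

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, so <STYLE> matches too.
        if tag == "style":
            self.in_style = True
        else:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>; a self-closing <style/> is dropped.
        if tag != "style":
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "style":
            self.in_style = False
        else:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Text inside <style> is delivered here as raw data; skip it.
        if not self.in_style:
            self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")


def strip_style_elements(text):
    """Return text with every <style> element and its contents removed."""
    parser = StyleStripper()
    parser.feed(text)
    parser.close()
    return "".join(parser.parts)
```

Calling strip_style_elements on a Word-pasted fragment should leave the surrounding markup in place and discard the st1 rule together with its enclosing tag. The caveat is that this only removes style elements; disallowed tags and attributes would still need to go through bleach afterwards.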