Remove all styles, scripts, and HTML tags from an HTML page
Question:
Here is what I have so far:
from bs4 import BeautifulSoup

def cleanme(html):
    # create a new bs4 object from the html data loaded;
    # the parser is named explicitly so the result does not depend on what is installed
    soup = BeautifulSoup(html, "html.parser")
    for script in soup(["script"]):
        script.extract()  # detach each <script> element from the tree
    text = soup.get_text()
    return text

testhtml = "<!DOCTYPE HTML>\n<head>\n<title>THIS IS AN EXAMPLE </title><style>.call {font-family:Arial;}</style><script>getit</script><body>I need this text captured<h1>And this</h1></body>"
cleaned = cleanme(testhtml)
print(cleaned)
This takes care of removing the scripts.
Answer:
It looks like you almost have it. You also need to remove the HTML tags and the CSS styling code. Here is my solution (an updated version of your function):
from bs4 import BeautifulSoup

def cleanMe(html):
    soup = BeautifulSoup(html, "html.parser")  # create a new bs4 object from the html data loaded
    for script in soup(["script", "style"]):  # remove all javascript and stylesheet code
        script.extract()
    # get text
    text = soup.get_text()
    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each (split on a double space, not on every word)
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)
    return text
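
If installing BeautifulSoup is not an option, the same idea (skip everything inside script and style elements, keep the remaining text, drop blank lines) can be sketched with only the standard library's html.parser module. This is a different technique from the bs4 function above, not a drop-in equivalent, and the names TextOnlyParser and clean_with_stdlib are made up for this illustration:

```python
from html.parser import HTMLParser


class TextOnlyParser(HTMLParser):
    """Collects text nodes, skipping anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # greater than 0 while inside a skipped element
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is outside <script> and <style>
        if self._skip_depth == 0:
            self._parts.append(data)

    def text(self):
        return "".join(self._parts)


def clean_with_stdlib(html):
    parser = TextOnlyParser()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    # strip each line and drop the blank ones
    lines = (line.strip() for line in parser.text().splitlines())
    return "\n".join(line for line in lines if line)


sample = "<style>p {color: red;}</style><script>track()</script><p>Keep this</p>"
print(clean_with_stdlib(sample))  # prints: Keep this
```

Like get_text() with its default arguments, this sketch joins adjacent text nodes without a separator, so text from neighbouring elements can run together. It also does less error recovery on malformed markup than BeautifulSoup does.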