How to remove empty lines from HTML in Python
I have a problem. I removed some tags from an HTML file, but I want the output to have no empty lines. Like this one:
<!DOCTYPE html>
<html itemscope="itemscope" itemtype="http://schema.org/WebPage" lang="id-ID">
<head>
<title>Kenya Kasat Narkoba Polres Bintan Diganti? Ini Pesan Kapolres melada Kasatreskrim Baru - Tribun Batam</title>
</head>
<body id="bodyart">
<div id="skinads" style="position:fixed;width:100%;">
<div class="main">
<div class="f1" style="height:600px;width:90px;left:-97px:position:relative;text-align:right;z-index:999999">
<div id="div-Left-Skin" style="width:90px; height:600px;display:none">
</div>
</div>
<div class="fr" style="height:600px;width;90px;right:-97px;position:relative;text-align:left;z-index:999999">
<div id="div-Right-Skin" style="width:90px; height:600px;display:none">
</div>
</div>
</div>
<div class="cl2"></div>
</div>
<div id="fb-root"></div>
My expected output is:
<!DOCTYPE html>
<html itemscope="itemscope" itemtype="http://schema.org/WebPage" lang="id-ID">
<head>
<title>Kenya Kasat Narkoba Polres Bintan Diganti? Ini Pesan Kapolres melada Kasatreskrim Baru - Tribun Batam</title>
</head>
<body id="bodyart">
<div id="skinads" style="position:fixed;width:100%;">
<div class="main">
<div class="f1" style="height:600px;width:90px;left:-97px:position:relative;text-align:right;z-index:999999">
<div id="div-Left-Skin" style="width:90px; height:600px;display:none">
</div>
</div>
<div class="fr" style="height:600px;width;90px;right:-97px;position:relative;text-align:left;z-index:999999">
<div id="div-Right-Skin" style="width:90px; height:600px;display:none">
</div>
</div>
</div>
<div class="cl2"></div>
</div>
<div id="fb-root"></div>
How can I remove the empty lines from the HTML? Can I use BeautifulSoup, or some other library?
Update
I tried to combine my code with @elethan's answer, but I got an error.
My code is:
from list import get_filepaths
from bs4 import BeautifulSoup
from bs4 import Comment
filenames = get_filepaths(r"C:\Coba")
index = 0
for f in filenames:
    file_html=open(str(f),"r")
    soup = BeautifulSoup(file_html,"html.parser")
    [x.extract() for x in soup.find_all('script')]
    [x.extract() for x in soup.find_all('style')]
    [x.extract() for x in soup.find_all('meta')]
    [x.extract() for x in soup.find_all('noscript')]
    [x.extract() for x in soup.find_all(text=lambda text:isinstance(text, Comment))]
    index += 1
    stored_file = "PreProcessing\extracts" + '{0:03}'.format(index) + ".html"
    filewrite = open(stored_file, "w")
    filewrite.write(str(soup) + '\n')
    with open(stored_file, 'r+') as f:
        lines = [i for i in f.readlines() if i and i != '\n']
        f.seek(0)
        f.writelines(lines)
        f.truncate()
    filewrite.close
But I got output like this (sorry, I can't paste the HTML). It actually looks fine at the beginning, but towards the end it is almost all NUL NUL NUL (as in the screenshot).
How do I remove the NUL NUL NUL?
In your code, first remove all the extra newlines from the file:
with open(my_html_file) as f:
    lines = [i for i in f.readlines() if i and i != '\n']
Then write the filtered text back to the file:
with open(my_html_file, 'w') as f:
    f.writelines(lines)
Or, to do the whole thing in a single with block:
with open(my_html_file, 'r+') as f:
    lines = [i for i in f.readlines() if i and i != '\n']
    f.seek(0)
    f.writelines(lines)
    f.truncate()
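Note that the i != '\n' test only drops lines that are exactly a newline; a line holding just spaces or tabs would survive. A minimal sketch of a stricter filter, assuming whitespace-only lines should also be removed (the function name drop_blank_lines is hypothetical, not part of any library):

```python
def drop_blank_lines(lines):
    """Return only the lines that contain something other than whitespace."""
    # str.strip() returns '' for a whitespace-only line, and '' is falsy.
    return [line for line in lines if line.strip()]


sample = ["<head>\n", "\n", "   \n", "\t\n", "</head>\n"]
print(drop_blank_lines(sample))  # ['<head>\n', '</head>\n']
```

The same list comprehension can be dropped into any of the with blocks above in place of the i != '\n' version.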
Depending on your existing code (which you should add to your question), you might be able to simply add the filtering part of my code to what you have.
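In the updated code in the question, the output file is still open for writing when it is reopened with 'r+' (filewrite.close is missing its parentheses, so it is never called). Data still buffered in the first handle can then be flushed after truncate() has shortened the file, which is a likely source of the NUL bytes near the end. One way around this is to filter the text in memory and write the file only once. A rough sketch, assuming each document fits comfortably in memory; write_without_blank_lines and the demo path are hypothetical names, and html_text stands in for str(soup):

```python
import os
import tempfile


def write_without_blank_lines(html_text, path):
    """Write html_text to path, skipping empty and whitespace-only lines."""
    kept = [line for line in html_text.splitlines() if line.strip()]
    # The with block flushes and closes the file on exit,
    # so nothing is left buffered in a second handle.
    with open(path, "w", encoding="utf-8") as out:
        out.write("\n".join(kept) + "\n")


# Stand-in for str(soup) inside the question's loop.
html_text = "<head>\n\n<title>t</title>\n   \n</head>\n"
demo_path = os.path.join(tempfile.mkdtemp(), "extracts001.html")
write_without_blank_lines(html_text, demo_path)

with open(demo_path, encoding="utf-8") as f:
    result = f.read()
print(result)
```

In the question's loop this would replace everything from filewrite = open(...) down to filewrite.close with a single call such as write_without_blank_lines(str(soup), stored_file).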