Replace <br> with a space in BeautifulSoup output

Question:

I am scraping a few links with BeautifulSoup; however, it seems to completely ignore <br> tags.

Here is the relevant portion of the source code of the URL I am scraping:

    <h1 class="para-title">A quick brown fox jumps over<br>the lazy dog
    <span id="something">&#xe800;</span></h1>

Here is my BeautifulSoup code (relevant part only) to get the text within the h1 tag:

    soup = BeautifulSoup(page, 'html.parser')
    title_box = soup.find('h1', attrs={'class': 'para-title'})
    title = title_box.text.strip()
    print(title)

This gives the following output:

    A quick brown fox jumps overthe lazy dog

What I want is:

    A quick brown fox jumps over the lazy dog

How can I replace <br> with a space in my code?

Answer:

How about using .get_text() with the separator parameter?

    from bs4 import BeautifulSoup

    page = '''<h1 class="para-title">A quick brown fox jumps over<br>the lazy dog
    <span>some stuff here</span></h1>'''

    soup = BeautifulSoup(page, 'html.parser')
    title_box = soup.find('h1', attrs={'class': 'para-title'})
    title = title_box.get_text(separator=" ").strip()
    print(title)

Output:

    A quick brown fox jumps over the lazy dog
     some stuff here
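
Note that the separator is inserted between every text fragment, and the newline before the <span> is still part of the result. If a single clean line is wanted, two variations may help. This is only a sketch on the same sample markup; the variable names one_line and collapsed are made up for illustration. The first passes strip=True to .get_text(); the second replaces only the <br> tags with a space and then collapses the remaining whitespace.

```python
from bs4 import BeautifulSoup

page = '''<h1 class="para-title">A quick brown fox jumps over<br>the lazy dog
<span>some stuff here</span></h1>'''

soup = BeautifulSoup(page, 'html.parser')
title_box = soup.find('h1', attrs={'class': 'para-title'})

# Variant 1: strip each text fragment before joining, so the newline
# in front of <span> does not survive into the result.
one_line = title_box.get_text(separator=" ", strip=True)

# Variant 2: replace only the <br> tags with a space, then collapse
# every run of whitespace (newlines included) into a single space.
for br in title_box.find_all("br"):
    br.replace_with(" ")
collapsed = " ".join(title_box.get_text().split())

print(one_line)
print(collapsed)
```

The two variants agree on this sample, but they can differ on other markup: the separator in variant 1 is placed between all text nodes (so text split by inline tags such as <b> also gets a space), while variant 2 adds a space only where a <br> was.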
