Why does BS4 return tags, and then an empty list, from this find_all() method?
Looking at US Census QFD I'm trying to grab the race % by county. The loop I'm building is outside the scope of my question, which concerns this code:
import urllib2
from bs4 import BeautifulSoup

url = 'http://quickfacts.census.gov/qfd/states/48/48507.html'
#last county in TX; for some reason the qfd #'s counties w/ only odd numbers
page = urllib2.urlopen(url)
soup = BeautifulSoup(page)
c_black_alone = soup.find_all("td", attrs={'headers':'rp9'})[0] #c = county %
s_black_alone = soup.find_all("td", attrs={'headers':'rp9'})[1] #s = state %
Which grabs the html element including its tags, not just the text within it:
c_black_alone, s_black_alone
(<td align="right" headers="rp9 p1" valign="bottom">96.9%<sup></sup></td>,
<td align="right" headers="rp9 p2" valign="bottom">80.3%<sup></sup></td>)
Above ^, I only want the %'s inside the elements...
Also, why does
test_black = soup.find_all("td", text = "Black")
not return the same element as above (or its text), but instead return an empty bs4 ResultSet object? (Edit: I have been following along with the documentation, so I hope this question doesn't seem too vague...)
To get just the text from those matches, use the .text attribute, which returns all contained text:
>>> soup.find_all("td", attrs={'headers':'rp9'})[0].text
u'96.9%'
>>> soup.find_all("td", attrs={'headers':'rp9'})[1].text
u'80.3%'
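If the numbers themselves are needed rather than the strings, the text can be converted after stripping the percent sign. A minimal sketch, assuming an inline HTML snippet copied from the output in the question in place of the live page, and a hypothetical helper named percent_to_float:

```python
from bs4 import BeautifulSoup

# Assumption: stand-in for the census page, using the two cells shown in the question.
html = """
<table><tr>
<td align="right" headers="rp9 p1" valign="bottom">96.9%<sup></sup></td>
<td align="right" headers="rp9 p2" valign="bottom">80.3%<sup></sup></td>
</tr></table>
"""
soup = BeautifulSoup(html, "html.parser")


def percent_to_float(cell):
    """Hypothetical helper: turn a cell such as '96.9%' into the float 96.9."""
    return float(cell.get_text(strip=True).rstrip("%"))


# Search once and reuse the result instead of calling find_all() twice.
cells = soup.find_all("td", attrs={"headers": "rp9"})
county, state = [percent_to_float(cell) for cell in cells[:2]]
print(county, state)
```

Searching once and indexing into the result also avoids walking the whole document a second time.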
Your text search doesn't match anything for two reasons:
- A literal string only matches the whole contained text, not a partial match. It'll only work for an element with <td>Black</td> as its sole contents.
- It will use the .string property, but that property is only set if the text is the only child of a given element. If there are other elements present, the search will fail entirely.
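Both reasons can be checked on a small document. This is only a sketch: the HTML below is made up for illustration (the ids a, b and c are not from the census page), and it assumes the built-in html.parser is used.

```python
from bs4 import BeautifulSoup

# Assumption: hypothetical cells, one per case discussed above.
html = """
<table>
<tr><td id="a">Black</td></tr>
<tr><td id="b">Black or African American alone, percent</td></tr>
<tr><td id="c">Black <!-- note --> </td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# A literal string has to equal the whole .string of the element.
exact = soup.find_all("td", text="Black")
exact_ids = [td["id"] for td in exact]
print(exact_ids)

# Cell b has a single text child, so .string is set, but it is longer than "Black".
string_b = soup.find(id="b").string

# Cell c has several children (text, comment, text), so .string is None.
string_c = soup.find(id="c").string
print(repr(string_b), repr(string_c))
```

Only cell a is returned by the literal search: cell b fails on the whole-text comparison, and cell c fails because it has no .string at all.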
The way around this is by using a lambda instead; it'll be passed the whole element and you can validate each element:
soup.find_all(lambda e: e.name == 'td' and 'Black' in e.text)
Demo:
>>> soup.find_all(lambda e: e.name == 'td' and 'Black' in e.text)
[<td id="rp10" valign="top">Black or African American alone, percent, 2013 (a) <!-- RHI225213 --> </td>, <td id="re6" valign="top">Black-owned firms, percent, 2007 <!-- SBO315207 --> </td>]
Both of these matches have a comment in the <td> element, making a search with a text match ineffective.
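The lambda search can be tried without fetching the page. The sketch below assumes an inline snippet built from the two cells in the demo output plus one made-up filler row, and wraps the lambda in a hypothetical helper named find_cells_containing:

```python
from bs4 import BeautifulSoup

# Assumption: the first two cells come from the demo output; the third is invented filler.
html = """
<table>
<tr><td id="rp10" valign="top">Black or African American alone, percent, 2013 (a) <!-- RHI225213 --> </td></tr>
<tr><td id="re6" valign="top">Black-owned firms, percent, 2007 <!-- SBO315207 --> </td></tr>
<tr><td id="other" valign="top">Some other row <!-- X --> </td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")


def find_cells_containing(soup, needle):
    """Hypothetical helper: all <td> elements whose combined text contains needle."""
    return soup.find_all(lambda e: e.name == "td" and needle in e.text)


matches = find_cells_containing(soup, "Black")
ids = [td["id"] for td in matches]
labels = [td.get_text(strip=True) for td in matches]
print(ids)
print(labels)
```

Because the function receives the whole element, the comment nodes that leave .string unset do not get in the way of the substring test.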