Why does BS4 return tags, and then an empty list, for this find_all() method?

Question:

Looking at US Census QFD I'm trying to grab the race % by county. The loop I'm building is outside the scope of my question, which concerns this code:

import urllib2  # Python 2; in Python 3 use urllib.request instead
from bs4 import BeautifulSoup

url = 'http://quickfacts.census.gov/qfd/states/48/48507.html'
#last county in TX; for some reason the qfd #'s counties w/ only odd numbers
page = urllib2.urlopen(url)
soup = BeautifulSoup(page)

c_black_alone = soup.find_all("td", attrs={'headers':'rp9'})[0] #c = county %
s_black_alone = soup.find_all("td", attrs={'headers':'rp9'})[1] #s = state %

Which grabs the html element including its tags, not just the text within it:

c_black_alone, s_black_alone

(<td align="right" headers="rp9 p1" valign="bottom">96.9%<sup></sup></td>,
 <td align="right" headers="rp9 p2" valign="bottom">80.3%<sup></sup></td>)

Above ^, I only want the %'s inside the elements...

Also, why does

test_black = soup.find_all("td", text = "Black")

not return the same element as above (or its text), but instead return an empty bs4 ResultSet object? (Edit: I have been following along with the documentation, so I hope this question doesn't seem too vague...)

To get the text from those matches, use .text to get all contained text:

>>> soup.find_all("td", attrs={'headers':'rp9'})[0].text
u'96.9%'
>>> soup.find_all("td", attrs={'headers':'rp9'})[1].text
u'80.3%'
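
As a rough sketch of the same idea on a self-contained snippet: the two-cell table below is a made-up stand-in for the Census page, and Python 3 with the built-in html.parser is assumed. get_text(strip=True) returns the same text as .text with surrounding whitespace removed, and the float conversion is just one possible follow-up step.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the two cells on the Census page.
html = """
<table><tr>
<td align="right" headers="rp9 p1" valign="bottom">96.9%<sup></sup></td>
<td align="right" headers="rp9 p2" valign="bottom">80.3%<sup></sup></td>
</tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# "headers" holds several space-separated values; matching on one of them is enough.
cells = soup.find_all("td", attrs={"headers": "rp9"})

# Pull only the text out of each cell, without the surrounding tags.
percentages = [cell.get_text(strip=True) for cell in cells]

# Optional: turn "96.9%" into the number 96.9.
values = [float(p.rstrip("%")) for p in percentages]

print(percentages, values)
```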

Your text search doesn't match anything for two reasons:


  1. A literal string only matches the whole contained text, not part of it. It'll only work for an element such as <td>Black</td>, where "Black" is the sole content.
  2. It uses the .string property, but that property is only set if the text is the only child of a given element. If other elements (or comments) are present, .string is None and the search fails entirely.
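
Both reasons can be seen on a small made-up table; this is only a sketch, assuming Python 3 and bs4 4.4 or later, where string= is the newer spelling of the text= argument. The ids "exact", "longer", and "mixed" are hypothetical names used for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical cells: one exact match, one longer text, one text plus a comment.
html = (
    '<table><tr>'
    '<td id="exact">Black</td>'
    '<td id="longer">Black-owned firms</td>'
    '<td id="mixed">Black or African American alone <!-- RHI225213 --> </td>'
    '</tr></table>'
)
soup = BeautifulSoup(html, "html.parser")

# A literal string must equal the element's whole .string, so only "exact" matches.
matched_ids = [td["id"] for td in soup.find_all("td", string="Black")]

# .string is set when the text is the only child...
longer_string = soup.find("td", id="longer").string

# ...and is None as soon as there is more than one child (here: text, comment, text).
mixed_string = soup.find("td", id="mixed").string

print(matched_ids, repr(longer_string), mixed_string)
```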

The way around this is by using a lambda instead; it'll be passed the whole element and you can validate each element:

soup.find_all(lambda e: e.name == 'td' and 'Black' in e.text)

Demo:

>>> soup.find_all(lambda e: e.name == 'td' and 'Black' in e.text)
[<td id="rp10" valign="top">Black or African American alone, percent, 2013 (a)  <!-- RHI225213 --> </td>, <td id="re6" valign="top">Black-owned firms, percent, 2007  <!-- SBO315207 --> </td>]

Both of these matches have a comment in the <td> element, making a search with a text match ineffective.
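
For a version that runs without fetching the page, here is a sketch of the same function-based filter on a made-up table (Python 3 and html.parser assumed). The helper name is_black_cell and the extra "White alone" row are illustrative only; the two matching cells are copied from the demo output above.

```python
from bs4 import BeautifulSoup

def is_black_cell(element):
    """Return True for <td> elements whose combined text contains 'Black'."""
    return element.name == "td" and "Black" in element.get_text()

# Hypothetical sample; two of the cells mirror the demo output above.
html = (
    '<table>'
    '<tr><td id="rp10" valign="top">Black or African American alone, percent, '
    '2013 (a)  <!-- RHI225213 --> </td></tr>'
    '<tr><td id="rp11" valign="top">White alone, percent, 2013 (a)</td></tr>'
    '<tr><td id="re6" valign="top">Black-owned firms, percent, 2007  '
    '<!-- SBO315207 --> </td></tr>'
    '</table>'
)
soup = BeautifulSoup(html, "html.parser")

# find_all() calls the function once per tag and keeps the tags it accepts.
matches = soup.find_all(is_black_cell)
matched_ids = [td["id"] for td in matches]

# Collapse runs of whitespace to get tidy labels for display.
labels = [" ".join(td.get_text().split()) for td in matches]
print(matched_ids)
print(labels)
```

A named function behaves exactly like the lambda; it is only easier to reuse and to extend with further checks.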