Why does BS4 return tags, and then an empty list, for this find_all() method?

Question:

Looking at US Census QFD I'm trying to grab the race % by county. The loop I'm building is outside the scope of my question, which concerns this code:

import urllib2  # Python 2 module; on Python 3 use urllib.request instead
from bs4 import BeautifulSoup

url = 'http://quickfacts.census.gov/qfd/states/48/48507.html'
# last county in TX; for some reason the QFD numbers counties with only odd numbers
page = urllib2.urlopen(url)
soup = BeautifulSoup(page)

c_black_alone = soup.find_all("td", attrs={'headers':'rp9'})[0] #c = county %
s_black_alone = soup.find_all("td", attrs={'headers':'rp9'})[1] #s = state %

Which grabs the html element including its tags, not just the text within it:

c_black_alone, s_black_alone

(<td align="right" headers="rp9 p1" valign="bottom">96.9%<sup></sup></td>,
 <td align="right" headers="rp9 p2" valign="bottom">80.3%<sup></sup></td>)

Above ^, I only want the %'s inside the elements...

Also, why does

test_black = soup.find_all("td", text = "Black")

not return the same element as above (or its text), but instead return an empty bs4 ResultSet object? (I have been following along with the documentation, so I hope this question doesn't seem too vague...)

Answer:

To get the text from those matches, use .text to get all contained text:

>>> soup.find_all("td", attrs={'headers':'rp9'})[0].text
u'96.9%'
>>> soup.find_all("td", attrs={'headers':'rp9'})[1].text
u'80.3%'
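
For anyone who wants to try this without hitting the Census site, here is a minimal sketch that parses an inline copy of the two cells from the question. It assumes Python 3 and the built-in "html.parser" backend; the surrounding table markup is made up for illustration. It uses get_text(strip=True), which behaves like .text but also trims whitespace.

```python
from bs4 import BeautifulSoup

# Inline copy of the two cells shown in the question (the <table>/<tr>
# wrapper is an assumption added so the snippet is valid HTML).
html = """
<table><tr>
<td align="right" headers="rp9 p1" valign="bottom">96.9%<sup></sup></td>
<td align="right" headers="rp9 p2" valign="bottom">80.3%<sup></sup></td>
</tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# 'headers' is parsed as a multi-valued attribute on <td>, so searching for
# 'rp9' matches both "rp9 p1" and "rp9 p2".
cells = soup.find_all("td", attrs={"headers": "rp9"})

# get_text(strip=True) strips whitespace from each text fragment before joining.
county_pct, state_pct = [cell.get_text(strip=True) for cell in cells]
print(county_pct, state_pct)
```

Calling find_all() once and unpacking the result also avoids running the same search twice.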

Your text search doesn't match anything for two reasons:

A literal string only matches the whole contained text, not a partial match. It will only work for an element such as <td>Black</td>, where the text is the sole content.
The search uses the .string property, but that property is only set if the text is the only child of a given element. If other nodes are present (child elements or comments), .string is None and the search will fail entirely.
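
A small sketch of both points, using made-up HTML rather than the real page. It assumes Python 3, the "html.parser" backend, and Beautiful Soup 4.4.0 or later, where the text argument is also available under the name string (used here).

```python
from bs4 import BeautifulSoup

# Hypothetical cells: one holds only the text "Black", the other holds
# text plus a comment, like the label cells on the Census page.
html = """
<table><tr>
<td id="a">Black</td>
<td id="b">Black or African American alone <!-- RHI225213 --> </td>
</tr></table>
"""
soup = BeautifulSoup(html, "html.parser")
plain, mixed = soup.find_all("td")

print(repr(plain.string))  # the text node is the only child, so .string is set
print(repr(mixed.string))  # text, comment, text: three children, so .string is None

# An exact string search only finds the cell whose .string equals "Black".
exact = soup.find_all("td", string="Black")
print([td["id"] for td in exact])
```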

The way around this is by using a lambda instead; it'll be passed the whole element and you can validate each element:

soup.find_all(lambda e: e.name == 'td' and 'Black' in e.text)

Demo:

>>> soup.find_all(lambda e: e.name == 'td' and 'Black' in e.text)
[<td id="rp10" valign="top">Black or African American alone, percent, 2013 (a)  <!-- RHI225213 --> </td>, <td id="re6" valign="top">Black-owned firms, percent, 2007  <!-- SBO315207 --> </td>]

Both of these matches have a comment in the <td> element, making a search with a text match ineffective.
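
The same lambda search can be tried offline. This is only a sketch: the two matching cells are copied from the demo output above, while the third row and the cells_containing helper are hypothetical names added for illustration. It assumes Python 3 and the "html.parser" backend.

```python
from bs4 import BeautifulSoup

# Two label cells copied from the demo output, plus one made-up
# non-matching cell (id "x1") for contrast.
html = """
<table>
<tr><td id="rp10" valign="top">Black or African American alone, percent, 2013 (a)  <!-- RHI225213 --> </td></tr>
<tr><td id="re6" valign="top">Black-owned firms, percent, 2007  <!-- SBO315207 --> </td></tr>
<tr><td id="x1" valign="top">Some other label  <!-- placeholder --> </td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")


def cells_containing(soup, word):
    """Hypothetical helper: return every <td> whose text contains `word`."""
    # The function passed to find_all() receives each tag in turn, so it can
    # check the tag name and do a substring test on the full text.
    return soup.find_all(lambda e: e.name == "td" and word in e.get_text())


matches = cells_containing(soup, "Black")
print([td["id"] for td in matches])
```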
