Collecting data from Wikipedia using Python

Problem description:

I am trying to retrieve 3 columns (NFL Team, Player Name, College Team) from the following Wikipedia page. I am new to Python and have been trying to use BeautifulSoup to get this done. I only need the rows that belong to QBs, but I haven't even been able to get all the columns regardless of position. This is what I have so far, and it outputs nothing; I'm not entirely sure why. I believe it is due to the a tags, but I do not know what to change. Any help would be greatly appreciated.

wiki = "http://en.wikipedia.org/wiki/2008_NFL_draft"
header = {'User-Agent': 'Mozilla/5.0'} #Needed to prevent 403 error on Wikipedia
req = urllib2.Request(wiki,headers=header)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)

rnd = ""
pick = ""
NFL = ""
player = ""
pos = ""
college = ""
conf = ""
notes = ""

table = soup.find("table", { "class" : "wikitable sortable" })

#print table

#output = open('output.csv','w')

for row in table.findAll("tr"):
    cells = row.findAll("href")
    print "---"
    print cells.text
    print "---"
    #For each "tr", assign each "td" to a variable.
    #if len(cells) > 1:
        #NFL = cells[1].find(text=True)
        #player = cells[2].find(text = True)
        #pos = cells[3].find(text=True)
        #college = cells[4].find(text=True)
        #write_to_file = player + " " + NFL + " " + college + " " + pos
        #print write_to_file

    #output.write(write_to_file)

#output.close()

I know a lot of it is commented out because I was trying to find where the breakdown was.
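
The immediate reason the loop prints nothing is that href is an attribute of the a tag, not a tag name, so row.findAll("href") always returns an empty list. A minimal sketch of a cell lookup that does return something (the column index 1 here is purely illustrative):

for row in table.findAll("tr"):
    cells = row.findAll("td")          # grab the actual cell tags, not an attribute name
    if len(cells) > 1:
        print(cells[1].text)           # text of the whole cell
        link = cells[1].find("a")      # links live inside a tags within the cell
        if link is not None:
            print(link["href"])        # the link target, if the cell contains one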

Here is what I would do:

  • find the Player Selections paragraph
  • get the next wikitable using find_next_sibling()
  • find all tr tags inside
  • for every row, find td and th tags and get the desired cells by index

Here is the code:

import urllib2
from bs4 import BeautifulSoup

wiki = "http://en.wikipedia.org/wiki/2008_NFL_draft"
header = {'User-Agent': 'Mozilla/5.0'}  # needed to prevent a 403 from Wikipedia
soup = BeautifulSoup(urllib2.urlopen(urllib2.Request(wiki, headers=header)))

filter_position = 'QB'

# the "Player selections" heading sits right before the table we want
player_selections = soup.find('span', id='Player_selections').parent

# skip the header row, then walk every row of the following wikitable
for row in player_selections.find_next_sibling('table', class_='wikitable').find_all('tr')[1:]:
    cells = row.find_all(['td', 'th'])

    try:
        # cells 3-6 hold NFL team, player name, position and college
        nfl_team, name, position, college = cells[3].text, cells[4].text, cells[5].text, cells[6].text
    except IndexError:
        # rows without enough cells (e.g. separator rows) are skipped
        continue

    if position != filter_position:
        continue

    print nfl_team, name, position, college

And here is the output (filtered to quarterbacks only):

Atlanta Falcons Ryan, MattMatt Ryan† QB Boston College
Baltimore Ravens Flacco, JoeJoe Flacco QB Delaware
Green Bay Packers Brohm, BrianBrian Brohm QB Louisville
Miami Dolphins Henne, ChadChad Henne QB Michigan
New England Patriots O'Connell, KevinKevin O'Connell QB San Diego State
Minnesota Vikings Booty, John DavidJohn David Booty QB USC
Pittsburgh Steelers Dixon, DennisDennis Dixon QB Oregon
Tampa Bay Buccaneers Johnson, JoshJosh Johnson QB San Diego
New York Jets Ainge, ErikErik Ainge QB Tennessee
Washington Redskins Brennan, ColtColt Brennan QB Hawaiʻi
New York Giants Woodson, Andre'Andre' Woodson QB Kentucky
Green Bay Packers Flynn, MattMatt Flynn QB LSU
Houston Texans Brink, AlexAlex Brink QB Washington State
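
The names in this output carry Wikipedia's hidden sort key ("Ryan, Matt") glued onto the visible text. One way around that, assuming each name cell contains a link, is to take the text of the first a tag rather than the whole cell. Below is a hedged sketch that does that and writes the three columns to output.csv (the file name is borrowed from the commented-out code in the question); it reuses player_selections and filter_position from above:

import csv

rows = []
for row in player_selections.find_next_sibling('table', class_='wikitable').find_all('tr')[1:]:
    cells = row.find_all(['td', 'th'])
    try:
        nfl_team, position, college = cells[3].text, cells[5].text, cells[6].text
        # the name cell holds a hidden sort key before the visible name,
        # so prefer the link text ("Matt Ryan") over the raw cell text
        name_link = cells[4].find('a')
        name = name_link.text if name_link is not None else cells[4].text
    except IndexError:
        continue
    if position != filter_position:
        continue
    # encode for Python 2's csv module, which expects byte strings
    rows.append([c.strip().encode('utf-8') for c in (nfl_team, name, college)])

with open('output.csv', 'wb') as output:
    writer = csv.writer(output)
    writer.writerow(['NFL Team', 'Player Name', 'College Team'])
    writer.writerows(rows)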