Extracting information from a webpage with Python

Question:

Is it possible to use Python to extract the scores and goals for/against from a webpage like http://www.uscho.com/standings/division-i-men/2011-2012/ ? My problem is that the tables have an odd structure. Is there any resource that could help me with this?

Use lxml.

Here's a basic script to get you started:

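from urllib.request import urlopen  # Python 3; the original used Python 2's urllib2
from lxml import etree

url = 'http://www.uscho.com/standings/division-i-men/2011-2012/'

# Parse the raw HTML into an element tree.
tree = etree.HTML(urlopen(url).read())

# Each conference sits in its own <section> whose id starts with "section_".
for section in tree.xpath('//section[starts-with(@id, "section_")]'):
    # The first <h3> in the section holds the conference name.
    print(section.xpath('h3[1]/text()')[0])
    for row in section.xpath('table/tbody/tr'):
        # Collect every text node in the row's cells: team name first, then the stats.
        cols = row.xpath('td//text()')
        print('  ', cols[0].ljust(25), ' '.join(cols[1:]))
    print()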

Output:

Atlantic Hockey
   Air Force                 8 2 1 .773 17 40-26 9 4 2 .667 53-36 6 0 1 3 3 1
   Mercyhurst                6 1 2 .778 14 21-15 7 7 2 .500 36-49 5 1 1 2 4 1
   RIT                       5 3 2 .600 12 24-20 6 5 2 .538 30-32 5 2 2 1 3 0
   Robert Morris             5 2 1 .688 11 31-20 7 6 1 .536 44-43 3 2 1 3 3 0
   Bentley                   4 3 2 .556 10 25-18 4 8 3 .367 35-43 1 2 2 3 6 1
   Canisius                  4 3 2 .556 10 16-17 4 8 3 .367 23-41 2 2 1 2 6 2
   Holy Cross                5 4 0 .556 10 28-26 7 7 0 .500 40-47 5 1 0 2 6 0
   Niagara                   3 2 4 .556 10 25-22 4 5 5 .464 36-39 1 2 2 3 3 3
   Connecticut               4 5 1 .450 9 30-24 5 8 2 .400 41-42 3 1 0 1 7 2
   American International    2 7 2 .273 6 24-36 3 12 2 .235 35-58 1 4 2 2 8 0
   Army                      1 5 4 .300 6 20-33 1 7 6 .286 26-47 0 4 2 1 3 3
   Sacred Heart              0 10 1 .045 1 30-57 1 14 1 .094 39-86 0 5 1 0 9 0

CCHA
   Ohio State                9 2 1 1 .792 29 42-26 12 3 1 .781 53-31 6 1 1 6 2 0
   Notre Dame                7 2 3 0 .708 24 36-28 10 5 3 .639 55-50 6 3 0 4 2 3
   Western Michigan          6 4 2 2 .583 22 33-28 8 4 4 .625 49-34 5 2 1 3 2 3
   Lake Superior             6 5 1 1 .542 20 31-32 10 6 2 .611 46-43 5 3 0 5 3 2
   Ferris State              6 5 1 0 .542 19 28-27 10 5 1 .656 43-30 5 1 1 5 4 0
   Michigan State            6 4 0 0 .600 18 32-23 10 5 1 .656 56-41 6 1 1 3 3 0
   Northern Michigan         4 5 3 2 .458 17 28-31 7 6 3 .531 41-40 6 1 3 1 5 0
   Miami                     4 6 2 1 .417 15 26-31 8 8 2 .500 48-48 3 3 2 4 5 0
   Michigan                  4 6 2 1 .417 15 36-32 8 8 2 .500 64-47 7 5 0 1 3 2
   Alaska                    4 8 2 0 .357 14 26-33 7 9 2 .444 39-41 4 5 1 2 3 1
   Bowling Green             1 10 1 1 .125 5 14-41 6 10 2 .389 32-49 3 6 1 3 4 1

D-I Independent
   Alabama-Huntsville        0 0 0 .000 0 - 1 15 1 .088 16-67 1 8 1 0 7 0

ECAC
   Cornell                   6 1 1 .812 13 26-11 7 3 1 .682 32-18 4 1 1 3 1 0
   Colgate                   6 2 0 .750 12 28-15 11 4 1 .719 55-36 5 2 0 5 2 0
   Clarkson                  3 4 2 .444 8 19-18 9 6 4 .579 55-37 6 2 0 3 3 4
   St. Lawrence              4 5 0 .444 8 16-22 5 10 0 .333 31-52 3 6 0 2 4 0
   Union                     3 2 2 .571 8 16-13 7 3 5 .633 49-29 1 2 2 6 1 3
   Yale                      4 2 0 .667 8 19-15 6 4 1 .591 36-31 3 2 0 3 1 0
   Dartmouth                 3 3 1 .500 7 18-22 4 5 1 .450 24-30 3 3 1 1 2 0
   Princeton                 3 5 1 .389 7 23-30 4 7 2 .385 30-39 2 2 1 1 4 0
   Quinnipiac                2 4 3 .389 7 18-22 9 6 3 .583 57-40 6 1 2 3 5 1
   Brown                     3 3 0 .500 6 19-20 4 6 1 .409 24-30 2 2 0 1 4 1
   Harvard                   2 3 2 .429 6 20-21 3 3 3 .500 31-31 2 2 1 1 1 2
   Rensselaer                1 6 0 .143 2 8-21 3 12 0 .200 18-42 2 5 0 1 7 0

Hockey East
   Boston College            9 3 0 .750 18 45-29 12 5 0 .706 63-42 5 3 0 6 2 0
   Boston University         6 4 1 .591 13 37-34 8 5 1 .607 47-43 5 3 0 2 2 1
   Merrimack                 6 2 1 .722 13 23-18 9 2 1 .792 37-20 4 1 1 5 1 0
   Massachusetts-Lowell      6 3 0 .667 12 33-27 9 4 0 .692 46-33 4 1 0 5 2 0
   Providence                6 4 0 .600 12 37-29 8 7 1 .531 51-47 7 2 1 1 3 0
   Maine                     5 5 1 .500 11 37-35 6 6 2 .500 45-44 4 3 0 2 3 2
   New Hampshire             4 6 1 .409 9 31-37 6 8 2 .438 56-56 6 2 0 0 6 2
   Northeastern              3 7 2 .333 8 31-35 6 7 2 .467 46-39 2 2 1 4 5 1
   Massachusetts             2 6 3 .318 7 29-39 4 7 4 .400 47-52 4 0 3 0 7 1
   Vermont                   1 8 1 .150 3 22-42 3 10 1 .250 33-59 2 5 1 1 5 0

WCHA
   Minnesota                 10 2 0 .833 20 43-23 13 4 1 .750 75-36 8 1 0 5 3 1
   Minnesota-Duluth          9 2 1 .792 19 52-27 11 3 2 .750 66-39 7 3 0 4 0 2
   Nebraska-Omaha            6 3 3 .625 15 44-41 8 7 3 .528 60-58 5 2 1 3 4 2
   Colorado College          6 4 0 .600 12 44-36 8 4 0 .667 52-38 5 0 0 3 4 0
   North Dakota              6 6 0 .500 12 37-35 8 7 1 .531 49-48 5 2 1 3 5 0
   Denver                    4 3 3 .550 11 39-34 6 5 3 .536 51-44 5 2 2 1 3 1
   Michigan Tech             5 6 1 .458 11 36-35 8 7 1 .531 48-43 6 3 1 2 4 0
   St. Cloud State           4 5 3 .458 11 36-37 6 8 4 .444 57-58 3 1 3 2 7 1
   Bemidji State             4 6 2 .417 10 32-42 6 8 2 .438 43-52 3 2 1 3 6 1
   Wisconsin                 4 7 1 .375 9 35-43 7 8 1 .469 52-52 7 3 0 0 5 1
   Alaska-Anchorage          2 9 1 .208 5 20-47 5 9 2 .375 37-56 2 5 1 1 4 1
   Minnesota State           2 9 1 .208 5 34-52 3 12 1 .219 39-64 1 4 1 2 8 0
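
To pull out just the goals for/against, one option is to filter each row's `cols` list for tokens shaped like `40-26`. The sketch below is not part of the original script: `extract_goals` is a hypothetical helper, and it assumes that every `N-M` token in a row is a goals-for/goals-against pair (the output above suggests the first is the conference total and the second the overall total, but the page's column headers should be checked to confirm). A row whose conference column is just `-` will yield a single pair.

```python
import re

# Matches tokens such as "40-26"; assumed here to mean goals for / goals against.
GOALS = re.compile(r'^(\d+)-(\d+)$')


def extract_goals(cols):
    """Return (team, [(goals_for, goals_against), ...]) for one table row.

    `cols` is the list of cell strings built in the script above:
    the team name first, followed by the stat columns.
    """
    team = cols[0].strip()
    goals = []
    for token in cols[1:]:
        match = GOALS.match(token.strip())
        if match:
            goals.append((int(match.group(1)), int(match.group(2))))
    return team, goals


# Sample row copied from the output above.
row = ['Air Force', '8', '2', '1', '.773', '17', '40-26',
       '9', '4', '2', '.667', '53-36', '6', '0', '1', '3', '3', '1']
team, goals = extract_goals(row)
print(team, goals)  # Air Force [(40, 26), (53, 36)]
```

Inside the scraping loop, calling `extract_goals(cols)` in place of the second `print` would give one tuple per team that can be stored or processed further.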