Fetch a Wikipedia article with Python

Question:

I am trying to fetch a Wikipedia article with Python's urllib:

import urllib

f = urllib.urlopen("http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes")
s = f.read()
f.close()

However, instead of the HTML page I get the following response, titled "Error - Wikimedia Foundation":

Request: GET http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes, from 192.35.17.11 via knsq1.knams.wikimedia.org (squid/2.6.STABLE21) to ()
Error: ERR_ACCESS_DENIED, errno [No Error] at Tue, 23 Sep 2008 09:09:08 GMT 

Wikipedia seems to block requests that do not come from a standard browser.

Does anybody know how to work around this?

Answer:

You need to use urllib2, which supersedes urllib in the Python standard library, in order to change the user agent.

Straight from the examples:

import urllib2
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
infile = opener.open('http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes')
page = infile.read()
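
The urllib2 snippet above is Python 2 code. For readers on Python 3, where urllib2 was folded into urllib.request, the same idea might look roughly like the sketch below. The helper names build_request and fetch are hypothetical and chosen only for illustration, and the 'Mozilla/5.0' value is simply carried over from the answer; whether Wikipedia accepts that value today has not been verified here.

```python
import urllib.request

# Same URL as in the question.
URL = "http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes"


def build_request(url, user_agent="Mozilla/5.0"):
    """Return a Request object that carries an explicit User-Agent header.

    Hypothetical helper: it only builds the request and does no network I/O.
    """
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent="Mozilla/5.0", timeout=10):
    """Download the page and return the raw response body as bytes.

    Hypothetical helper: the timeout of 10 seconds is an arbitrary choice.
    """
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

Calling fetch(URL) should then return the page body as bytes, which can be decoded with page.decode("utf-8") if text is needed. Splitting request construction from the download keeps the header logic easy to inspect without touching the network.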