Fetching a Wikipedia article with Python
Question:
I am trying to fetch a Wikipedia article with Python's urllib:
import urllib  # Python 2 standard library

f = urllib.urlopen("http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes")
s = f.read()
f.close()
However, instead of the HTML page I get the following response: Error - Wikimedia Foundation:
Request: GET http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes, from 192.35.17.11 via knsq1.knams.wikimedia.org (squid/2.6.STABLE21) to ()
Error: ERR_ACCESS_DENIED, errno [No Error] at Tue, 23 Sep 2008 09:09:08 GMT
Wikipedia seems to block requests that do not come from a standard browser.
Does anybody know how to work around this?
Answer:
You need to use urllib2, which supersedes urllib in the Python standard library, in order to change the user agent.
Straight from the examples:
import urllib2
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
infile = opener.open('http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes')
page = infile.read()
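
For readers on Python 3, where urllib2 no longer exists and its functionality lives in urllib.request, the same idea can be sketched roughly as follows. This is only a minimal sketch under a few assumptions: the helper names build_request and fetch are hypothetical (they are not part of any library), and the 'Mozilla/5.0' value is simply carried over from the snippet above rather than being a recommended user agent.

```python
import urllib.request

# Hypothetical default, carried over from the urllib2 snippet above.
DEFAULT_USER_AGENT = "Mozilla/5.0"


def build_request(url, user_agent=DEFAULT_USER_AGENT):
    """Return a Request object that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent=DEFAULT_USER_AGENT):
    """Open the URL with the custom User-Agent and return the raw body bytes."""
    request = build_request(url, user_agent)
    # The context manager closes the connection once the body has been read.
    with urllib.request.urlopen(request) as response:
        return response.read()
```

Calling fetch() with the URL from the question should return the page as bytes, assuming the server accepts the user agent string that was sent. Splitting out build_request keeps the header logic easy to inspect without making a network call.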