Making several thousand GET requests to sourceforge with grequests, getting "Max retries exceeded with url".
I am very new to all of this; I need to obtain data on several thousand SourceForge projects for a paper I am writing. The data is all freely available in JSON format at the URL http://sourceforge.net/api/project/name/[project name]/json. I have a list of several thousand of these URLs and I am using the following code.
import grequests
rs = (grequests.get(u) for u in ulist)
answers = grequests.map(rs)
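(For reference, ulist is simply a list of these JSON URLs; a hypothetical sketch of how such a list might be built, where the project names are placeholders except for p2p-fs, which is the one in the error message below:)

project_names = ["p2p-fs", "mingw", "sevenzip"]  # placeholders; the real list has several thousand names
ulist = ["http://sourceforge.net/api/project/name/%s/json" % name
         for name in project_names]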
Using this code I am able to obtain the data for any 200 or so projects I like, i.e. rs = (grequests.get(u) for u in ulist[0:199])
works, but as soon as I go over that, all attempts are met with
ConnectionError: HTTPConnectionPool(host='sourceforge.net', port=80): Max retries exceeded with url: /api/project/name/p2p-fs/json (Caused by <class 'socket.gaierror'>: [Errno 8] nodename nor servname provided, or not known)
<Greenlet at 0x109b790f0: <bound method AsyncRequest.send of <grequests.AsyncRequest object at 0x10999ef50>>(stream=False)> failed with ConnectionError
I am then unable to make any more requests until I quit python, but as soon as I restart python I can make another 200 requests.
I've tried using grequests.map(rs, size=200), but this seems to do nothing.
In my case, it was not rate limiting by the destination server, but something much simpler: I didn't explicitly close the responses, so they were keeping their sockets open, and the Python process ran out of file handles.
My solution (don't know for sure which one fixed the issue - theoretically either of them should) was to:
- Set stream=False in grequests.get:

rs = (grequests.get(u, stream=False) for u in urls)
- Explicitly call response.close() after reading response.content:

responses = grequests.map(rs)
for response in responses:
    make_use_of(response.content)
    response.close()
Note: simply destroying the response object (assigning None to it, calling gc.collect()) was not enough - this did not close the file handles.
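Putting the two changes together, a minimal end-to-end sketch might look like the following. The project names and the processing step (here just printing the length of each body) are placeholders, and size=20 is an arbitrary concurrency cap, not something from the question:

import grequests

# Placeholder project names; the real list holds several thousand entries.
project_names = ["p2p-fs", "mingw", "sevenzip"]
urls = ["http://sourceforge.net/api/project/name/%s/json" % name
        for name in project_names]

# stream=False tells requests to read the body eagerly instead of
# holding the socket open until .content is first accessed.
rs = (grequests.get(u, stream=False) for u in urls)

# size caps how many requests are in flight at once.
responses = grequests.map(rs, size=20)

for response in responses:
    if response is None:
        # the request itself failed (connection error, etc.)
        continue
    print(len(response.content))  # stand-in for real processing of the JSON body
    response.close()              # release the underlying file handle explicitly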