Making several thousand GET requests to sourceforge with grequests, getting "Max retries exceeded with url".
I am very new to all of this; I need to obtain data on several thousand SourceForge projects for a paper I am writing. The data is all freely available in JSON format at the URL http://sourceforge.net/api/project/name/[project name]/json. I have a list of several thousand of these URLs and I am using the following code.
import grequests
rs = (grequests.get(u) for u in ulist)
answers = grequests.map(rs)
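(For reference, ulist is simply a list of these JSON URLs; a hypothetical sketch of how such a list might be built, where the project names are placeholders except for p2p-fs, which is the one in the error message below:)

project_names = ["p2p-fs", "mingw", "sevenzip"]  # placeholders; the real list has several thousand names
ulist = ["http://sourceforge.net/api/project/name/%s/json" % name
         for name in project_names]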
Using this code I am able to obtain the data for any 200 or so projects I like, i.e. rs = (grequests.get(u) for u in ulist[0:199])
works, but as soon as I go over that, all attempts are met with
ConnectionError: HTTPConnectionPool(host='sourceforge.net', port=80): Max retries exceeded with url: /api/project/name/p2p-fs/json (Caused by <class 'socket.gaierror'>: [Errno 8] nodename nor servname provided, or not known)
<Greenlet at 0x109b790f0: <bound method AsyncRequest.send of <grequests.AsyncRequest object at 0x10999ef50>>(stream=False)> failed with ConnectionError
I am then unable to make any more requests until I quit python, but as soon as I restart python I can make another 200 requests.
I've tried using grequests.map(rs, size=200), but this seems to do nothing.
In my case, it was not rate limiting by the destination server, but something much simpler: I didn't explicitly close the responses, so they were keeping their sockets open, and the Python process ran out of file handles.
My solution (don't know for sure which one fixed the issue - theoretically either of them should) was to:
- Set stream=False in grequests.get:

rs = (grequests.get(u, stream=False) for u in urls)
- Explicitly call response.close() after reading response.content:

responses = grequests.map(rs)
for response in responses:
    make_use_of(response.content)
    response.close()
Note: simply destroying the response object (assigning None to it, calling gc.collect()) was not enough - this did not close the file handles.
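Putting the two changes together, a minimal end-to-end sketch might look like the following. The project names and the processing step (here just printing the length of each body) are placeholders, and size=20 is an arbitrary concurrency cap, not something from the question:

import grequests

# Placeholder project names; the real list holds several thousand entries.
project_names = ["p2p-fs", "mingw", "sevenzip"]
urls = ["http://sourceforge.net/api/project/name/%s/json" % name
        for name in project_names]

# stream=False tells requests to read the body eagerly instead of
# holding the socket open until .content is first accessed.
rs = (grequests.get(u, stream=False) for u in urls)

# size caps how many requests are in flight at once.
responses = grequests.map(rs, size=20)

for response in responses:
    if response is None:
        # the request itself failed (connection error, etc.)
        continue
    print(len(response.content))  # stand-in for real processing of the JSON body
    response.close()              # release the underlying file handle explicitly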