How can I batch asynchronous web requests performed in a comprehension in Python?
Not sure if this is possible; I spent some time looking at what seem like similar questions, but it's still unclear. For a list of website URLs, I need to get the HTML as a starting point.
I have a class that contains a list of these URLs, and the class returns a custom iterator that helps me iterate through them to get the HTML (simplified below):
    import urllib2

    class Url:
        def __init__(self, url):
            self.url = url

        def fetchhtml(self):
            response = urllib2.urlopen(self.url)
            return response.read()

    class MyIterator:
        def __init__(self, obj):
            self.obj = obj
            self.cnt = 0

        def __iter__(self):
            return self

        def next(self):
            try:
                result = self.obj.get(self.cnt)
                self.cnt += 1
                return result
            except IndexError:
                raise StopIteration

    class Urls:
        def __init__(self, url_list=None):
            self.list = url_list or []

        def __iter__(self):
            return MyIterator(self)

        def get(self, index):
            return self.list[index]
2 - I want to be able to use it like this:
    url_list = [url1, url2, url3]
    urls = Urls(url_list)
    html_image_list = {url.url: re.search('@src="([^"]+)"', url.fetchhtml()) for url in urls}
3 - The problem I have is that I want to batch all the requests rather than having fetchhtml operate sequentially on my list, and once they are all done, extract the image list.
Is there a way to achieve this, maybe using threads and a queue? I cannot see how to make the comprehension for my object work like this without it running sequentially. Maybe this is the wrong approach, but I just want to batch long-running requests initiated by operations within a list or dict comprehension. Thank you in advance.
You need to use threading or multiprocessing.
Also, in Python 3 there is concurrent.futures. Take a look at ThreadPoolExecutor and ProcessPoolExecutor.
The example in the docs for ThreadPoolExecutor does almost exactly what you are asking:
    import concurrent.futures
    import urllib.request

    URLS = ['http://www.foxnews.com/',
            'http://www.cnn.com/',
            'http://europe.wsj.com/',
            'http://www.bbc.co.uk/',
            'http://some-made-up-domain.com/']

    # Retrieve a single page and report the url and contents
    def load_url(url, timeout):
        conn = urllib.request.urlopen(url, timeout=timeout)
        return conn.read()

    # We can use a with statement to ensure threads are cleaned up promptly
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        # Start the load operations and mark each future with its URL
        future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
        for future in concurrent.futures.as_completed(future_to_url):
            url = future_to_url[future]
            try:
                data = future.result()
            except Exception as exc:
                print('%r generated an exception: %s' % (url, exc))
            else:
                print('%r page is %d bytes' % (url, len(data)))
- Note: similar functionality is available for Python 2 via the futures backport package on PyPI.
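Tying this back to the question: the same submit/as_completed pattern can replace the sequential comprehension. The sketch below is an assumption about how you might wire it up — fetch_all is a hypothetical helper, and fake_fetch is a network-free stand-in for your Url.fetchhtml so the example runs anywhere; it reuses the question's idea of pulling an image src out of each page:

```python
import concurrent.futures
import re

def fetch_all(urls, fetch, max_workers=5):
    """Run fetch(url) for each url in parallel; return {url: result}.
    URLs whose fetch raised an exception map to None."""
    results = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_url = {executor.submit(fetch, url): url for url in urls}
        for future in concurrent.futures.as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()
            except Exception:
                results[url] = None
    return results

# Network-free stand-in for Url.fetchhtml so the sketch runs anywhere
def fake_fetch(url):
    return '<img src="%s/logo.png">' % url

pages = fetch_all(['http://a.example', 'http://b.example'], fake_fetch)
# Extract the first img src from each fetched page, as in the
# question's dict comprehension
html_image_list = {url: re.search(r'src="([^"]+)"', html).group(1)
                   for url, html in pages.items() if html}
print(html_image_list)
```

All requests run concurrently inside fetch_all, and the comprehension afterwards stays an ordinary dict comprehension over the completed results.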