How can I batch asynchronous web requests performed in a comprehension in Python?
Not sure if this is possible; I spent some time looking at what seem like similar questions, but it's still unclear. For a list of website URLs, I need to get the HTML as a starting point.
I have a class that contains a list of these URLs, and the class returns a custom iterator that helps me iterate through them to get the HTML (simplified below):
    import urllib2

    class Url:
        def __init__(self, url):
            self.url = url

        def fetchhtml(self):
            response = urllib2.urlopen(self.url)
            return response.read()

    class MyIterator:
        def __init__(self, obj):
            self.obj = obj
            self.cnt = 0

        def __iter__(self):
            return self

        def next(self):
            try:
                result = self.obj.get(self.cnt)
                self.cnt += 1
                return result
            except IndexError:
                raise StopIteration

    class Urls:
        def __init__(self, url_list=None):
            self.list = url_list or []

        def __iter__(self):
            return MyIterator(self)

        def get(self, index):
            return self.list[index]
2 - I want to be able to use it like this:
    url_list = [url1, url2, url3]
    urls = Urls(url_list)
    html_image_list = {url.url: re.search('@src="([^"]+)"', url.fetchhtml()) for url in urls}
3 - The problem I have is that I want to batch all the requests rather than having fetchhtml operate sequentially on my list, and once they are all done, extract the image list.
Is there a way to achieve this, maybe using threads and a queue? I cannot see how to make the comprehension for my object work like this without it running sequentially. Maybe this is the wrong approach, but I just want to batch long-running requests initiated by operations within a list or dict comprehension. Thank you in advance.
You need to use threading or multiprocessing.
Also, in Python 3 there is concurrent.futures. Take a look at ThreadPoolExecutor and ProcessPoolExecutor.
The example in the docs for ThreadPoolExecutor does almost exactly what you are asking:
    import concurrent.futures
    import urllib.request

    URLS = ['http://www.foxnews.com/',
            'http://www.cnn.com/',
            'http://europe.wsj.com/',
            'http://www.bbc.co.uk/',
            'http://some-made-up-domain.com/']

    # Retrieve a single page and report the url and contents
    def load_url(url, timeout):
        conn = urllib.request.urlopen(url, timeout=timeout)
        return conn.read()

    # We can use a with statement to ensure threads are cleaned up promptly
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        # Start the load operations and mark each future with its URL
        future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
        for future in concurrent.futures.as_completed(future_to_url):
            url = future_to_url[future]
            try:
                data = future.result()
            except Exception as exc:
                print('%r generated an exception: %s' % (url, exc))
            else:
                print('%r page is %d bytes' % (url, len(data)))
- Note: similar functionality is available for Python 2 via the futures backport package on PyPI.
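Tying this back to the question: the same submit/as_completed pattern can replace the sequential comprehension. The sketch below is an assumption about how you might wire it up — fetch_all is a hypothetical helper, and fake_fetch is a network-free stand-in for your Url.fetchhtml so the example runs anywhere; it reuses the question's idea of pulling an image src out of each page:

```python
import concurrent.futures
import re

def fetch_all(urls, fetch, max_workers=5):
    """Run fetch(url) for each url in parallel; return {url: result}.
    URLs whose fetch raised an exception map to None."""
    results = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_url = {executor.submit(fetch, url): url for url in urls}
        for future in concurrent.futures.as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()
            except Exception:
                results[url] = None
    return results

# Network-free stand-in for Url.fetchhtml so the sketch runs anywhere
def fake_fetch(url):
    return '<img src="%s/logo.png">' % url

pages = fetch_all(['http://a.example', 'http://b.example'], fake_fetch)
# Extract the first img src from each fetched page, as in the
# question's dict comprehension
html_image_list = {url: re.search(r'src="([^"]+)"', html).group(1)
                   for url, html in pages.items() if html}
print(html_image_list)
```

All requests run concurrently inside fetch_all, and the comprehension afterwards stays an ordinary dict comprehension over the completed results.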