Fetching HTML from a foreign website keeps timing out after a while. How can I fix this?
I am fetching HTML from a foreign website, and after a while it fails with a timeout error.
My script, E:\spider\askville\down_askville.py (Python 2.7):
#! /usr/bin/env python
#coding=utf-8
import re
import urllib

def getHtml(url):
    html = urllib.urlopen(url).read()
    return html

i = 0
j = 0
while j < 120:
    # change the category here:
    # Arts, Computers, Family, Health, Home, Lifestyle, Sports+%26+Recreation
    b = 'http://askville.amazon.com/Computers/Category.do?cat=Computers&page=' + str(j) + '&filter=AllQAndA'
    h = getHtml(b)
    urllib.urlretrieve(b, r'E://spider//askville//Arts//%d.html' % i)
    i += 1
    j += 1
Fetching is slow, which I can live with, but it keeps failing with this error:
Traceback (most recent call last):
  File "E:\spider\askville\down_askville.py", line 40, in <module>
    h = getHtml(b)
  File "E:\spider\askville\down_askville.py", line 15, in getHtml
    html = urllib.urlopen(url).read()
  File "D:\Python27\lib\urllib.py", line 87, in urlopen
    return opener.open(url)
  File "D:\Python27\lib\urllib.py", line 208, in open
    return getattr(self, name)(url)
  File "D:\Python27\lib\urllib.py", line 346, in open_http
    errcode, errmsg, headers = h.getreply()
  File "D:\Python27\lib\httplib.py", line 1117, in getreply
    response = self._conn.getresponse()
  File "D:\Python27\lib\httplib.py", line 1045, in getresponse
    response.begin()
  File "D:\Python27\lib\httplib.py", line 409, in begin
    version, status, reason = self._read_status()
  File "D:\Python27\lib\httplib.py", line 365, in _read_status
    line = self.fp.readline(_MAXLINE + 1)
  File "D:\Python27\lib\socket.py", line 476, in readline
    data = self._sock.recv(self._rbufsize)
IOError: [Errno socket error] timed out
[Finished in 237.1s with exit code 1]
------Suggested solution----------------------
How fast is your own access to the site? If your network is slow, timeouts are perfectly normal. One quick way to check is sketched below.
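A minimal sketch of such a check, using the same urllib call as the question (the front-page URL is just an illustration):

import time
import urllib

start = time.time()
data = urllib.urlopen('http://askville.amazon.com/').read()
print('fetched %d bytes in %.1f seconds' % (len(data), time.time() - start))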
------Suggested solution----------------------
The whole point of a timeout is to say "this is too slow, stop waiting", so timeouts are completely normal on a slow connection.
You can catch the IOError that the timeout raises with try/except and then re-issue the request.
Also, the functions in urllib2 accept a timeout argument, so you can pass a larger value. But no matter how large the timeout is, an IOError is always possible, so handling the exception is still necessary. A sketch of both ideas follows.
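A minimal sketch of this advice (Python 2; the three retries and the 30-second timeout are arbitrary example values, not anything prescribed in this thread):

import urllib2

def getHtmlWithRetry(url, retries=3, timeout=30):
    # urllib2.urlopen has accepted a timeout (in seconds) since Python 2.6
    for attempt in range(1, retries + 1):
        try:
            return urllib2.urlopen(url, timeout=timeout).read()
        except IOError as e:  # socket.timeout is a subclass of IOError here
            print('attempt %d on %s failed: %s' % (attempt, url, e))
    raise IOError('gave up on %s after %d attempts' % (url, retries))

The retry loop, not the size of the timeout, is what lets the downloader survive a flaky connection.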
------Suggested solution----------------------
You could first try a site inside China, to see whether the problem is the international link.
------Suggested solution----------------------
It is also possible that the site thinks you are attacking it.
Try sleeping between requests, for example as sketched below.
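For example (Python 2; the 2-second pause is a guess, tune it to whatever the site tolerates):

import time
import urllib2

for j in range(120):
    url = ('http://askville.amazon.com/Computers/Category.do'
           '?cat=Computers&page=%d&filter=AllQAndA' % j)
    html = urllib2.urlopen(url, timeout=30).read()
    time.sleep(2)  # pause so the requests look less like an attack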
------Suggested solution----------------------
Python 3.3, downloading the pages concurrently:

import urllib.request
import concurrent.futures

URLS = ['http://askville.amazon.com/Computers/Category.do?cat=Computers&page=%s&filter=AllQAndA' % i
        for i in range(120)]

def load_url(url, timeout):
    conn = urllib.request.urlopen(url, timeout=timeout)
    return conn.read()

def save_file(fname, content):
    print('fname: %s' % fname)
    # content is bytes, so write in binary mode; opening in text mode and
    # writing str(content) would save the repr of the bytes object instead
    with open(fname, 'wb') as saveFile:
        saveFile.write(content)

with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r page is %d bytes' % (url, len(data)))
            index = url.split('&')[1].split('=')[1]
            save_file('c:/%s.html' % index, data)