Fetching HTML from a foreign website keeps timing out after a while. How can I fix this?
I am fetching HTML from a foreign website, and after a while it fails with a timeout error.
My script, E:\spider\askville\down_askville.py (Python 2.7):
#! /usr/bin/env python
#coding=utf-8
import re
import urllib

def getHtml(url):
    html = urllib.urlopen(url).read()
    return html

i = 0
j = 0
while j < 120:
    # change the category here:
    # Arts, Computers, Family, Health, Home, Lifestyle, Sports+%26+Recreation
    b = 'http://askville.amazon.com/Computers/Category.do?cat=Computers&page=' + str(j) + '&filter=AllQAndA'
    h = getHtml(b)
    urllib.urlretrieve(b, r'E://spider//askville//Arts//%d.html' % i)
    i += 1
    j += 1
Fetching is slow, which I can live with, but it keeps failing with this error:
Traceback (most recent call last):
  File "E:\spider\askville\down_askville.py", line 40, in <module>
    h = getHtml(b)
  File "E:\spider\askville\down_askville.py", line 15, in getHtml
    html = urllib.urlopen(url).read()
  File "D:\Python27\lib\urllib.py", line 87, in urlopen
    return opener.open(url)
  File "D:\Python27\lib\urllib.py", line 208, in open
    return getattr(self, name)(url)
  File "D:\Python27\lib\urllib.py", line 346, in open_http
    errcode, errmsg, headers = h.getreply()
  File "D:\Python27\lib\httplib.py", line 1117, in getreply
    response = self._conn.getresponse()
  File "D:\Python27\lib\httplib.py", line 1045, in getresponse
    response.begin()
  File "D:\Python27\lib\httplib.py", line 409, in begin
    version, status, reason = self._read_status()
  File "D:\Python27\lib\httplib.py", line 365, in _read_status
    line = self.fp.readline(_MAXLINE + 1)
  File "D:\Python27\lib\socket.py", line 476, in readline
    data = self._sock.recv(self._rbufsize)
IOError: [Errno socket error] timed out
[Finished in 237.1s with exit code 1]
------Suggested solution----------------------
How fast is your own access to the site? If your network is slow, timeouts are perfectly normal. One quick way to check is sketched below.
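A minimal sketch of such a check, using the same urllib call as the question (the front-page URL is just an illustration):

import time
import urllib

start = time.time()
data = urllib.urlopen('http://askville.amazon.com/').read()
print('fetched %d bytes in %.1f seconds' % (len(data), time.time() - start))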
------Suggested solution----------------------
The whole point of a timeout is to say "this is too slow, stop waiting", so timeouts are completely normal on a slow connection.
You can catch the IOError that the timeout raises with try/except and then re-issue the request.
Also, the functions in urllib2 accept a timeout argument, so you can pass a larger value. But no matter how large the timeout is, an IOError is always possible, so handling the exception is still necessary. A sketch of both ideas follows.
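A minimal sketch of this advice (Python 2; the three retries and the 30-second timeout are arbitrary example values, not anything prescribed in this thread):

import urllib2

def getHtmlWithRetry(url, retries=3, timeout=30):
    # urllib2.urlopen has accepted a timeout (in seconds) since Python 2.6
    for attempt in range(1, retries + 1):
        try:
            return urllib2.urlopen(url, timeout=timeout).read()
        except IOError as e:  # socket.timeout is a subclass of IOError here
            print('attempt %d on %s failed: %s' % (attempt, url, e))
    raise IOError('gave up on %s after %d attempts' % (url, retries))

The retry loop, not the size of the timeout, is what lets the downloader survive a flaky connection.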
------Suggested solution----------------------
You could first try a site inside China, to see whether the problem is the international link.
------Suggested solution----------------------
It is also possible that the site thinks you are attacking it.
Try sleeping between requests, for example as sketched below.
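For example (Python 2; the 2-second pause is a guess, tune it to whatever the site tolerates):

import time
import urllib2

for j in range(120):
    url = ('http://askville.amazon.com/Computers/Category.do'
           '?cat=Computers&page=%d&filter=AllQAndA' % j)
    html = urllib2.urlopen(url, timeout=30).read()
    time.sleep(2)  # pause so the requests look less like an attack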
------Suggested solution----------------------
Python 3.3, downloading the pages concurrently:

import urllib.request
import concurrent.futures

URLS = ['http://askville.amazon.com/Computers/Category.do?cat=Computers&page=%s&filter=AllQAndA' % i
        for i in range(120)]

def load_url(url, timeout):
    conn = urllib.request.urlopen(url, timeout=timeout)
    return conn.read()

def save_file(fname, content):
    print('fname: %s' % fname)
    # content is bytes, so write in binary mode; opening in text mode and
    # writing str(content) would save the repr of the bytes object instead
    with open(fname, 'wb') as saveFile:
        saveFile.write(content)

with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r page is %d bytes' % (url, len(data)))
            index = url.split('&')[1].split('=')[1]
            save_file('c:/%s.html' % index, data)