Scrapy - Setting the TCP connection timeout

Problem description:

I'm trying to scrape a website with Scrapy. However, the website is extremely slow at times: in a browser, the first request takes almost 15-20 seconds to get a response. Sometimes, when I crawl the website with Scrapy, I keep getting a TCP timeout error, even though the website opens just fine in my browser. Here's the message:

2017-09-05 17:34:41 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://www.hosane.com/result/specialList> (failed 16 times): TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond..

I have even overridden the USER_AGENT setting for testing. I don't think the DOWNLOAD_TIMEOUT setting applies in this case, since it defaults to 180 seconds, and Scrapy doesn't even take 20-30 seconds before giving a TCP timeout error.

Any idea what is causing this issue? Is there a way to set the TCP timeout in Scrapy?

Answer:

A "TCP connection timed out" error can happen before the Scrapy-specified DOWNLOAD_TIMEOUT because the actual initial TCP connect timeout is defined by the OS, usually in terms of TCP SYN packet retransmissions.

By default on my Linux box, I have 6 retransmissions:

cat /proc/sys/net/ipv4/tcp_syn_retries
6
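
The same value can be read from Python, for instance to log it when a spider starts. This is only a minimal sketch that assumes a Linux host; `read_syn_retries` is a hypothetical helper name, not part of Scrapy, and it returns `None` where the file does not exist (for example on Windows).

```python
from pathlib import Path
from typing import Optional

# Linux-only kernel setting; this path does not exist on Windows or macOS.
SYN_RETRIES_PATH = "/proc/sys/net/ipv4/tcp_syn_retries"


def read_syn_retries(path: str = SYN_RETRIES_PATH) -> Optional[int]:
    """Return the configured SYN retransmission count, or None if unavailable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        # File missing, unreadable, or not an integer: report "unknown".
        return None


if __name__ == "__main__":
    print(read_syn_retries())
```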

In practice, for Scrapy too, this means waiting 0 + 1 + 2 + 4 + 8 + 16 + 32 (+ 64) = 127 seconds before receiving a twisted.internet.error.TCPTimedOutError: TCP connection timed out: 110: Connection timed out. from Twisted. (That's the initial attempt, then exponential backoff between each retry, and finally 64 seconds of waiting without a reply after the 6th retry.)
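
The arithmetic above can be reproduced with a short sketch. It assumes an initial retransmission timeout of 1 second that doubles after every retry, which is what the sum in this answer implies, and it ignores any upper bound the kernel may place on the timeout. `syn_schedule` is a hypothetical helper name used only for illustration.

```python
from typing import List, Tuple


def syn_schedule(syn_retries: int, initial_rto: float = 1.0) -> Tuple[List[float], float]:
    """Return (send time of every SYN, time at which the connect attempt fails)."""
    send_times = [0.0]  # the initial SYN goes out at t = 0
    wait = initial_rto
    for _ in range(syn_retries):
        send_times.append(send_times[-1] + wait)  # retransmit after the current wait
        wait *= 2  # exponential backoff (assumed uncapped here)
    give_up_at = send_times[-1] + wait  # final wait for a reply that never comes
    return send_times, give_up_at


if __name__ == "__main__":
    times, give_up = syn_schedule(6)
    print(times)    # [0.0, 1.0, 3.0, 7.0, 15.0, 31.0, 63.0]
    print(give_up)  # 127.0
```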

If I set /proc/sys/net/ipv4/tcp_syn_retries to 8, for example, I can verify that I receive this instead:

User timeout caused connection failure: Getting http://www.hosane.com/result/specialList took longer than 180.0 seconds.

That's because 0 + 1 + 2 + 4 + 8 + 16 + 32 + 64 + 128 (+ 256) > 180.
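
To see which of the two timeouts fires first for a given retry count, the comparison can be sketched as follows. It uses the same simplified model as this answer (1 second initial timeout, doubling, no cap), and `first_timeout` and the labels it returns are hypothetical names for illustration, not Scrapy or Twisted identifiers.

```python
from typing import Tuple


def first_timeout(syn_retries: int, download_timeout: float = 180.0) -> Tuple[str, float]:
    """Return which timeout is hit first and after how many seconds."""
    # Total OS wait: 1 + 2 + 4 + ... + 2**syn_retries = 2**(syn_retries + 1) - 1
    os_connect_timeout = float(2 ** (syn_retries + 1) - 1)
    if os_connect_timeout < download_timeout:
        return "os_tcp_timeout", os_connect_timeout
    return "download_timeout", download_timeout


if __name__ == "__main__":
    print(first_timeout(6))  # ('os_tcp_timeout', 127.0)
    print(first_timeout(8))  # ('download_timeout', 180.0)
```

With 6 retries the OS gives up at 127 seconds, before the 180-second DOWNLOAD_TIMEOUT; with 8 retries the OS would wait far longer than 180 seconds, so the Scrapy-side timeout is reported instead.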

"10060: A connection attempt failed..." looks like a Windows socket error code (WSAETIMEDOUT). If you want the TCP connection timeout to be at least as long as DOWNLOAD_TIMEOUT, you'll need to change the TCP SYN retry count. (I don't know how to do that on your system, but Google is your friend.)