How does pause/resume work in Scrapy?

Problem description:

Can someone explain to me how the pause/resume feature in Scrapy works?

The Scrapy version I am using is 0.24.5.

The documentation does not provide much detail.

I have the following simple spider:

from scrapy import Spider, Request


class SampleSpider(Spider):
    name = 'sample'

    def start_requests(self):
        yield Request(url='https://colostate.textbookrack.com/listingDetails?lst_id=1053')
        yield Request(url='https://colostate.textbookrack.com/listingDetails?lst_id=1054')
        yield Request(url='https://colostate.textbookrack.com/listingDetails?lst_id=1055')

    def parse(self, response):
        # Record each crawled URL so I can see which requests were processed
        with open('responses.txt', 'a') as f:
            f.write(response.url + '\n')

I am running it with:

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from scrapy.utils.project import get_project_settings

from scrapyproject.spiders.sample_spider import SampleSpider

spider = SampleSpider()
settings = get_project_settings()
settings.set('JOBDIR', '/some/path/scrapy_cache')  # persist crawl state here
settings.set('DOWNLOAD_DELAY', 10)
crawler = Crawler(settings)
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run()

As you can see, I enabled the JOBDIR option so that I can save the state of my crawl.
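
For reference, the same setting can also be enabled in the project's settings.py (or passed with -s JOBDIR=... on the command line) instead of being set in a script. A minimal sketch; the directory name crawls/sample-1 is just an example for one run:

# settings.py
# Each distinct crawl that you want to pause/resume should get its own JOBDIR.
JOBDIR = 'crawls/sample-1'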

I set DOWNLOAD_DELAY to 10 seconds so that I can stop the spider before the requests are processed. I expected that the next time I ran the spider the requests would not be regenerated, but that is not the case.

In my scrapy_cache folder I see a directory named requests.queue; however, it is always empty.

It looks like the requests.seen file is saving the issued requests (as SHA1 hashes), which is great. However, the next time I run the spider, the requests are regenerated and the (duplicate) SHA1 hashes are appended to the file. I traced this in the Scrapy code, and it looks like RFPDupeFilter opens the requests.seen file with the 'a+' flag, so it always discards the previous values in the file (at least that is the behavior on my Mac OS X).
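
To illustrate the behavior being described, here is a simplified stand-in for a JOBDIR-backed duplicate filter; it is modeled on what RFPDupeFilter does, not a copy of Scrapy's source. The point of interest is that a file opened with 'a+' can have its read position at end-of-file on some platforms, so nothing is read back unless the file is rewound first:

import os

class SketchDupeFilter(object):
    """Illustrative sketch only, not Scrapy's actual RFPDupeFilter."""

    def __init__(self, path=None):
        self.fingerprints = set()
        self.file = None
        if path:
            self.file = open(os.path.join(path, 'requests.seen'), 'a+')
            # On some platforms 'a+' leaves the read pointer at EOF,
            # so without this seek(0) the update() below sees nothing.
            self.file.seek(0)
            self.fingerprints.update(line.rstrip() for line in self.file)

    def request_seen(self, fingerprint):
        # 'fingerprint' stands in for Scrapy's request_fingerprint(request)
        if fingerprint in self.fingerprints:
            return True
        self.fingerprints.add(fingerprint)
        if self.file:
            self.file.write(fingerprint + os.linesep)
        return False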

Finally, regarding spider state, I can see from the Scrapy code that the spider state is saved when the spider is closed and read back when it is opened. However, that is not very helpful if an exception occurs (e.g., the machine shuts down). Do I have to save it periodically?
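
For what it's worth, when JOBDIR is set Scrapy also persists a spider.state dict between runs (it is pickled when the spider closes and restored when it opens). A minimal sketch of keeping a resumable counter in it; the state attribute is real, but the counter logic is just an assumption for illustration:

from scrapy import Spider


class ResumableSpider(Spider):
    name = 'resumable'
    start_urls = ['https://colostate.textbookrack.com/listingDetails?lst_id=1053']

    def parse(self, response):
        # self.state is saved in the JOBDIR on a clean shutdown and restored
        # on the next run, so it survives an orderly pause/resume
        # (but not a hard crash).
        done = self.state.get('pages_done', 0)
        self.state['pages_done'] = done + 1
        self.log('pages done so far: %d' % self.state['pages_done'])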

The main question I have here is: what is the common practice for using Scrapy when you expect the crawl to stop and resume multiple times (e.g., when crawling a very large website)?

Answer:

To be able to pause and resume a Scrapy crawl, you can run this command to start it:

scrapy crawl somespider --set JOBDIR=crawl1

To stop the crawl you should press Ctrl-C, but press it only once and wait for Scrapy to shut down; if you press Ctrl-C twice it will not resume properly.

Then you can resume the crawl by running the same command again:

scrapy crawl somespider --set JOBDIR=crawl1
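
As a quick sanity check between runs, you can look at what was persisted in the job directory (typically the requests.queue directory, the requests.seen file, and a spider state file). A minimal sketch, assuming the same crawl1 directory used above:

import os

jobdir = 'crawl1'  # the JOBDIR passed on the command line above
for root, dirs, files in os.walk(jobdir):
    for name in files:
        path = os.path.join(root, name)
        print(path, os.path.getsize(path), 'bytes')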