Scrapy: How do I manually insert a request from a spider_idle event callback?
I've created a spider, and have linked a method to the spider_idle event.
How do I add a request manually? I can't just return the item from parse -- parse is not running in this case, as all known URLs have been parsed. I have a method to generate new requests, and I would like to run it from the spider_idle callback to add the created request(s).
class FooSpider(BaseSpider):
    name = 'foo'

    def __init__(self):
        dispatcher.connect(self.dont_close_me, signals.spider_idle)

    def dont_close_me(self, spider):
        if spider != self:
            return
        # The engine instance will allow me to schedule requests, but
        # how do I get the engine object?
        engine = unknown_get_engine()
        engine.schedule(self.create_request())
        # afterward, ensure we stay alive by raising DontCloseSpider
        raise DontCloseSpider("..I prefer live spiders.")
UPDATE: I've determined that I probably need the ExecutionEngine object, but I don't know exactly how to get it from a spider, though it is available from a Crawler instance.
UPDATE 2: ..thanks. ..crawler is attached as a property of the superclass, so I can just use self.crawler with no additional effort. >.>
class FooSpider(BaseSpider):
    def __init__(self, *args, **kwargs):
        super(FooSpider, self).__init__(*args, **kwargs)
        dispatcher.connect(self.dont_close_me, signals.spider_idle)

    def dont_close_me(self, spider):
        if spider != self:
            return
        self.crawler.engine.crawl(self.create_request(), spider)
        raise DontCloseSpider("..I prefer live spiders.")
UPDATE 2016:
class FooSpider(BaseSpider):
    yet = False

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        from_crawler = super(FooSpider, cls).from_crawler
        spider = from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.idle, signal=scrapy.signals.spider_idle)
        return spider

    def idle(self):
        if not self.yet:
            self.crawler.engine.crawl(self.create_request(), self)
            self.yet = True
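
For anyone trying to picture the control flow, here is a rough, standard-library-only sketch of what happens around the idle callback. It is a toy model, not Scrapy's engine: ToyEngine, ToySpider, the extra_batches attribute, and the local DontCloseSpider class are all hypothetical stand-ins made up for illustration, and requests are reduced to plain strings. The only idea it is meant to show is "queue drains, idle handler schedules more work, handler raises an exception to keep the spider open".

```python
from collections import deque


class DontCloseSpider(Exception):
    """Toy stand-in for scrapy.exceptions.DontCloseSpider (not the real class)."""


class ToyEngine:
    """Hypothetical, heavily simplified model of a crawl loop."""

    def __init__(self, spider):
        self.spider = spider
        self.queue = deque(spider.start_urls)
        self.processed = []

    def crawl(self, url):
        # Schedule one more "request" (here just a string).
        self.queue.append(url)

    def run(self):
        while True:
            # Work through everything that is currently scheduled.
            while self.queue:
                self.processed.append(self.queue.popleft())
            # Queue drained: fire the idle callback so the spider can add work.
            try:
                self.spider.idle(self)
            except DontCloseSpider:
                continue  # the handler asked to stay open
            if not self.queue:
                break  # nothing new was scheduled, so close the spider
        return self.processed


class ToySpider:
    start_urls = ["page-1", "page-2"]

    def __init__(self):
        # Assumed source of late-arriving work, for illustration only.
        self.extra_batches = deque([["page-3"], ["page-4", "page-5"]])

    def idle(self, engine):
        if not self.extra_batches:
            return  # no more work: let the engine close the spider
        for url in self.extra_batches.popleft():
            engine.crawl(url)
        raise DontCloseSpider("more requests were scheduled")


result = ToyEngine(ToySpider()).run()
print(result)  # ['page-1', 'page-2', 'page-3', 'page-4', 'page-5']
```

In real Scrapy the handler would call self.crawler.engine.crawl(...) as in the snippets above. The spider argument to engine.crawl was deprecated in later Scrapy releases, so check the signature in your installed version.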