How do I add instance variables to a Scrapy CrawlSpider?

Problem description:

I am running a CrawlSpider and I want to implement some logic to stop following some of the links in mid-run, by passing a function to process_request.

This function uses the spider's class variables in order to keep track of the current state, and depending on it (and on the referrer URL), links get dropped or continue to be processed:

from scrapy.exceptions import IgnoreRequest
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class BroadCrawlSpider(CrawlSpider):
    name = 'bitsy'
    start_urls = ['http://scrapy.org']
    foo = 5

    rules = (
        Rule(LinkExtractor(), callback='parse_item', process_request='filter_requests', follow=True),
    )

    def parse_item(self, response):
        pass  # <some code>

    def filter_requests(self, request):
        if self.foo == 6 and request.headers.get('Referer', None) == someval:
            raise IgnoreRequest("Ignored request: bla %s" % request)
        return request
I think that if I were to run several spiders on the same machine, they would all use the same class variables which is not my intention.

Is there a way to add instance variables to CrawlSpiders? Is only a single instance of the spider created when I run Scrapy?

I could probably work around it with a dictionary with values per process ID, but that will be ugly...
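The worry about shared state comes down to plain Python class-attribute semantics: a class variable like foo = 5 lives on the class object, so every instance reads the same value until it assigns its own. A minimal sketch, with no Scrapy involved:

```python
class SharedState:
    foo = 5  # class variable, stored on the class object


a = SharedState()
b = SharedState()

a.foo = 6  # creates an instance attribute on `a` that shadows the class one

print(a.foo)            # 6 (instance attribute)
print(b.foo)            # 5 (still reads the class attribute)
print(SharedState.foo)  # 5 (class variable itself is unchanged)
```

Note the asymmetry: assigning through `self` shadows the class attribute, but mutating a shared mutable class attribute (e.g. appending to a class-level list) would be visible to every instance in the same process.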

I think spider arguments would be the solution in your case.

When invoking scrapy like scrapy crawl some_spider, you could add arguments like scrapy crawl some_spider -a foo=bar, and the spider would receive the values via its constructor, e.g.:

class SomeSpider(scrapy.Spider):
    def __init__(self, foo=None, *args, **kwargs):
        super(SomeSpider, self).__init__(*args, **kwargs)
        # Do something with foo
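One thing to watch: values passed with -a always arrive as strings, so convert them in __init__ if you need another type. A sketch of the conversion in plain Python (the real spider would subclass scrapy.Spider, and int is just one assumed target type):

```python
class SomeSpider:  # sketch only; real code would subclass scrapy.Spider
    name = 'some_spider'

    def __init__(self, foo=None, **kwargs):
        # a value from `-a foo=6` arrives as the string '6'; convert explicitly
        self.foo = int(foo) if foo is not None else None


spider = SomeSpider(foo='6')
print(spider.foo)  # 6
```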

What's more, since scrapy.Spider actually sets all additional arguments as instance attributes, you don't even need to override the __init__ method explicitly; you can just access the .foo attribute. :)
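That behavior can be sketched with a small stand-in: every extra keyword argument lands directly on the instance, so a subclass with no __init__ of its own still sees .foo. (SpiderBase below is a hypothetical stand-in mimicking this one aspect, not the real scrapy.Spider.)

```python
class SpiderBase:
    """Hypothetical stand-in mimicking how scrapy.Spider stores extra kwargs."""

    def __init__(self, name=None, **kwargs):
        # every leftover keyword argument becomes an instance attribute
        self.__dict__.update(kwargs)


class LazySpider(SpiderBase):
    name = 'lazy_spider'  # no __init__ override needed


spider = LazySpider(foo='bar')
print(spider.foo)  # 'bar'
```

Because each spider object gets its own attribute, running several spiders in the same process no longer shares the value through the class.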