Passing scraped URLs from one spider to another

Problem description:

How can I send the scraped URLs from one spider to the start_urls of another spider?

Specifically, I want to run one spider that gets a list of URLs from an XML page. After the URLs have been retrieved, I want them to be used by another spider for scraping.

from scrapy.spiders import SitemapSpider

class Daily(SitemapSpider):
    name = 'daily'
    sitemap_urls = ['http://example.com/sitemap.xml']

    def parse(self, response):
        # SitemapSpider calls parse() for every URL found in the sitemap.
        print(response.url)

        # How do I send these URLs to another spider instead?

        yield {
            'url': response.url
        }

From the first spider you can save the URLs in a database, or send them to a queue (ZeroMQ, RabbitMQ, Redis), for example via an item pipeline.
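For example, here is a minimal sketch of such a pipeline, assuming the redis-py package and a Redis server on localhost; the class name and the urls:queue key are made up for this example:

import redis

class UrlToRedisPipeline:
    """Pushes every scraped URL onto a Redis list."""

    def open_spider(self, spider):
        # Assumes a Redis server on localhost:6379.
        self.client = redis.Redis(host='localhost', port=6379)

    def process_item(self, item, spider):
        # 'urls:queue' is an arbitrary key name chosen for this sketch.
        self.client.rpush('urls:queue', item['url'])
        return item

You would enable it in settings.py, e.g. ITEM_PIPELINES = {'myproject.pipelines.UrlToRedisPipeline': 300}.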

The second spider can then pick up the URLs in its start_requests method:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        # my_db stands for whatever storage the first spider wrote to.
        urls = my_db.orm.get('urls')
        for url in urls:
            yield scrapy.Request(url)
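Continuing the Redis sketch from above, the second spider's start_requests could drain the same list (again assuming redis-py and the hypothetical urls:queue key):

import redis
import scrapy

class DetailSpider(scrapy.Spider):
    name = 'detail'

    def start_requests(self):
        client = redis.Redis(host='localhost', port=6379)
        # Pop URLs until the list filled by the first spider is empty.
        while True:
            url = client.lpop('urls:queue')
            if url is None:
                break
            yield scrapy.Request(url.decode('utf-8'))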

Alternatively, the URLs can be passed to the spider from a queue broker via the CLI or an API. Or the spider can simply be launched by the broker and fetch its own URLs in start_requests, as sketched below.
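As a sketch of the CLI route: Scrapy passes -a command-line arguments to the spider's constructor, so a comma-separated URL list can be handed over like this (the urls argument name is just an illustration):

import scrapy

class CliSpider(scrapy.Spider):
    name = 'clispider'

    def __init__(self, urls=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # 'urls' arrives from the command line via -a urls=...
        self.start_urls = urls.split(',') if urls else []

It would then be launched with something like:

scrapy crawl clispider -a urls=http://example.com/a,http://example.com/b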

There are really many ways to do this; which one fits depends on why you need to pass URLs from one spider to another.

You can check out these projects: Scrapy-Cluster and Scrapy-Redis. They may be what you are looking for.
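For instance, with Scrapy-Redis the second spider can subclass RedisSpider and wait for start URLs pushed to a Redis list (a minimal sketch; you would also configure the Redis connection in settings.py, e.g. via REDIS_URL):

from scrapy_redis.spiders import RedisSpider

class MyRedisSpider(RedisSpider):
    name = 'myredisspider'
    # The spider idles and consumes URLs as they are pushed to this list.
    redis_key = 'myredisspider:start_urls'

    def parse(self, response):
        yield {'url': response.url}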