How do I add headers to a Scrapy CrawlSpider request?

Problem description:

I'm working with the CrawlSpider class to crawl a website and I would like to modify the headers that are sent in each request. Specifically, I would like to add the referer to the request.

Following this question, I checked

response.request.headers.get('Referer', None)

in my response parsing function and the Referer header is not present. I assume that means the Referer is not being submitted in the request (unless the website doesn't return it, I'm not sure on that).

I haven't been able to figure out how to modify the headers of a request. Again, my spider is derived from CrawlSpider. Overriding CrawlSpider's _requests_to_follow or specifying a process_request callback for a rule will not work because the referer is not in scope at those times.

Does anyone know how to modify request headers dynamically?

I hate to answer my own question, but I found out how to do it. You have to enable the SpiderMiddleware that will populate the referer for responses. See the documentation for scrapy.contrib.spidermiddleware.referer.RefererMiddleware.

In short, you need to add this middleware to your project's settings file.

SPIDER_MIDDLEWARES = {
    'scrapy.contrib.spidermiddleware.referer.RefererMiddleware': True,
}
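Note that the `scrapy.contrib` path above is from older Scrapy releases; in Scrapy 1.0 and later the module moved, the middleware is enabled by default, and the setting's value is the middleware's order number rather than `True`. A sketch of the equivalent modern setting:

```python
# settings.py -- modern module path (the middleware moved out of
# scrapy.contrib in Scrapy 1.0); 700 is its default order.
SPIDER_MIDDLEWARES = {
    "scrapy.spidermiddlewares.referer.RefererMiddleware": 700,
}
```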

Then in your response parsing method you can use response.request.headers.get('Referer', None) to get the referer.

If you don't understand these middlewares right away, read them again, take a break, and then read them again. I found them to be very confusing.