How do I add headers to Scrapy CrawlSpider requests?
I'm working with the CrawlSpider class to crawl a website and I would like to modify the headers that are sent in each request. Specifically, I would like to add the referer to the request.
Per this question, I checked
response.request.headers.get('Referer', None)
in my response parsing function, and the Referer
header is not present. I assume that means the Referer is not being submitted in the request (unless the website doesn't return it; I'm not sure about that).
I haven't been able to figure out how to modify the headers of a request. Again, my spider is derived from CrawlSpider. Overriding CrawlSpider's _requests_to_follow
or specifying a process_request
callback for a rule will not work because the referer is not in scope at those times.
Does anyone know how to modify request headers dynamically?
I hate to answer my own question, but I found out how to do it. You have to enable the SpiderMiddleware that will populate the referer for responses. See the documentation for scrapy.contrib.spidermiddleware.referer.RefererMiddleware
In short, you need to add this middleware to your project's settings file.
SPIDER_MIDDLEWARES = {
    'scrapy.contrib.spidermiddleware.referer.RefererMiddleware': True,
}
Then in your response parsing method you can use response.request.headers.get('Referer', None)
to get the referer.
If you don't understand these middlewares right away, read the documentation again, take a break, and then read it again. I found them to be very confusing.