Scrapy using all CPU cores in the system concurrently

Problem description:

I am running Scrapy through its internal API and everything has been fine so far. But I noticed that it is not fully using the concurrency of 16 configured in the settings. I have set the delay to 0 and tried everything else I can. Looking at the HTTP requests being sent, it is clear that Scrapy is not downloading 16 sites at all points in time; sometimes it is downloading only 3 to 4 links, even though the queue is not empty at that moment.
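
The post does not show the actual setup, but a minimal sketch of driving Scrapy through its internal API (CrawlerProcess) with the settings described above might look like this; the spider class and URL are placeholders, not from the original post:

```python
from scrapy import Spider
from scrapy.crawler import CrawlerProcess


class MySpider(Spider):
    # Placeholder spider; the real spider is not shown in the post.
    name = "my_spider"
    start_urls = ["https://example.com"]

    def parse(self, response):
        yield {"url": response.url}


# Run through Scrapy's internal API with the settings from the question:
# 16 concurrent requests and no download delay.
process = CrawlerProcess(settings={
    "CONCURRENT_REQUESTS": 16,
    "DOWNLOAD_DELAY": 0,
})
process.crawl(MySpider)
process.start()  # blocks until the crawl finishes; one process, one reactor
```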

When I checked the CPU usage, I found that out of the 2 cores, one is at 100% and the other is mostly idle.

That is when I learned that Twisted, the library Scrapy is built on top of, is single-threaded, which is why Scrapy only uses a single core.

Is there any workaround to convince Scrapy to use all the cores?
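
One workaround (not from the original post) is to run one crawl process per CPU core yourself, so that each process gets its own single-threaded Twisted reactor. A rough sketch using Python's multiprocessing follows; the spider and URL lists are placeholders:

```python
# Not from the original post: one crawl process per core, each with its own
# single-threaded Twisted reactor, so a 2-core machine can use both cores.
from multiprocessing import Process

from scrapy import Spider
from scrapy.crawler import CrawlerProcess


class MySpider(Spider):
    # Placeholder spider; the real spider is not shown in the post.
    name = "my_spider"

    def parse(self, response):
        yield {"url": response.url}


def run_spider(urls):
    # Each child process builds its own CrawlerProcess, so its reactor and
    # downloader run independently of the other workers.
    process = CrawlerProcess(settings={
        "CONCURRENT_REQUESTS": 16,
        "DOWNLOAD_DELAY": 0,
    })
    process.crawl(MySpider, start_urls=urls)
    process.start()


if __name__ == "__main__":
    # Split the work into one chunk per core (the question mentions 2 cores);
    # the URLs are placeholders.
    chunks = [
        ["https://example.com/a", "https://example.com/b"],
        ["https://example.com/c", "https://example.com/d"],
    ]
    workers = [Process(target=run_spider, args=(chunk,)) for chunk in chunks]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```

The main drawback is that the URL queue has to be partitioned across the workers up front.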

Another option is to run your spiders using Scrapyd, which lets you run multiple processes concurrently. See the max_proc and max_proc_per_cpu options in the documentation. If you don't want to solve your problem programmatically, this could be the way to go.
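
For illustration, a scrapyd.conf along these lines (the values here are examples, not from the original answer) controls how many crawl processes Scrapyd will run in parallel:

```ini
# scrapyd.conf -- example values, not from the original answer
[scrapyd]
bind_address = 127.0.0.1
http_port = 6800
# 0 means: derive the limit from max_proc_per_cpu * number of CPUs
max_proc = 0
# allow up to 4 Scrapy processes per core, i.e. 8 on the 2-core box above
max_proc_per_cpu = 4
```

Each spider run scheduled through Scrapyd is launched as its own OS process, so separate runs can land on separate cores.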