Pagination with scrapy
I'm trying to crawl this website: http://www.aido.com/eshop/cl_2-c_189-p_185/stationery/pens.html
I can get all the products on this page, but how do I issue the request for the "View More" link at the bottom of the page?
My code so far is:
rules = (
    Rule(SgmlLinkExtractor(restrict_xpaths='//li[@class="normalLeft"]/div/a', unique=True)),
    Rule(SgmlLinkExtractor(restrict_xpaths='//div[@id="topParentChilds"]/div/div[@class="clm2"]/a', unique=True)),
    Rule(SgmlLinkExtractor(restrict_xpaths='//p[@class="proHead"]/a', unique=True)),
    Rule(SgmlLinkExtractor(allow=('http://[^/]+/[^/]+/[^/]+/[^/]+$', ),
                           deny=('/about-us/about-us/contact-us', './music.html', ),
                           unique=True),
         callback='parse_item'),
)
Any help?
First of all, you should take a look at this thread on how to deal with scraping ajax dynamically loaded content: Can scrapy be used to scrape dynamic content from websites that are using AJAX?
So, clicking the "View More" button fires an XHR request:
http://www.aido.com/eshop/faces/tiles/category.jsp?q=&categoryID=189&catalogueID=2&parentCategoryID=185&viewType=grid&bnm=&atmSize=&format=&gender=&ageRange=&actor=&director=&author=®ion=&compProductType=&compOperatingSystem=&compScreenSize=&compCpuSpeed=&compRam=&compGraphicProcessor=&compDedicatedGraphicMemory=&mobProductType=&mobOperatingSystem=&mobCameraMegapixels=&mobScreenSize=&mobProcessor=&mobRam=&mobInternalStorage=&elecProductType=&elecFeature=&elecPlaybackFormat=&elecOutput=&elecPlatform=&elecMegaPixels=&elecOpticalZoom=&elecCapacity=&elecDisplaySize=&narrowage=&color=&prc=&k1=&k2=&k3=&k4=&k5=&k6=&k7=&k8=&k9=&k10=&k11=&k12=&startPrize=&endPrize=&newArrival=&entityType=&entityId=&brandId=&brandCmsFlag=&boutiqueID=&nmt=&disc=&rat=&cts=empty&isBoutiqueSoldOut=undefined&sort=12&isAjax=true&hstart=24&targetDIV=searchResultDisplay
which returns the text/html of the next 24 items. Note the hstart=24 parameter: the first time you click "View More" it equals 24, the second time 48, and so on. This should be your lifesaver.
Now, you should simulate these requests in your spider. The recommended way to do this is to instantiate scrapy's Request object, providing a callback where you'll extract the data.
Hope that helps.