Executing a JavaScript form submit with Scrapy in Python

Problem description:

I am scraping a site using the Scrapy framework and having trouble clicking a javascript link that opens another page.

I can identify the code on the page as:

<a class="Page" alt="Click to view job description" title="Click to view job description" href="javascript:sysSubmitForm('frmSR1');">Accountant&nbsp;</a>

Can anyone suggest how to execute that javascript in Scrapy so that I can reach the other page and fetch data from it?

Thanks in advance.
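Since the javascript: link does nothing more than call sysSubmitForm('frmSR1'), i.e. submit the form named frmSR1, the request can sometimes be reproduced without running any JavaScript at all by using Scrapy's FormRequest.from_response. The sketch below is a minimal illustration under that assumption; the spider name, start URL, and callback names are placeholders, and if the script fills in hidden fields before submitting, those values would have to be supplied via the formdata argument:

from scrapy.http import FormRequest
from scrapy.spider import BaseSpider

class JobSpider(BaseSpider):
    name = "job_spider"
    # Placeholder URL for the search-results page that contains the frmSR1 form
    start_urls = ["http://www.example.com/search-results"]

    def parse(self, response):
        # Reproduce what sysSubmitForm('frmSR1') does: submit the form named frmSR1.
        # Values set by the javascript would need to be passed via formdata={...}.
        yield FormRequest.from_response(response,
                                        formname="frmSR1",
                                        callback=self.parse_job)

    def parse_job(self, response):
        # response is the job-description page the link would have opened
        self.log("Fetched %s" % response.url)

When the script does more than a plain submit and the page really needs a browser, the Selenium-based approach below is the alternative.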

Check out the snippet below on how to use Scrapy with Selenium. Crawling will be slower because you aren't just downloading the HTML, but you get full access to the DOM.

Note: I have copy-pasted this snippet because the links previously provided no longer work.

# Snippet imported from snippets.scrapy.org (which no longer works)

import time

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from scrapy.item import Item

from selenium import selenium

class SeleniumSpider(CrawlSpider):
    name = "SeleniumSpider"
    start_urls = ["http://www.domain.com"]

    rules = (
        Rule(SgmlLinkExtractor(allow=('\.html', )),
             callback='parse_page', follow=True),
    )

    def __init__(self):
        CrawlSpider.__init__(self)
        self.verificationErrors = []
        # Selenium RC session; expects a Selenium server listening on localhost:4444
        self.selenium = selenium("localhost", 4444, "*chrome", "http://www.domain.com")
        self.selenium.start()

    def __del__(self):
        # Stop the Selenium RC session when the spider goes away
        self.selenium.stop()
        print self.verificationErrors

    def parse_page(self, response):
        item = Item()  # placeholder item; a real spider would declare its own fields

        hxs = HtmlXPathSelector(response)
        #Do some XPath selection with Scrapy
        hxs.select('//div').extract()

        sel = self.selenium
        sel.open(response.url)

        #Wait for javascript to load in Selenium
        time.sleep(2.5)

        #Do some crawling of javascript created content with Selenium
        sel.get_text("//div")
        yield item
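The snippet above relies on the retired Selenium RC client (from selenium import selenium), which needs a standalone Selenium server running on port 4444. For reference, here is a rough, non-authoritative sketch of the same idea using the WebDriver API instead; the spider class, browser choice, and XPath are placeholders, and newer Selenium releases spell the lookup as driver.find_element(By.XPATH, ...):

import time

from scrapy.spider import BaseSpider
from selenium import webdriver

class WebDriverSpider(BaseSpider):
    name = "webdriver_spider"
    start_urls = ["http://www.domain.com"]

    def __init__(self):
        BaseSpider.__init__(self)
        self.driver = webdriver.Firefox()  # any installed WebDriver browser works

    def parse(self, response):
        # Load the page in a real browser so its javascript actually runs
        self.driver.get(response.url)
        time.sleep(2.5)  # crude wait; an explicit WebDriverWait is more robust

        # Click the link that calls sysSubmitForm('frmSR1')
        self.driver.find_element_by_xpath("//a[@class='Page']").click()
        time.sleep(2.5)

        # The browser now holds the job-description page
        self.log("Fetched %s" % self.driver.current_url)

    def closed(self, reason):
        # Called when the spider finishes; shut the browser down
        self.driver.quit()

Either way the trade-off is the same as the answer describes: a real browser executes the javascript for you, at the cost of noticeably slower crawling.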