Executing a JavaScript function with Scrapy in Python

Problem description:

I am very new to Scrapy. I am scraping a website that has some anchor tags whose href attributes contain a JavaScript SubmitForm function. When I click such a link, a page opens from which I need to fetch data. I used XPath to find the href of a particular anchor tag, but I am unable to execute that href attribute because it contains a JavaScript function. Can anyone tell me how to execute the JavaScript submit function of an anchor tag in Scrapy? My HTML code is:

   <table class="Tbl" cellspacing="2" cellpadding="0" border="0">
     <tbody>
        <tr>
           <td class="TblOddRow">
             <table cellspacing="0" cellpadding="0" border="0">
               <tbody>
                 <tr>
                   <td valign="middle" nowrap="">
                        <a class="Page" alt="Click to view job description" title="Click to view job description" href="javascript:sysSubmitForm('frmSR1');">Accountant&nbsp;</a>
                   </td>
                 </tr>
               </tbody>
             </table>
           </td>
        </tr>
      </tbody>
  </table>                      

The spider code is:

from scrapy.spider import BaseSpider
from scrapy.http import FormRequest
from scrapy.selector import HtmlXPathSelector

class MountSinaiSpider(BaseSpider):
    name = "mountsinai"
    allowed_domains = ["mountsinaicss.igreentree.com"]
    start_urls = [
        "https://mountsinaicss.igreentree.com/css_external/CSSPage_SearchAndBrowseJobs.ASP?T=20120517011617&",
    ]

    def parse(self, response):
        return [FormRequest.from_response(response,
                                          formdata={ "Type":"CSS","SRCH":"Search&nbsp;Jobs","InitURL":"CSSPage_SearchAndBrowseJobs.ASP","RetColsQS":"Requisition.Key¤Requisition.JobTitle¤Requisition.fk_Code_Full_Part¤[Requisition.fk_Code_Full_Part]OLD.Description(sysfk_Code_Full_PartDesc)¤Requisition.fk_Code_Location¤[Requisition.fk_Code_Location]OLD.Description(sysfk_Code_LocationDesc)¤Requisition.fk_Code_Dept¤[Requisition.fk_Code_Dept]OLD.Description(sysfk_Code_DeptDesc)¤Requisition.Req¤","RetColsGR":"Requisition.Key¤Requisition.JobTitle¤Requisition.fk_Code_Full_Part¤[Requisition.fk_Code_Full_Part]OLD.Description(sysfk_Code_Full_PartDesc)¤Requisition.fk_Code_Location¤[Requisition.fk_Code_Location]OLD.Description(sysfk_Code_LocationDesc)¤Requisition.fk_Code_Dept¤[Requisition.fk_Code_Dept]OLD.Description(sysfk_Code_DeptDesc)¤Requisition.Req¤","ResultSort":"" },
                                          callback=self.parse_main_list)]

    def parse_main_list(self, response):
        hxs = HtmlXPathSelector(response)
        firstpage_urls = hxs.select("//table[@class='Tbl']/tr/td/table/tr/td")
        for link in firstpage_urls:
            hrefs = link.select('a/@href').extract()


Scrapy does not let you "execute javascript Submit functions". For that you would have to use Splash or a similar alternative that supports interaction with JavaScript. Scrapy works only with the underlying HTML.


What you can do to solve the issue with Scrapy is to figure out how the JavaScript code builds the request, and reproduce that request with Scrapy.


To figure out what the JavaScript code does, you have two options:


  • Find the definition of sysSubmitForm() in the page's JavaScript code, and work out what it does by reading the JavaScript yourself.


  • Use the Network tab of your web browser's developer tools to watch what request is sent to the server when you trigger that JavaScript code, and inspect the request to figure out how to build a similar request yourself.
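Combining both approaches: the href in the question already names the form (`frmSR1`). Below is a minimal standard-library sketch of extracting the form name from the `javascript:` pseudo-URL and building an equivalent POST request by hand; the target URL and form fields are hypothetical placeholders that you would copy from the Network tab (in a spider you would pass the same URL and formdata to Scrapy's FormRequest instead):

```python
import re
import urllib.request
from urllib.parse import urlencode

href = "javascript:sysSubmitForm('frmSR1');"

# Pull the form name out of the javascript: pseudo-URL.
match = re.search(r"sysSubmitForm\('([^']+)'\)", href)
form_name = match.group(1)  # "frmSR1"

# Placeholder URL and field names -- replace them with the real values
# observed in the browser's Network tab when the link is clicked.
payload = urlencode({"FormName": form_name, "Type": "CSS"})
request = urllib.request.Request(
    "https://mountsinaicss.igreentree.com/css_external/CSSPage_JobDetail.ASP",
    data=payload.encode("ascii"),
    method="POST",
)
```

The same extracted form name and observed fields can then be fed to `FormRequest(url, formdata=..., callback=...)` inside `parse_main_list` so that Scrapy submits the form the JavaScript would have submitted.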