Using R to mimic "clicking" a download file button on a webpage
There are 2 parts to my question, as I explored 2 methods in this exercise; however, I succeeded with neither. I'd greatly appreciate it if someone could help me out.
[Part 1:]
I am attempting to scrape data from a webpage on the Singapore Stock Exchange, https://www2.sgx.com/derivatives/negotiated-large-trade, which contains data stored in a table. I have some basic knowledge of scraping data using (rvest). However, using the Inspector in Chrome, the HTML hierarchy is much more complex than I expected. I can see that the data I want is hidden under <div class="table-container">, and here's what I've tried:
library(rvest)
library(httr)
library(XML)

SGXurl  <- "https://www2.sgx.com/derivatives/negotiated-large-trade"
SGXdata <- read_html(SGXurl)
html_nodes(SGXdata, ".table-container")
However, nothing is picked up by the code, and I doubt I'm using it correctly.
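A quick diagnostic (a sketch using the same selector as above) shows the node set coming back empty, i.e. the table is not in the raw page source at all:

# Sketch: count how many nodes the selector matches in the
# server-rendered HTML; an empty node set (length 0) means the table
# is injected client-side by JavaScript rather than shipped in the page.
library(rvest)

page <- read_html("https://www2.sgx.com/derivatives/negotiated-large-trade")
length(html_nodes(page, ".table-container"))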
[Part 2:]
I realized that there's a small "Download" button on the page which downloads exactly the data file I want in .csv format. So I thought of writing some code to mimic the download button, and I found this question, Using R to "click" a download file button on a webpage, but I'm unable to get it to work even with some modifications to that code.
There are a few filters on the webpage; mostly I'm interested in downloading data for a particular business day while leaving the other filters blank, so I've tried writing the following function:
library(httr)
library(rvest)
library(purrr)
library(dplyr)

crawlSGXdata <- function(date) {
  # POST to the page and stream the response body to a file on disk
  resfile <- POST("https://www2.sgx.com/derivatives/negotiated-large-trade",
                  body = NULL,    # the "date" filter should eventually go here
                  encode = "form",
                  write_disk("SGXdata.csv", overwrite = TRUE))
  res <- read.csv("SGXdata.csv")
  return(res)
}
I intended to put the function input "date" into the "body" argument, but I was unable to figure out how to do that, so I started off with "body = NULL", assuming it would not do any filtering. However, the result is still unsatisfactory. The downloaded file is basically empty, apart from the following error:
Request Rejected
The requested URL was rejected. Please consult with your administrator.
Your support ID is: 16783946804070790400
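For reference, the usual httr pattern for sending form fields in a POST body is sketched below. The field names are hypothetical, and this particular page does not actually accept such a form post; as the answer below explains, the data comes from a separate API instead.

# Sketch only: the generic httr pattern for posting form fields.
# "businessdatestart"/"businessdateend" are hypothetical names here;
# this page rejects such posts, so this shows the pattern, not a fix.
library(httr)

resp <- POST("https://www2.sgx.com/derivatives/negotiated-large-trade",
             body = list(businessdatestart = "20190708",
                         businessdateend   = "20190708"),
             encode = "form")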
The content is loaded dynamically from an API call returning JSON. You can find this in the network tab via dev tools.
The following returns that content. I find the total number of pages of results, then loop, combining the dataframe returned by each call into one final dataframe containing all the results.
library(jsonlite)

url <- 'https://api.sgx.com/negotiatedlargetrades/v1.0?order=asc&orderby=contractcode&category=futures&businessdatestart=20190708&businessdateend=20190708&pagestart=0&pageSize=250'
r <- jsonlite::fromJSON(url)
num_pages <- r$meta$totalPages   # total number of result pages
df <- r$data                     # page 0, already fetched above

url2 <- 'https://api.sgx.com/negotiatedlargetrades/v1.0?order=asc&orderby=contractcode&category=futures&businessdatestart=20190708&businessdateend=20190708&pagestart=placeholder&pageSize=250'

if (num_pages > 1) {
  # pagestart is zero-indexed, so the remaining pages run 1 .. num_pages - 1
  for (i in seq(1, num_pages - 1)) {
    newUrl <- gsub("placeholder", i, url2)
    newdf  <- jsonlite::fromJSON(newUrl)$data
    df <- rbind(df, newdf)
  }
}
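Since the original goal was a function keyed on a particular business day, the same approach can be parameterized. A minimal sketch, assuming the date is supplied as a "yyyymmdd" string matching the API's businessdatestart/businessdateend format:

# Sketch: wrap the paginated API calls in a function of the business date.
library(jsonlite)

crawlSGXdata <- function(date) {
  url_tmpl <- paste0("https://api.sgx.com/negotiatedlargetrades/v1.0",
                     "?order=asc&orderby=contractcode&category=futures",
                     "&businessdatestart=", date,
                     "&businessdateend=",   date,
                     "&pagestart=%d&pageSize=250")
  first <- jsonlite::fromJSON(sprintf(url_tmpl, 0))
  dfs <- list(first$data)
  if (first$meta$totalPages > 1) {
    for (i in seq(1, first$meta$totalPages - 1)) {
      dfs[[i + 1]] <- jsonlite::fromJSON(sprintf(url_tmpl, i))$data
    }
  }
  do.call(rbind, dfs)   # bind all pages once at the end
}

df <- crawlSGXdata("20190708")

Collecting each page in a list and binding once avoids repeatedly copying the growing dataframe with rbind() inside the loop.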