Web scraping with R: "JavaScript is disabled" message
Question:
Hello, I am attempting to web scrape in R, and this one particular website is giving me a lot of trouble. I wish to extract the table from here: https://www.nationsreportcard.gov/profiles/stateprofile?chort=1&sub=MAT&sj=&sfj=NP&st=MN&year=2017
What I have tried
Code:
library(rvest)  # also provides the %>% pipe

url <- 'https://www.nationsreportcard.gov/profiles/stateprofile?chort=1&sub=MAT&sj=&sfj=NP&st=MN&year=2017'
webpage <- read_html(url)
data <- webpage %>% html_nodes('p') %>% html_text()
data
Output:
[1] "\r\n The page could not be loaded. This web site currently does not fully support browsers with \"JavaScript\" disabled. Please note that if you choose to continue without enabling \"JavaScript\" certain functionalities on this website may not be available.\r\n"
Answer
In this case, you may want to use RSelenium with Docker to scrape a JavaScript-rendered website:
library(RSelenium)
library(rvest)

# Start a Selenium server in a Docker container
system('docker run -d -p 4445:4444 selenium/standalone-firefox')

remDr <- RSelenium::remoteDriver(
  remoteServerAddr = "localhost",
  port = 4445L,
  browserName = "firefox"
)

# Start the remote driver
remDr$open()

url <- 'https://www.nationsreportcard.gov/profiles/stateprofile?chort=1&sub=MAT&sj=&sfj=NP&st=MN&year=2017'
remDr$navigate(url)

# Parse the fully rendered page source with rvest
doc <- read_html(remDr$getPageSource()[[1]])

table <- doc %>%
  html_nodes(xpath = '//*[@id="gridAvergeScore"]/table') %>%
  html_table(fill = TRUE)

head(table[[1]])
## JURISDICTION AVERAGE SCORE (0 - 500) AVERAGE SCORE (0 - 500) ACHIEVEMENT LEVEL PERCENTAGES ACHIEVEMENT LEVEL PERCENTAGES
## 1 JURISDICTION Score Difference from National public (NP) At or above Basic At or above Proficient
## 2 Massachusetts 249 10 87 53
## 3 Minnesota 249 10 86 53
## 4 DoDEA 249 9 91 51
## 5 Virginia 248 9 87 50
## 6 New Jersey 248 9 87 50
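As a housekeeping note (this is an addition, not part of the original answer), you may want to close the browser session and stop the Selenium container when you are done, so the Docker container does not keep running in the background. A minimal sketch, assuming the container started above is the only `selenium/standalone-firefox` container running:

```r
# Close the remote browser session when finished
remDr$close()

# Stop the Selenium container started earlier (assumes it is the only
# container based on the selenium/standalone-firefox image; adjust the
# filter if you run several)
system('docker stop $(docker ps -q --filter ancestor=selenium/standalone-firefox)')
```

Without this cleanup, the Firefox session and the container keep consuming resources until Docker is stopped manually.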