Web scraping with R: "JavaScript is disabled" message
Question:
Hello, I am attempting to web scrape in R, and this one particular website is giving me a lot of trouble. I wish to extract the table from here: https://www.nationsreportcard.gov/profiles/stateprofile?chort=1&sub=MAT&sj=&sfj=NP&st=MN&year=2017
What I have tried
Code:
library(rvest)  # also provides the %>% pipe

url <- 'https://www.nationsreportcard.gov/profiles/stateprofile?chort=1&sub=MAT&sj=&sfj=NP&st=MN&year=2017'
webpage <- read_html(url)
data <- webpage %>% html_nodes('p') %>% html_text()
data
Output:
[1] "\r\n The page could not be loaded. This web site currently does not fully support browsers with \"JavaScript\" disabled. Please note that if you choose to continue without enabling \"JavaScript\" certain functionalities on this website may not be available.\r\n"
Answer
In this case, you may want to use RSelenium with Docker to scrape a JavaScript-rendered website:
library(RSelenium)
library(rvest)

# Start a Selenium server in a Docker container
system('docker run -d -p 4445:4444 selenium/standalone-firefox')

remDr <- RSelenium::remoteDriver(
  remoteServerAddr = "localhost",
  port = 4445L,
  browserName = "firefox"
)

# Start the remote driver
remDr$open()

url <- 'https://www.nationsreportcard.gov/profiles/stateprofile?chort=1&sub=MAT&sj=&sfj=NP&st=MN&year=2017'
remDr$navigate(url)

# Parse the fully rendered page source with rvest
doc <- read_html(remDr$getPageSource()[[1]])

table <- doc %>%
  html_nodes(xpath = '//*[@id="gridAvergeScore"]/table') %>%
  html_table(fill = TRUE)

head(table[[1]])
## JURISDICTION AVERAGE SCORE (0 - 500) AVERAGE SCORE (0 - 500) ACHIEVEMENT LEVEL PERCENTAGES ACHIEVEMENT LEVEL PERCENTAGES
## 1 JURISDICTION Score Difference from National public (NP) At or above Basic At or above Proficient
## 2 Massachusetts 249 10 87 53
## 3 Minnesota 249 10 86 53
## 4 DoDEA 249 9 91 51
## 5 Virginia 248 9 87 50
## 6 New Jersey 248 9 87 50
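As a housekeeping note (this is an addition, not part of the original answer), you may want to close the browser session and stop the Selenium container when you are done, so the Docker container does not keep running in the background. A minimal sketch, assuming the container started above is the only `selenium/standalone-firefox` container running:

```r
# Close the remote browser session when finished
remDr$close()

# Stop the Selenium container started earlier (assumes it is the only
# container based on the selenium/standalone-firefox image; adjust the
# filter if you run several)
system('docker stop $(docker ps -q --filter ancestor=selenium/standalone-firefox)')
```

Without this cleanup, the Firefox session and the container keep consuming resources until Docker is stopped manually.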