How can I fetch data from Wikipedia pages using the WikipediR package in R?

Problem description:

I need to fetch a certain part of the data from multiple Wikipedia pages. How can I do that using the WikipediR package? Or is there some other, better option? To be precise, I only need the part marked below from all of the pages.

How can I get that? Any help would be appreciated.
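
For reference, WikipediR itself can return a page's rendered HTML through its page_content() function. The sketch below is a minimal, untested example (the page name is only a placeholder, and it assumes the response follows the MediaWiki parse-API layout, with the HTML stored under parse$text); once you have the HTML, it can be mined with rvest as in the answer below.

library(WikipediR)
library(rvest)

## Fetch the rendered HTML of one page (the page_name here is a placeholder)
pg <- page_content("en", "wikipedia",
                   page_name = "R (programming language)",
                   as_wikitext = FALSE)

## Assuming the MediaWiki parse-API layout, the HTML sits under parse$text
doc <- read_html(pg$parse$text[["*"]])

## From here, select the part of the page you need, e.g. all tables
html_nodes(doc, "table")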

Can you be a little more specific about what you want? Here's a simple way to import data from the web, and specifically from Wikipedia.

library(rvest)    
scotusURL <- "https://en.wikipedia.org/wiki/List_of_Justices_of_the_Supreme_Court_of_the_United_States"

## ********************
## Option 1: Grab the tables from the page and use the html_table function to extract the tables you're interested in.

## Read the page once and collect every <table> node
temp <- scotusURL %>%
  read_html() %>%
  html_nodes("table")

html_table(temp[1]) ## Just the "legend" table
html_table(temp[2]) ## THE MAIN TABLE
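
If you want to keep working with the main table, note that html_table() on a node set returns a list of data frames, so (assuming the table parses cleanly) something like this pulls it out:

## Keep the main table as a data frame and peek at it
justices <- html_table(temp[2])[[1]]
head(justices)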

Now, if you want to import data from multiple pages that have essentially the same structure and differ only by, say, a page number, try this method.

library(RCurl); library(XML)

pageNum <- 1:10
url <- paste0("http://www.totaljobs.com/JobSearch/Results.aspx?Keywords=Leadership&LTxt=&Radius=10&RateType=0&JobType1=CompanyType=&PageNum=")
urls <- paste0(url, pageNum)

## Download each page, then parse the HTML so nodes can be queried later
allPages <- lapply(urls, function(x) getURLContent(x)[[1]])
xmlDocs <- lapply(allPages, function(x) XML::htmlParse(x))
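
The same pattern carries over to a set of Wikipedia URLs. As a rough, hypothetical extraction step (the XPath is only a placeholder for whatever part of the page you actually need):

## Hypothetical follow-up: pull the text of every table cell from each
## parsed document; swap the XPath for one that targets your section
tableCells <- lapply(xmlDocs, function(doc)
  XML::xpathSApply(doc, "//table//td", XML::xmlValue))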