Need help web scraping a table with BeautifulSoup and Selenium WebDriver

Question:


So I am working on trying to web scrape https://data.bls.gov/cgi-bin/surveymost?bls and was able to figure out how to crawl through the clicks to get to a table.


The selection that I am practicing on is: select the checkbox associated with "Employment Cost Index (ECI) Civilian (Unadjusted) - CIU1010000000000A" under Compensation, then select "Retrieve data".


Once those two steps are processed, a table is displayed. This is the table I am trying to scrape.


Below is the code that I have as of right now.


Note that you have to put your own path to your browser driver where I have put < browser driver >.

from selenium import webdriver
from selenium.webdriver.common.by import By

url = "https://data.bls.gov/cgi-bin/surveymost?bls"
ChromeSource = r"<browser driver>"

# Open up a Chrome browser and navigate to the web page.
options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')
options.add_argument('--headless')  # will run without opening a browser window
driver = webdriver.Chrome(ChromeSource, options=options)
driver.get(url)

# Tick the ECI checkbox, then submit the form.
driver.find_element(By.XPATH, "//input[@type='checkbox' and @value='CIU1010000000000A']").click()
driver.find_element(By.XPATH, "//input[@type='Submit' and @value='Retrieve data']").click()


def myTEST(i):
    # Print the text of every element whose id is col<i>.
    xpath = f'//*[@id="col{i}"]'
    cells = driver.find_elements(By.XPATH, xpath)
    for cell in cells:
        print(cell.text)


myTEST(2)

# Clean up (close the browser once the task is completed).
driver.quit()


Right now this only looks at the headers. I would like to get the table content as well.


If I make i = 0, it produces "Year"; i = 1 produces "Period". But if I select i = 2, I get two values, "Estimated Value" and "Standard Error", which share the same col2 id.


I tried to think of a way to work around this and can't seem to get anything that I have researched to work.


In essence, it would be better to start at the point where I am done clicking and am at the table of interest, and then look at the xpath of the header row and pull in the text for all of the sub-<th>s.

<tr>
  <th id="col0">Year</th>
  <th id="col1">Period</th>
  <th id="col2">Estimated Value</th>
  <th id="col2">Standard Error</th>
</tr>


I am not sure how to do that. I also tried to loop through the i values, but obviously two headers sharing the same id causes an issue.
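One way to sidestep the duplicate col2 ids (my suggestion, not something from the original post) is to stop selecting by id altogether: hand the rendered page to BeautifulSoup and walk the <th> and <td> cells in document order. A minimal sketch, using a static snippet in place of driver.page_source:

```python
from bs4 import BeautifulSoup

# Static stand-in for driver.page_source; on the live page you would pass
# BeautifulSoup(driver.page_source, "html.parser") instead.
html = """
<table>
  <tr><th id="col0">Year</th><th id="col1">Period</th>
      <th id="col2">Estimated Value</th><th id="col2">Standard Error</th></tr>
  <tr><td>2019</td><td>Q4</td><td>139.0</td><td>0.2</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Headers in document order -- duplicate ids no longer matter.
headers = [th.get_text(strip=True) for th in soup.find_all("th")]

# One list of cell texts per data row (rows containing only <th> are skipped).
rows = [[td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in soup.find_all("tr") if tr.find("td")]

print(headers)  # ['Year', 'Period', 'Estimated Value', 'Standard Error']
print(rows)     # [['2019', 'Q4', '139.0', '0.2']]
```

The row values in the snippet are made up for illustration; the parsing logic is what carries over to the real page.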


Once I am able to get the header, I want to get the contents. I could use some insight on whether I am on the right path, overthinking it, or whether there is a simpler way to do all of this. I am learning, and this is my first attempt at using the Selenium library for clicks. I just want to get it working so I can try it again on a different table and make it as automated or reusable (with tweaking) as possible.


Actually, you don't need selenium. You can just track the POST form data and apply the same within your own POST request.


Then you can load the table easily using Pandas.

import requests
import pandas as pd

# Form data captured from the browser's Network Monitor.
data = {
    "series_id": "CIU1010000000000A",
    "survey": "bls"
}


def main(url):
    r = requests.post(url, data=data)
    # read_html returns one DataFrame per <table>; index 1 is the data table.
    df = pd.read_html(r.content)[1]
    print(df)


main("https://data.bls.gov/cgi-bin/surveymost")

Explanation:

  • Open the site.
  • Select Employment Cost Index (ECI) Civilian (Unadjusted) - CIU1010000000000A.
  • Open your browser's Developer Tools and navigate to the Network Monitor section, e.g. press Ctrl + Shift + E (Command + Option + E on a Mac).
  • Now you will find the POST request that was made.

Navigate to the Params tab.


Now you can make the POST request. And since the table is present in the HTML source and is not loaded via JavaScript, you can parse it with bs4 or read it in a nice format using pandas.read_html().
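To illustrate why the answer above indexes the result with [1] (using a made-up two-table page, not the actual BLS response): pandas.read_html extracts every <table> it finds in the raw HTML and returns a list of DataFrames, in document order.

```python
from io import StringIO

import pandas as pd

# A made-up two-table page standing in for the BLS response HTML.
html = """
<table><tr><th>Series</th></tr><tr><td>CIU1010000000000A</td></tr></table>
<table><tr><th>Year</th><th>Period</th></tr>
       <tr><td>2019</td><td>Q4</td></tr></table>
"""

# read_html returns one DataFrame per <table>, in document order.
tables = pd.read_html(StringIO(html))
df = tables[1]  # the second table holds the data of interest

print(len(tables))       # 2
print(list(df.columns))  # ['Year', 'Period']
```

If you are unsure which index to use on the real page, print len(tables) and inspect each DataFrame until you find the one you want.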


Note: you can read the table as long as it's not loaded via JavaScript. Otherwise, you can try to track the XHR request (check my previous answer), or you can use selenium or requests_html to render the JS, since requests is an HTTP library and can't render it for you.