如何在Python中使用Beautifulsoup抓取结构不好的html表?

问题描述:

本网站 https://itportal.ogauthority.co.uk/information/well_data/lithostratigraphy_hierarchy/rptLithoStrat_1_12.html 似乎有一个组织不好的html表.表格单元格的唯一标识符是每个tr标签内的宽度.我要抓取全部60页的信息.如何找到适当地刮擦每一行表格的方法?我知道标头的大小为10列,但是由于对于某些tr标签,我有5个td标签,而对于另一些我具有或多或少的td标签,根据其准确地抓取数据并不容易列.

This website https://itportal.ogauthority.co.uk/information/well_data/lithostratigraphy_hierarchy/rptLithoStrat_1Page2.html seems have a not well organized html table. the only identifier of table cells are width inside each tr tag. I want to scrape the information of all 60 pages. How I can find a way to scrape each row of tables appropriately? I know the size of header is 10 columns but since for some tr tags, I have 5 td tags and for some other I have more or less td tags, it's not easy to exactly scrape the data according to its column.

在这里您可以看到一部分代码,该代码仅提取与一行相关的数据,而没有保留空单元格的空值.

Here you can see a part of code which is extracting just data related to one row but not with keeping the null values for empty cells.

soup = BeautifulSoup(page.content, 'lxml') # Parse the HTML as a string
table = soup.find_all('table')[0] # Grab the first table
new_table = pd.DataFrame(columns=range(0,10), index = [0]) # I know the size
row_marker = 0
for row in table.find_all('tr'):
     column_marker = 0
     columns = row.find_all('td')
     for column in columns:
           new_table.iat[row_marker,column_marker] = column.get_text()
           column_marker += 1

这是我从这段代码中得到的输出(将所有值连续放入行中,它们之间没有任何间隙):

It's the output which I have from this code (putting all values in a row without any gaps between them):

     0     1     2                  3     4   5    6    7    8    9  

0  62.00    PACL  Palaeocene Claystones  SWAP  NaN  NaN  NaN  NaN  NaN

但是实际输出应该是这样的:

but the real output should be something like this:

   0        1    2   3                        4   5    6    7    8    9  

0  62.00   NaN NaN  PACL  Palaeocene Claystones  NaN  NaN  NaN  NaN  SWAP

我使用注释中提到的方法(使用宽度)来确定数据中的空值.这是Python代码:

I've used the method I mentioned in the comments (using width) to determine the null values in the data. Here's the Python code:

import requests                                                                                                                                                                                                                  
import bs4                                                                                                                                                                                                                       

URL = 'https://itportal.ogauthority.co.uk/information/well_data/lithostratigraphy_hierarchy/rptLithoStrat_1Page2.html'                                                                                                           

response = requests.get(URL)                                                                                                                                                                                                     
soup = bs4.BeautifulSoup(response.text, 'lxml')                                                                                                                                                                                  

tables = soup.find_all('table')                                                                                                                                                                                                  
count = 0                                                                                                                                                                                                                        
cells_count = 0                                                                                                                                                                                                                  

for table in tables:                                                                                                                                                                                                             
        count +=1                                                                                                                                                                                                                
        if count >2:                                                                                                                                                                                                             
                row = table.tr                                                                                                                                                                                                   
                cells = row.find_all('td')                                                                                                                                                                                       
                print ''                                                                                                                                                                                                         
                x = 0                                                                                                                                                                                                            
                width_diff = 0                                                                                                                                                                                                   
                cell_text = []                                                                                                                                                                                                   
                for cell in cells:                                                                                                                                                                                               
                        width = cell.get('width')                                                                                                                                                                                
                        if int(width) < 10:                                                                                                                                                                                      
                                continue                                                                                                                                                                                         

                        if width_diff > 0:                                                                                                                                                                                       
                                cell_text.append('NaN ')                                                                                                                                                                         
                                if width_diff > 50:                                                                                                                                                                              
                                        x += 2                                                                                                                                                                                   
                                        cell_text.append('Nan ')                                                                                                                                                                 
                                else:
                                        x += 1
                                width_diff = 0

                        if x == 0 or x == 1 or x == 2 or x == 3 or x == 4 or x == 6:
                                width_range = [35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50]
                        elif x == 5:
                                width_range = [220,221,222,223,224,225,226,227,228,229,230]
                        elif x == 7:
                                width_range = [136]


                        if cell.text:
                                cell_text.append(cell.text.strip() + ' ')
                        else:
                                cell_text.append('NaN ')

                        if int(width) not in width_range:
                                width_diff = int(width) - width_range[-1]
                        x += 1
                        #print x,
                length = len(cell_text)
                for i in range(0, length):
                        print cell_text[i],
                diff = 9 - length
                if diff > 0:
                        for j in range(0, diff):
                                print 'NaN ',

如您所见,我注意到每列都使用了一定的宽度范围.通过将每个像元与其假定的宽度进行比较,我们可以确定它需要占用多少空间.如果宽度差异太大,则意味着它占用了下两个单元格的空间.

As you can see, I've noticed that a certain width range is used in each column. By comparing each cell to its supposed width, we can determine how many spaces it takes. If the difference in width is too great, that means it takes the space of the next two cells.

可能需要进行一些改进,您需要针对所有URL测试脚本,以确保数据绝对干净.

It might need some refining, you'll need to test the script against all URLs to ensure that the data is absolutely clean.

这是运行此代码的示例输出:

Here's a sample output from running this code:

61.00  SED  TERT  WBDS  NaN  Woolwich Beds  GP  NaN  WLDB                                                                                                                                                                        
62.00  NaN  NaN  PACL  NaN  Palaeocene Claystones  NaN  Nan  SWAP                                                                                                                                                                
63.00  NaN  NaN  SMFC  NaN  Shallow Marine Facies  NaN  Nan  SONS                                                                                                                                                                
64.00  NaN  NaN  DMFC  NaN  Deep Marine Facies  NaN  NaN  NaN                                                                                                                                                                    
65.00  NaN  NaN  SLSY  NaN  Selsey Member  GN  NaN  WSXB                                                                                                                                                                         
66.00  NaN  NaN  MFM  NaN  Marsh Farm Member  NaN  NaN  NaN                                                                                                                                                                      
67.00  NaN  NaN  ERNM  NaN  Earnley Member  NaN  NaN  NaN                                                                                                                                                                        
68.00  NaN  NaN  WITT  NaN  Wittering Member  NaN  NaN  NaN                                                                                                                                                                      
69.00  NaN  NaN  WHI  NaN  Whitecliff Beds  GZ  NaN  NaN                                                                                                                                                                         
70.00  NaN  NaN  Nan  WFSM  NaN  Whitecliff Sand Member  NaN  Nan  GN                                                                                                                                                            
71.00  NaN  WESQ  NaN  Nan  Westray Group Equivalent  NL  GW  WESH                                                                                                                                                               
72.00  NaN  WESR  NaN  Nan  Westray Group  NM  GO  CNSB                                                                                                                                                                          
73.00  NaN  NaN  THEF  NaN  Thet Formation  NaN  Nan  MOFI                                                                                                                                                                       
74.00  NaN  NaN  SKAD  NaN  Skade Formation  NB  NaN  NONS                                                                                                                                                                       
75.00  NaN  NORD  NaN  Nan  Nordland  NP  Q  CNSB                                                                                                                                                                                
75.50  NaN  NaN  SWCH  NaN  Swatchway Formation  Q  NaN  MOFI                                                                                                                                                                    
75.60  NaN  NaN  CLPT  NaN  Coal Pit Formation  NaN  NaN  NaN                                                                                                                                                                    
75.70  NaN  NaN  LNGB  NaN  Ling Bank Formation  NaN  NaN  NaN                                                                                                                                                                   
76.00  NaN  NaN  SHKL  NaN  Shackleton Formation  GO  QP  ROCK                                                                                                                                                                   
77.00  NaN  NaN  UGNS  NaN  Upper Tertiary sands  NaN  NM  NONS                                                                                                                                                                  
78.00  NaN  NaN  CLSD  NaN  Claret Sand  NP  NaN  SVIG                                                                                                                                                                           
79.00  NaN  NaN  BLUE  NaN  Blue Sand  NaN  NaN  NaN                                                                                                                                                                             
80.00  NaN  NaN  ABGF  NaN  Aberdeen Ground Formation  QH  NaN  CNSB                                                                                                                                                             
81.00  NaN  NaN  NUGU  NaN  Upper Glauconitic Unit  NB  NA  MOFI                                                                                                                                                                 
82.00  NaN  NaN  POWD  NaN  Powder Sand  GN  NaN  SVIG                                                                                                                                                                           
83.00  NaN  NaN  BASD  NaN  Basin Sand  NaN  Nan  CNSB                                                                                                                                                                           
84.00  NaN  NaN  CRND  NaN  Crenulate Sand  NaN  NaN  NaN                                                                                                                                                                        
85.00  NaN  NaN  NORS  NaN  Nordland Sand  QP  NaN  SONS                                                                                                                                                                         
86.00  NaN  NaN  MIOS  NaN  Miocene Sand  NM  NaN  ESHB                                                                                                                                                                          
87.00  NaN  NaN  MIOL  NaN  Miocene Limestone  NaN  Nan  CNSB                                                                                                                                                                    
88.00  NaN  NaN  FLSF  NaN  Fladen Sand Formation  GP  GO  WYGG                                                                                                                                                                  

注意:我不知道示例第一个单元格中的0是如何创建的,因此我将其遗漏在答案之外.我不知道它是否也应该被刮掉,因为我在任何地方都找不到.

Note: I don't know how the 0 in the first cell of your example is created, so I left it out of the answer. I don't know if it's supposed to be scraped as well, because I didn't find it anywhere.