How to parse a table with rowspan and colspan

Question:

First, I have read "Parsing a table with rowspan and colspan". I even answered that question. Please read this before marking it as a duplicate.

<table border="1">
  <tr>
    <th>A</th>
    <th>B</th>
  </tr>
  <tr>
    <td rowspan="2">C</td>
    <td rowspan="1">D</td>
  </tr>
  <tr>
    <td>E</td>
    <td>F</td>
  </tr>
  <tr>
    <td>G</td>
    <td>H</td>
  </tr>
</table>

It will render as:

+---+---+---+
| A | B |   |
+---+---+   |
|   | D |   |
+ C +---+---+
|   | E | F |
+---+---+---+
| G | H |   |
+---+---+---+

<table border="1">
  <tr>
    <th>A</th>
    <th>B</th>
  </tr>
  <tr>
    <td rowspan="2">C</td>
    <td rowspan="2">D</td>
  </tr>
  <tr>
    <td>E</td>
    <td>F</td>
  </tr>
  <tr>
    <td>G</td>
    <td>H</td>
  </tr>
</table>

However, this one will render like this:

+---+---+-------+
| A | B |       |
+---+---+-------+
|   |   |       |
| C | D +---+---+
|   |   | E | F |
+---+---+---+---+
| G | H |       |
+---+---+---+---+

The code from my previous answer can only parse a table that has all of its columns defined in the first row.

from bs4 import BeautifulSoup


def table_to_2d(table_tag):
    rows = table_tag("tr")
    cols = rows[0](["td", "th"])
    table = [[None] * len(cols) for _ in range(len(rows))]
    for row_i, row in enumerate(rows):
        for col_i, col in enumerate(row(["td", "th"])):
            insert(table, row_i, col_i, col)
    return table


def insert(table, row, col, element):
    if row >= len(table) or col >= len(table[row]):
        return
    if table[row][col] is None:
        value = element.get_text()
        table[row][col] = value
        if element.has_attr("colspan"):
            span = int(element["colspan"])
            for i in range(1, span):
                table[row][col+i] = value
        if element.has_attr("rowspan"):
            span = int(element["rowspan"])
            for i in range(1, span):
                table[row+i][col] = value
    else:
        insert(table, row, col + 1, element)

soup = BeautifulSoup('''
    <table>
        <tr><th>1</th><th>2</th><th>5</th></tr>
        <tr><td rowspan="2">3</td><td colspan="2">4</td></tr>
        <tr><td>6</td><td>7</td></tr>
    </table>''', 'html.parser')
print(table_to_2d(soup.table))

My question is how to parse a table into a 2D array that represents exactly how it renders in the browser. An explanation of how the browser renders the table would also be fine.

No, you can't just count the td or th cells. You have to scan across the table to get the number of columns on each row, adding to that count any rowspans still active from a preceding row.

In a different scenario, parsing a table with rowspans, I tracked rowspan counts per column number to ensure that data from different cells ended up in the correct column. A similar technique can be used here.

First, count columns, keeping only the highest number. Keep a list of rowspan numbers of 2 or greater and subtract 1 from each for every row you process. That way you know how many 'extra' columns there are on each row. Use the highest column count to build your output matrix.
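
The counting pass can be sketched on plain data, without any HTML parsing. This is a simplified illustration, not the full implementation: the function name count_columns and the (rowspan, colspan) tuple layout are assumptions made for the example, and it does not special-case 0 spans or a wide colspan on the last cell.

```python
def count_columns(rows):
    """Return the widest row, counting cells held open by earlier rowspans.

    `rows` is a list of rows; each row is a list of (rowspan, colspan) pairs.
    """
    pending = []  # remaining row counts for rowspans that are still active
    widest = 0
    for cells in rows:
        # cells on this row plus the columns occupied from rows above
        width = sum(colspan for _, colspan in cells) + len(pending)
        widest = max(widest, width)
        # register this row's rowspans, then age every entry by one row
        pending += [rowspan for rowspan, _ in cells]
        pending = [n - 1 for n in pending if n > 1]
    return widest


# The two tables from the question, reduced to their span values.
first = [[(1, 1), (1, 1)], [(2, 1), (1, 1)], [(1, 1), (1, 1)], [(1, 1), (1, 1)]]
second = [[(1, 1), (1, 1)], [(2, 1), (2, 1)], [(1, 1), (1, 1)], [(1, 1), (1, 1)]]
print(count_columns(first))   # 3
print(count_columns(second))  # 4
```

The widest row in each case is the third one, where the cells carried over from the row above push E and F to the right.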

Next, loop over the rows and cells again, and this time track rowspans in a dictionary mapping from column number to active count. Again, carry over anything with a value of 2 or over to the next row. Then shift column numbers to account for any rowspans that are active; the first td in a row would actually be the second if there was a rowspan active on column 0, and so on.
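
As a rough sketch of that column shifting, again on plain data rather than parsed HTML: the name place_cells and the (text, rowspan) tuple layout are hypothetical, and colspans are left out to keep the idea visible.

```python
def place_cells(rows):
    """Return (row, col, text) for the top-left corner of every cell.

    `rows` is a list of rows; each row is a list of (text, rowspan) pairs.
    """
    active = {}  # column number -> rows still covered by a rowspan
    placed = []
    for r, cells in enumerate(rows):
        col = 0
        for text, rowspan in cells:
            # skip columns that a cell from an earlier row still occupies
            while active.get(col, 0):
                col += 1
            placed.append((r, col, text))
            active[col] = rowspan
            col += 1
        # carry over only the spans that reach into the next row
        active = {c: n - 1 for c, n in active.items() if n > 1}
    return placed


second = [
    [("A", 1), ("B", 1)],
    [("C", 2), ("D", 2)],
    [("E", 1), ("F", 1)],
    [("G", 1), ("H", 1)],
]
for row, col, text in place_cells(second):
    print(row, col, text)
```

With the second table from the question, E and F land in columns 2 and 3 because columns 0 and 1 are still held by C and D, while G and H return to columns 0 and 1.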

Your code copies the value for spanned columns and rows into the output repeatedly; I achieved the same by looping over the colspan and rowspan numbers of a given cell (each defaulting to 1) to copy the value multiple times. I'm ignoring overlapping cells; the HTML table specifications state that overlapping cells are an error and that it is up to the user agent to resolve conflicts. In the code below, colspan trumps rowspan cells.

from itertools import product

def table_to_2d(table_tag):
    rowspans = []  # track pending rowspans
    rows = table_tag.find_all('tr')

    # first scan, see how many columns we need
    colcount = 0
    for r, row in enumerate(rows):
        cells = row.find_all(['td', 'th'], recursive=False)
        # count columns (including spanned).
        # add active rowspans from preceding rows
        # we *ignore* the colspan value on the last cell, to prevent
        # creating 'phantom' columns with no actual cells, only extended
        # colspans. This is achieved by hardcoding the last cell width as 1. 
        # a colspan of 0 means "fill until the end" but can really only apply
        # to the last cell; ignore it elsewhere. 
        colcount = max(
            colcount,
            sum(int(c.get('colspan', 1)) or 1 for c in cells[:-1]) + len(cells[-1:]) + len(rowspans))
        # update rowspan bookkeeping; 0 is a span to the bottom. 
        rowspans += [int(c.get('rowspan', 1)) or len(rows) - r for c in cells]
        rowspans = [s - 1 for s in rowspans if s > 1]

    # it doesn't matter if there are still rowspan numbers 'active'; no extra
    # rows to show in the table means the larger than 1 rowspan numbers in the
    # last table row are ignored.

    # build an empty matrix for all possible cells
    table = [[None] * colcount for row in rows]

    # fill matrix from row data
    rowspans = {}  # track pending rowspans, column number mapping to count
    for row, row_elem in enumerate(rows):
        span_offset = 0  # how many columns are skipped due to row and colspans 
        for col, cell in enumerate(row_elem.find_all(['td', 'th'], recursive=False)):
            # adjust for preceding row and colspans
            col += span_offset
            while rowspans.get(col, 0):
                span_offset += 1
                col += 1

            # fill table data
            rowspan = rowspans[col] = int(cell.get('rowspan', 1)) or len(rows) - row
            colspan = int(cell.get('colspan', 1)) or colcount - col
            # next column is offset by the colspan
            span_offset += colspan - 1
            value = cell.get_text()
            for drow, dcol in product(range(rowspan), range(colspan)):
                try:
                    table[row + drow][col + dcol] = value
                    rowspans[col + dcol] = rowspan
                except IndexError:
                    # rowspan or colspan outside the confines of the table
                    pass

        # update rowspan bookkeeping
        rowspans = {c: s - 1 for c, s in rowspans.items() if s > 1}

    return table
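
The value-copying step can be looked at in isolation. The helper below is only an illustrative sketch (the name fill and its signature are assumptions, not part of the answer's code); it shows how itertools.product walks every (row offset, column offset) pair a cell covers and how spans that run past the grid are dropped.

```python
from itertools import product


def fill(table, row, col, value, rowspan=1, colspan=1):
    """Copy `value` into every grid position the cell covers."""
    for drow, dcol in product(range(rowspan), range(colspan)):
        try:
            table[row + drow][col + dcol] = value
        except IndexError:
            # the span runs past the edge of the grid; ignore that part
            pass


grid = [[None] * 3 for _ in range(3)]
fill(grid, 0, 0, "1")
fill(grid, 0, 1, "2")
fill(grid, 0, 2, "5")
fill(grid, 1, 0, "3", rowspan=2)
fill(grid, 1, 1, "4", colspan=2)
fill(grid, 2, 1, "6")
fill(grid, 2, 2, "7", rowspan=5)  # deliberately too tall; the excess is ignored
print(grid)  # [['1', '2', '5'], ['3', '4', '4'], ['3', '6', '7']]
```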

This parses your sample table correctly:

>>> from pprint import pprint
>>> pprint(table_to_2d(soup.table), width=30)
[['1', '2', '5'],
 ['3', '4', '4'],
 ['3', '6', '7']]

It also handles your other examples. The first table:

>>> table1 = BeautifulSoup('''
... <table border="1">
...   <tr>
...     <th>A</th>
...     <th>B</th>
...   </tr>
...   <tr>
...     <td rowspan="2">C</td>
...     <td rowspan="1">D</td>
...   </tr>
...   <tr>
...     <td>E</td>
...     <td>F</td>
...   </tr>
...   <tr>
...     <td>G</td>
...     <td>H</td>
...   </tr>
... </table>''', 'html.parser')
>>> pprint(table_to_2d(table1.table), width=30)
[['A', 'B', None],
 ['C', 'D', None],
 ['C', 'E', 'F'],
 ['G', 'H', None]]

The second:

>>> table2 = BeautifulSoup('''
... <table border="1">
...   <tr>
...     <th>A</th>
...     <th>B</th>
...   </tr>
...   <tr>
...     <td rowspan="2">C</td>
...     <td rowspan="2">D</td>
...   </tr>
...   <tr>
...     <td>E</td>
...     <td>F</td>
...   </tr>
...   <tr>
...     <td>G</td>
...     <td>H</td>
...   </tr>
... </table>
... ''', 'html.parser')
>>> pprint(table_to_2d(table2.table), width=30)
[['A', 'B', None, None],
 ['C', 'D', None, None],
 ['C', 'D', 'E', 'F'],
 ['G', 'H', None, None]]

Last but not least, the code correctly handles spans that extend beyond the actual table, and "0" spans (extending to the ends), like in the following example:

<table border="1">
  <tr>
    <td rowspan="3">A</td>
    <td rowspan="0">B</td>
    <td>C</td>
    <td colspan="2">D</td>
  </tr>
  <tr>
    <td colspan="0">E</td>
  </tr>
</table>

There are two rows of 4 cells, even though the rowspan and colspan values would have you believe there could be 3 and 5:

+---+---+---+---+
|   |   | C | D |
| A | B +---+---+
|   |   |   E   |
+---+---+-------+

Such overspanning is handled just as the browser would handle it: the excess is ignored, and the 0 spans extend to the remaining rows or columns:

>>> span_demo = BeautifulSoup('''
... <table border="1">
...   <tr>
...     <td rowspan="3">A</td>
...     <td rowspan="0">B</td>
...     <td>C</td>
...     <td colspan="2">D</td>
...   </tr>
...   <tr>
...     <td colspan="0">E</td>
...   </tr>
... </table>''', 'html.parser')
>>> pprint(table_to_2d(span_demo.table), width=30)
[['A', 'B', 'C', 'D'],
 ['A', 'B', 'E', 'E']]
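
The span arithmetic behind that result can also be written out explicitly. The helper below is a hypothetical sketch (effective_span is not part of the code above): it takes this answer's reading of a 0 span as "extend to the end", and it clamps oversized spans up front instead of relying on a caught IndexError.

```python
def effective_span(attr, remaining):
    """Turn a raw rowspan/colspan attribute into a usable cell count.

    `attr` is the attribute string, or None when the attribute is absent.
    `remaining` is how many rows (or columns) are left, counting the
    cell's own position.
    """
    span = int(attr) if attr is not None else 1
    if span == 0:
        span = remaining          # a 0 span fills whatever is left
    return min(span, remaining)   # never reach past the table edge


# Cells from the example: 2 rows and 4 columns in total.
print(effective_span("3", 2))   # A: rowspan 3 on row 0 of 2 -> 2
print(effective_span("0", 2))   # B: rowspan 0 on row 0 of 2 -> 2
print(effective_span("2", 1))   # D: colspan 2 in the last column -> 1
print(effective_span("0", 2))   # E: colspan 0 starting at column 2 of 4 -> 2
print(effective_span(None, 4))  # C: no attribute -> 1
```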