BeautifulSoup - Merging consecutive tags

Problem description:

I have to work with the messiest HTML where individual words are split into separate tags, like in the following example:

<b style="mso-bidi-font-weight:normal"><span style='font-size:14.0pt;mso-bidi-font-size:11.0pt;line-height:107%;font-family:"Times New Roman",serif;mso-fareast-font-family:"Times New Roman"'>I</span></b><b style="mso-bidi-font-weight:normal"><span style='font-family:"Times New Roman",serif;mso-fareast-font-family:"Times New Roman"'>NTRODUCTION</span></b>

That's kind of hard to read, but basically the word "INTRODUCTION" is split into

<b><span>I</span></b> 

<b><span>NTRODUCTION</span></b>

having the same inline properties for both span and b tags.

What's a good way to combine these? I figured I'd loop through to find consecutive b tags like this, but am stuck on how I'd go about merging the consecutive b tags.

for b in soup.find_all('b'):
    nxt = b.next_sibling
    # next_sibling can be None or a text node, so read .name defensively
    if getattr(nxt, 'name', None) == 'b':
        pass  # combine them here??

Any ideas?

The expected output is as follows:

<b style="mso-bidi-font-weight:normal"><span style='font-family:"Times New Roman",serif;mso-fareast-font-family:"Times New Roman"'>INTRODUCTION</span></b>

Perhaps you could check whether b.previous_sibling is a b tag, and if so append the inner text of the current node to it. After that, you should be able to remove the current node from the tree with b.decompose().
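
A minimal sketch of that suggestion, assuming bs4 and the built-in html.parser. The helper name merge_consecutive_b is made up for illustration, and appending into the first span found in the previous b tag is an assumption based on the sample markup. Note that this keeps the attributes of the first tag in each run, whereas the expected output in the question shows the style of the second span, so the attributes may still need adjusting afterwards.

```python
from bs4 import BeautifulSoup, Tag


def merge_consecutive_b(soup):
    """Fold each <b> into the <b> directly before it (hypothetical helper)."""
    # find_all returns a list built up front, so removing nodes
    # while looping over it is safe.
    for b in soup.find_all('b'):
        prev = b.previous_sibling
        # Skip when the previous sibling is missing, a text node
        # (including whitespace), or some other tag.
        if not (isinstance(prev, Tag) and prev.name == 'b'):
            continue
        # Assumption: the text lives in the first <span> of the previous
        # <b>; fall back to the <b> itself when there is no span.
        target = prev.find('span')
        if target is None:
            target = prev
        target.append(b.get_text())
        # Drop the now-redundant node; the following <b> (if any) then
        # sees `prev` as its previous sibling, so longer runs collapse too.
        b.decompose()
    return soup


html = '<b><span>I</span></b><b><span>NTRODUCTION</span></b>'
soup = BeautifulSoup(html, 'html.parser')
merge_consecutive_b(soup)
print(soup)
```

Only directly adjacent b tags are merged: if whitespace or other text sits between two b tags, the previous sibling is a text node and the pair is left alone.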