BeautifulSoup: merging consecutive tags
I have to work with the messiest HTML where individual words are split into separate tags, like in the following example:
<b style="mso-bidi-font-weight:normal"><span style='font-size:14.0pt;mso-bidi-font-size:11.0pt;line-height:107%;font-family:"Times New Roman",serif;mso-fareast-font-family:"Times New Roman"'>I</span></b><b style="mso-bidi-font-weight:normal"><span style='font-family:"Times New Roman",serif;mso-fareast-font-family:"Times New Roman"'>NTRODUCTION</span></b>
That's kind of hard to read, but basically the word "INTRODUCTION" is split into
<b><span>I</span></b>
and
<b><span>NTRODUCTION</span></b>
having the same inline properties for both span and b tags.
What's a good way to combine these? I figured I'd loop through to find consecutive b tags like this, but am stuck on how I'd go about merging the consecutive b tags.
for b in soup.find_all('b'):
    try:
        if b.next_sibling.name == 'b':
            pass  # combine them here??
    except AttributeError:
        # next_sibling is None when b is the last node in its parent
        pass
Any ideas?
The expected output is as follows:
<b style="mso-bidi-font-weight:normal"><span style='font-family:"Times New Roman",serif;mso-fareast-font-family:"Times New Roman"'>INTRODUCTION</span></b>
Perhaps you could check whether b.previousSibling is a b tag, then append the inner text of the current node to it. After doing this, you should be able to remove the current node from the tree with b.decompose().
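A minimal sketch of that suggestion, in case it helps. The helper name merge_consecutive_b, the shortened style values, and the wrapping p tag are made up for illustration. It assumes the b tags are direct siblings with no whitespace or other text node between them, and it uses previous_sibling, the bs4 spelling of previousSibling.

```python
from bs4 import BeautifulSoup, Tag

# Simplified stand-in for the Word-generated markup in the question
html = (
    '<p><b style="x"><span style="a">I</span></b>'
    '<b style="x"><span style="b">NTRODUCTION</span></b></p>'
)

def merge_consecutive_b(soup):
    """Fold each <b> into an immediately preceding <b> sibling (hypothetical helper)."""
    # find_all returns a list built up front, so removing nodes while looping is safe
    for b in soup.find_all('b'):
        prev = b.previous_sibling
        # Skip when there is no previous sibling, or it is a text node or another tag
        if not (isinstance(prev, Tag) and prev.name == 'b'):
            continue
        # Append to the last <span> inside the previous <b> if there is one,
        # otherwise to the <b> itself
        spans = prev.find_all('span')
        target = spans[-1] if spans else prev
        target.append(b.get_text())
        # Remove the now-redundant tag from the tree
        b.decompose()
    return soup

soup = merge_consecutive_b(BeautifulSoup(html, 'html.parser'))
print(soup.b.get_text())  # INTRODUCTION
```

Note that this keeps the attributes of the first b and span, whereas the expected output in the question keeps the span style of the second tag. If that matters, you could copy target.attrs from the later span before calling decompose. Because each merged tag is removed right away, a run of three or more consecutive b tags collapses into the first one.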