Why does lxml.html sometimes swallow/remove whitespace instead of preserving it?

Problem description:

Given the following code, one might reasonably expect lxml to spit back out almost exactly the same string of HTML that was fed into it.

from lxml import html

HTML_TEST_STRING = r"""
<pre>
<em>abc</em>

<em>def</em>

<sub>ghi</sub>

<sub>jkl</sub>

<em>mno</em>

<em>pqr</em>

</pre>
"""

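# Ask the parser to keep whitespace-only text nodes.
parser = html.HTMLParser( remove_blank_text=False )
doc = html.fromstring( HTML_TEST_STRING, parser=parser )
# Serialize the parsed tree back to a str so it can be compared with the input.
html_out_string = html.tostring( doc, encoding="unicode" )
print( html_out_string )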

Instead, even though everything is contained within a <pre> pre-formatted block and the remove_blank_text flag is set to False, lxml preserves whitespace for only some of the contents and mysteriously drops it for the rest. See the unexpected output of the above code below:

<pre>
<em>abc</em>

<em>def</em>

<sub>ghi</sub><sub>jkl</sub><em>mno</em>

<em>pqr</em>

</pre>

Specifically, whenever lxml encounters a <sub> tag, it goes batty and loses the "tail" text content that follows that sub element (even though that "sub element" arguably isn't an element at all, since it's wrapped in a pre element).
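To see which "tail" values are affected, it can help to print them directly rather than eyeballing the serialized output. The following is a minimal sketch, not part of the original post: SAMPLE and collect_tails are hypothetical names, and the sample markup is a shortened version of the test string above. On an affected build, the tails of the sub elements would presumably come back empty or None instead of containing the newlines.

```python
from lxml import html

# Shortened version of the test string from the question (an assumption
# made for brevity; any markup with <sub> inside <pre> should do).
SAMPLE = "<pre>\n<em>abc</em>\n\n<sub>ghi</sub>\n\n<sub>jkl</sub>\n\n<em>mno</em>\n</pre>"


def collect_tails(markup):
    """Return (tag, tail) pairs for every <em> and <sub> element, in document order."""
    parser = html.HTMLParser(remove_blank_text=False)
    root = html.fromstring(markup, parser=parser)
    # .tail is the text that follows an element's closing tag, up to the
    # next sibling or the parent's closing tag.
    return [(el.tag, el.tail) for el in root.iter("em", "sub")]


for tag, tail in collect_tails(SAMPLE):
    # repr() makes whitespace-only or missing tails visible.
    print(tag, repr(tail))
```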

The most likely catalyst for this curious behavior is that, like me, you're on Windows and using a Python version that lxml doesn't publish a binary package for.

In such a scenario, one portion of the lxml website points you to the "official" unofficial Windows binaries for libxml2 so that you (potentially via the pip install script) can build a new lxml binary that supports your Python version. The problem, however, is that the binaries it links to are at least 4 years old and contain the bug you're running into.

The easiest solution to this problem is to instead download and install Christoph Gohlke's unofficial binary archive (a so-called "wheel") of lxml that is actually built for your OS/Python variant. (Another section of the lxml website also recommends this, but if you're like me, you ignored that path, wanting to run as little unofficial binary code as reasonably possible.)

(e.g. pip3 install --upgrade lxml-3.5.0-cp35-none-win32.whl)

Gohlke's package is built using a more recent version of libxml2 which has apparently already fixed that bug, so if everything above worked properly, you can now stop wasting hours of your life barking up the wrong 'tree'. You're not using lxml wrong, and it's not that lxml doesn't support preserving whitespace in this scenario (as so many other SO entries might have you think); it's just that you were unwittingly using a version of libxml2 with a bug that has since been fixed.
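Since the culprit is the libxml2 build rather than lxml itself, it may be worth checking which libxml2 version your installation was compiled against and which one it is actually running with. A small sketch, assuming lxml is importable; libxml2_versions is a hypothetical helper name, and the post does not say which libxml2 release contains the fix, so this only tells you whether your build is old, not whether it is affected.

```python
from lxml import etree


def libxml2_versions():
    """Return the lxml and libxml2 version tuples reported by lxml.etree."""
    return {
        "lxml": etree.LXML_VERSION,
        # libxml2 version lxml was compiled against.
        "libxml2_compiled": etree.LIBXML_COMPILED_VERSION,
        # libxml2 version actually loaded at runtime.
        "libxml2_runtime": etree.LIBXML_VERSION,
    }


for name, version in libxml2_versions().items():
    print(name, ".".join(str(part) for part in version))
```

Running this before and after installing the wheel shows whether the libxml2 version behind lxml actually changed.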

With a recent build of libxml2 driving your lxml installation, the sample code you posted will instead produce the output you expected (consistently preserved whitespace):

<pre>
<em>abc</em>

<em>def</em>

<sub>ghi</sub>

<sub>jkl</sub>

<em>mno</em>

<em>pqr</em>

</pre>