Why can't lxml.html parse all the div elements in target.html?

Problem description:

Please download the file from Dropbox and save it as /tmp/target.html.

target.html

Open it in Firefox with Firebug to inspect the HTML structure.

target.html clearly contains at least 10 div elements. Now parse all the div elements in target.html with lxml.html.

python3
>>> import lxml.html
>>> doc=lxml.html.parse("/tmp/target.html")
>>> divs=doc.xpath("//div")
>>> len(divs)
4

The result is 4. Why does the code above find so few divs when target.html contains at least 10?
The same thing happens when parsing the tables in target.html.
There are at least 9 tables in target.html (please check with Firebug), but lxml.html finds only 3.

python3
>>> import lxml.html
>>> doc=lxml.html.parse("/tmp/target.html")
>>> tables=doc.xpath("//table")
>>> len(tables)
3
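
The counts seen in Firebug can be cross-checked without a browser. The following is a minimal sketch, not part of the original post, that counts start tags in the raw markup with the standard library's html.parser, which tokenizes the source without building a tree. The names TagCounter and count_start_tags and the sample string are made up for illustration. The raw count may still differ from what Firebug shows if scripts on the page add or remove elements.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count start tags in raw markup without building a tree."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names before calling this hook.
        self.counts[tag] += 1


def count_start_tags(markup):
    """Return a Counter mapping each tag name to its number of start tags."""
    counter = TagCounter()
    counter.feed(markup)
    counter.close()
    return counter.counts


# Hypothetical sample standing in for the contents of target.html.
sample = "<div><table><tr><td><div>a</div></td></tr></table></div><DIV>b</DIV>"
counts = count_start_tags(sample)
print(counts["div"], counts["table"])
```

To run it on the real file, read /tmp/target.html into a string first (the file's encoding is not stated in the post, so it has to be chosen by the reader) and pass that string to count_start_tags. If the raw counts are well above 4 divs and 3 tables, the elements are present in the source and are being lost during tree construction.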

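Thanks to sideshowbarker. The likely cause is that lxml.html relies on libxml2's HTML parser, which recovers from malformed markup differently from a browser, whereas html5lib implements the HTML parsing algorithm that browsers use.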

sudo pip3 install html5lib

html5lib has to be installed with pip first.

import html5lib

# Parse with html5lib and build an lxml tree so that xpath() is available.
with open('/tmp/target.html', 'rb') as f:
    doc = html5lib.parse(f, treebuilder='lxml', namespaceHTMLElements=False)

divs = doc.xpath('//div')
tables = doc.xpath('//table')
print(len(divs))
print(len(tables))