lxml.html: extract a string by searching for a keyword
I have a fragment of HTML like the one below:
<li><label>The Keyword:</label><span><a href="../../..">The text</a></span></li>
I want to get the string "The keyword: The text".
I know that I can get the XPath of the above HTML using Chrome's inspector or Firefox's Firebug, then call select(xpath).extract() and strip the HTML tags to get the string. However, that approach is not generic enough, since the XPath is not consistent across different pages.
Hence, I'm thinking of the following approach: first, search for "The Keyword:" (the code below uses scrapy's HtmlXPathSelector, as I am not sure how to do the same in lxml.html):
hxs = HtmlXPathSelector(response)
hxs.select('//*[contains(text(), "The Keyword:")]')
When I pprint the result, I get:
>>> pprint( hxs.select('//*[contains(text(), "The Keyword:")]') )
<HtmlXPathSelector xpath='//*[contains(text(), "The Keyword:")]' data=u'<label>The Keyword:</label>'>
My question is how to get the wanted string, "The keyword: The text". I am trying to work out how to determine the XPath; once the XPath is known, I can of course get the wanted string.
I am also open to solutions that use something other than lxml.html.
Thanks.
from lxml import html

s = '<li><label>The Keyword:</label><span><a href="../../..">The text</a></span></li>'
tree = html.fromstring(s)
# text_content() concatenates all text nodes under the element, without the tags
text = tree.text_content()
print(text)
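The snippet above extracts text from a fragment that is already isolated. To mirror the keyword search from the question in lxml, something like the following sketch may work. It assumes the keyword sits in an element (here the label) whose parent wraps both the keyword and the value; the helper name text_near_keyword is made up for illustration and is not part of lxml.

```python
from lxml import html


def text_near_keyword(tree, keyword):
    """Hypothetical helper: find the first element whose text contains
    `keyword` and return the whitespace-joined text of its parent."""
    # $kw is an XPath variable; lxml fills it from the keyword argument.
    matches = tree.xpath('//*[contains(text(), $kw)]', kw=keyword)
    if not matches:
        return None
    parent = matches[0].getparent()
    # Fall back to the matched element itself if it has no parent.
    container = parent if parent is not None else matches[0]
    # itertext() yields every text node under the container in document order.
    parts = (t.strip() for t in container.itertext())
    return ' '.join(p for p in parts if p)


s = '<li><label>The Keyword:</label><span><a href="../../..">The text</a></span></li>'
tree = html.fromstring(s)
result = text_near_keyword(tree, 'The Keyword:')
print(result)
```

Joining the text nodes with a space is a design choice here: text_content() on the same element would run the label and the link text together with no separator. If the value lives in a sibling rather than under the same parent on some pages, the getparent() step would need adjusting.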