Get the text before <br/> in Python/bs4

Problem description:

I'm trying to scrape some data from a web page. The tag's text contains newlines and <br/> tags. I only want the telephone number at the beginning of the tag. Can you give me some advice on how to get only the number?

This is the HTML code:

<td>
    +421 48/471 78 14



    <br />
    <em>(bowling)</em>
</td>

Is there a way in BeautifulSoup to get the text in a tag, but only the text that is not surrounded by other tags? And second: how do I get rid of the text newlines and the HTML line breaks?

I'm using BS4.

The expected output is: '+421 48/471 78 14'

Do you have any ideas? Thank you.

html="""
<td>
    +421 48/471 78 14



    <br />
    <em>(bowling)</em>
</td>
"""


from bs4 import BeautifulSoup

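# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, "html.parser")

# First child of the <td> is the text node before <br />
print(soup.find("td").contents[0].strip())
# +421 48/471 78 14

# next_element is whatever was parsed right after the <td> start tag
print(soup.find("td").next_element.strip())
# +421 48/471 78 14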

soup.find("td").contents[0].strip() finds the contents of the tag, takes the first element (the text node before the <br />), and uses str.strip() to remove the surrounding whitespace, including all the \n newline characters.
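
The question also asks for only the text that is not wrapped in other tags. A minimal sketch of one way to do that, assuming the html.parser backend, is to walk the tag's direct children and keep only the text nodes. The helper name direct_text is made up for this illustration and is not part of bs4.

```python
from bs4 import BeautifulSoup, NavigableString

html = """
<td>
    +421 48/471 78 14



    <br />
    <em>(bowling)</em>
</td>
"""


def direct_text(tag):
    """Return stripped, non-empty text nodes that are direct children of tag.

    Hypothetical helper for illustration; text inside child tags such as
    <em> is skipped because only direct children are inspected.
    """
    pieces = []
    for child in tag.children:
        if isinstance(child, NavigableString):
            text = child.strip()
            if text:  # drop whitespace-only nodes left around <br /> and <em>
                pieces.append(text)
    return pieces


soup = BeautifulSoup(html, "html.parser")
numbers = direct_text(soup.find("td"))
print(numbers)
```

Unlike contents[0], this does not assume the wanted text is the first child, so it should also cope with a tag that starts with a child element.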

From the documentation on next_element:

The .next_element attribute of a string or tag points to whatever was parsed immediately afterwards.
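
For the HTML in the question both approaches return the same text node, but they are not interchangeable in general. A small sketch, assuming the html.parser backend and a made-up two-cell snippet, shows what happens when the first cell is empty:

```python
from bs4 import BeautifulSoup

# Illustrative input (not from the question): an empty cell followed by a filled one
soup = BeautifulSoup("<td></td><td>second</td>", "html.parser")
first_td = soup.find("td")

# The first cell has no children, so first_td.contents[0] would raise IndexError
is_empty = len(first_td.contents) == 0

# next_element does not stop at the tag's boundary: it moves on to the
# next thing that was parsed, which here is the second <td> tag
following = first_td.next_element
print(is_empty, following.name, following.get_text())
```

So contents[0] fails loudly on an empty tag, while next_element may silently return something from outside the tag; it is worth checking which behaviour suits the page being scraped.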