How do I scrape text and images together?
Question:
I'm working on a web page scraper with BeautifulSoup4. I want to get the text and images of an article, but I have run into a problem. The HTML looks something like this:
<div>
some texts1
<br />
<img src="imgpic.jpg" />
<br />
some texts2
</div>
I get the full text with:
post_soup.get_text()
and save all the images in the div with urllib2, as usual.
Finally, I save them in an HTML page, but all the text ends up at the top and the images at the end. I want to save them in a new HTML page in the same order as on the page I scraped: first some texts1, then the image, then some texts2.
Any suggestions?
Answer:
This is neither the best nor the correct way, but it should work:
from bs4 import BeautifulSoup
html = "<div>\
some texts1\
<br />\
<img src=\"imgpic.jpg\" />\
<br />\
some texts2\
</div>"
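# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, "html.parser")
# stripped_strings yields each non-empty text node with whitespace trimmed
text = list(soup.stripped_strings)
# print(...) with a single argument works on both Python 2 and Python 3
print(text[0])
print(soup.find("img")["src"])
print(text[1])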
Output:
some texts1
imgpic.jpg
some texts2
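
The snippet above only works when there is exactly one image sitting between exactly two text nodes. A more general option is to walk the div's descendants in document order and record each text node and each image as it appears. The following is only a minimal sketch of that idea, not a definitive implementation: ordered_parts is a hypothetical helper name introduced here for illustration, and it assumes the only things worth keeping are plain text and img tags that have a src attribute.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

html = """<div>
some texts1
<br />
<img src="imgpic.jpg" />
<br />
some texts2
</div>"""


def ordered_parts(container):
    """Hypothetical helper: return ("text", str) and ("img", src)
    tuples in the order they occur inside `container`."""
    parts = []
    # .descendants visits every nested node in document order
    for node in container.descendants:
        if isinstance(node, Tag):
            if node.name == "img" and node.get("src"):
                parts.append(("img", node["src"]))
        elif isinstance(node, NavigableString) and not isinstance(node, Comment):
            stripped = node.strip()
            if stripped:  # skip whitespace-only text between tags
                parts.append(("text", stripped))
    return parts


soup = BeautifulSoup(html, "html.parser")
parts = ordered_parts(soup.find("div"))
for kind, value in parts:
    print(kind, value)
```

Because the list keeps the original order, the new HTML page can be written by looping over it once, downloading each "img" entry and emitting each "text" entry at the position where it was found.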