How do I keep text and images together in their original order?

Question:

I'm working on a web page scraper with BeautifulSoup4. I want to get both the text and the images of an article, but I've run into a problem. The HTML looks something like this:

<div>
 some texts1
 <br />
 <img src="imgpic.jpg" />
 <br />
 some texts2
</div>

I get the full text with:

post_soup.get_text()

and save all the images in the div with urllib2 as usual. Finally, I write everything to an HTML page, with all the text at the top and the images at the end. What I want instead is to save them in a new HTML page in the same order as on the page I scraped: first some texts1, then the image, then some texts2.

Any suggestions?

Answer:

This is not the best or most correct way, but it should work:

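from bs4 import BeautifulSoup

html = """<div>
 some texts1
 <br />
 <img src="imgpic.jpg" />
 <br />
 some texts2
</div>"""

# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(html, "html.parser")
# stripped_strings yields each non-empty text node with whitespace trimmed.
text = list(soup.stripped_strings)

print(text[0])
print(soup.find("img")["src"])
print(text[1])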

Output:

some texts1
imgpic.jpg
some texts2
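
The snippet above hard-codes the order (text, image, text). A more general sketch, assuming the article sits inside a single container element, is to walk that container's descendants in document order and record each text node and each image as it appears. The helper names `ordered_parts` and `render` are hypothetical, not part of BeautifulSoup, and downloading the images or rewriting `src` to local paths is left out:

```python
from html import escape

from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

html = """<div>
 some texts1
 <br />
 <img src="imgpic.jpg" />
 <br />
 some texts2
</div>"""


def ordered_parts(container):
    """Return ("text", str) and ("img", src) pairs in document order."""
    parts = []
    for node in container.descendants:
        if isinstance(node, Tag):
            # Only images are recorded; other tags such as <br> are skipped.
            if node.name == "img" and node.get("src"):
                parts.append(("img", node["src"]))
        elif isinstance(node, NavigableString) and not isinstance(node, Comment):
            text = node.strip()
            if text:  # ignore whitespace-only text nodes
                parts.append(("text", text))
    return parts


def render(parts):
    """Build a new HTML fragment that keeps the original ordering."""
    body = []
    for kind, value in parts:
        if kind == "img":
            body.append('<img src="%s" />' % escape(value, quote=True))
        else:
            body.append("<p>%s</p>" % escape(value))
    return "<div>\n" + "\n".join(body) + "\n</div>"


soup = BeautifulSoup(html, "html.parser")
parts = ordered_parts(soup.find("div"))
print(render(parts))
```

Because `.descendants` visits nodes in the order they occur in the source, the rendered fragment puts some texts1 first, then the image, then some texts2. Wrapping each text run in `<p>` is a design choice of this sketch; the original page separated them with `<br />`.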
