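Python Web Scraping: Basic Usage of Beautiful Soup

1. Introduction

　　Simply put, Beautiful Soup is a Python library whose main job is extracting data from web pages. The official description reads:

　　Beautiful Soup provides a few simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolkit that parses a document and hands you the data you need to extract; because it is so simple, a complete application takes very little code.

　　Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You do not have to think about encodings unless the document does not declare one and Beautiful Soup cannot detect it; in that case you only need to state the original encoding.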
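
　　As a rough illustration of stating the original encoding, the sketch below passes from_encoding to the constructor. The GBK sample bytes are made up for this example, and html.parser is used only so that the snippet runs without any third-party parser installed.

```python
from bs4 import BeautifulSoup

# Hypothetical input: a page stored as GBK bytes with no charset declaration.
raw = "<html><head><title>你好</title></head><body></body></html>".encode("gbk")

# from_encoding tells Beautiful Soup how to decode the incoming bytes.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")

title_text = soup.title.string            # decoded into a Unicode str
utf8_bytes = soup.title.encode("utf-8")   # tag serialized as UTF-8 bytes
print(title_text)
print(utf8_bytes)
```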

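　　Beautiful Soup has become as capable a Python parsing tool as lxml and html5lib, letting users choose flexibly between different parsing strategies or raw speed.

2. Environment Setup

　　Beautiful Soup 3 is no longer being developed, and Beautiful Soup 4 is recommended for current projects. It has been moved into the bs4 package, so the import is from bs4 import BeautifulSoup. The version used here is Beautiful Soup 4.3.2 (BS4 for short).

　　1. Quick install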

pip install beautifulsoup4

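　　2. If you want the latest version, you can download the package and install it manually, which is also very convenient

　　　　1. Beautiful Soup 3.2.1

　　　　https://pypi.python.org/pypi/BeautifulSoup/3.2.1

　　　　2. Beautiful Soup 4.3.2

　　　 https://pypi.python.org/pypi/beautifulsoup4/

　　　　After the download finishes, unpack it

　　　　and run the following command to complete the installation

　　　　python setup.py install

　　3. Then install lxml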

　　　pip install lxml

　　　Another available parser is html5lib, a pure-Python implementation that parses pages the same way a browser does. It can be installed with:

　　　pip install html5lib

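　　 Beautiful Soup supports the HTML parser in the Python standard library as well as several third-party parsers. If no third-party parser is installed, Python's built-in parser is used. The lxml parser is more powerful and faster, so installing it is recommended.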
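
　　One possible way to act on this advice is to prefer lxml when it is importable and fall back to the built-in parser otherwise. The helper name pick_parser is made up for this sketch and is not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

def pick_parser():
    """Return "lxml" when it is importable, otherwise the built-in parser."""
    try:
        import lxml  # noqa: F401  (only checking that it is installed)
        return "lxml"
    except ImportError:
        return "html.parser"

parser = pick_parser()
soup = BeautifulSoup("<p>hello</p>", parser)
print(parser, soup.p.string)
```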


3. Usage

　　Official documentation: http://beautifulsoup.readthedocs.io/zh_CN/latest/

　　1. Import

from bs4 import BeautifulSoup

　　2. First we create an HTML string to use in the operations below.

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" ><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" >Lacie</a> and
<a href="http://example.com/tillie" class="sister" >Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

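　　3. Create a BeautifulSoup object

soup = BeautifulSoup(html, "lxml")  # naming the parser avoids a "no parser specified" warning

　　We can also open a local HTML file.

soup = BeautifulSoup(open('index.html'), "lxml")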
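
　　When reading from disk, a with block makes sure the file handle is closed once parsing is done. The sketch below writes a small stand-in for index.html into a temporary directory first, purely so that it is self-contained; the file contents are made up, and html.parser is used so that it runs without lxml.

```python
import os
import tempfile
from bs4 import BeautifulSoup

# Write a small stand-in for index.html so the example is self-contained.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "index.html")
with open(path, "w", encoding="utf-8") as f:
    f.write("<html><head><title>Local page</title></head></html>")

# The with block closes the file as soon as parsing is done.
with open(path, encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

print(soup.title.string)
```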

　　4. Pretty-printed output

soup = BeautifulSoup(html,"lxml")
print(soup.prettify())

　　Output:

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>

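　　This is very handy. A local HTML file that has not been pretty-printed is hard to read, and once the output is formatted the structure of the file is clear at a glance.

　　5. The four kinds of objects

　　Beautiful Soup turns a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into 4 kinds:

　　(1) Tag

　　(2) NavigableString

　　(3) BeautifulSoup

　　(4) Comment

(1) Tag

What is a tag? It is simply an HTML tag, which anyone who has learned HTML will recognize, for example <a href="https://www.baidu.com">my name a</a>

　　Here is how a tag is used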

print(soup.title)
#<title>The Dormouse's story</title>

print(soup.head)
#<head><title>The Dormouse's story</title></head>

　　Attentive readers will notice that the document has several p tags, yet only the first match from the top is printed

print(soup.p)
#<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

print(soup.a)
#<a class="sister" href="http://example.com/elsie" ><!-- Elsie --></a>

soup = BeautifulSoup(html,"lxml")
print(soup.p)
#<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
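
　　To collect every match instead of only the first one, find_all can be used. A minimal sketch on a made-up three-paragraph snippet (html.parser is used here only so that it runs without lxml):

```python
from bs4 import BeautifulSoup

doc = "<p class='title'>one</p><p class='story'>two</p><p class='story'>three</p>"
soup = BeautifulSoup(doc, "html.parser")

first = soup.p                  # attribute access stops at the first <p>
every = soup.find_all("p")      # every <p>, in document order

first_text = first.string
all_text = [p.string for p in every]
print(first_text)
print(all_text)
```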

　　A tag has two commonly used attributes, name and attrs

　　name:　

soup = BeautifulSoup(html,"lxml")
print(soup.name)
print(soup.head.name)
#[document]
#head

　　attrs:　

soup = BeautifulSoup(html,"lxml")
print(soup.p.attrs)
#{'name': 'dromouse', 'class': ['title']}

　　Two different ways to get an attribute value:

soup = BeautifulSoup(html,"lxml")
print(soup.p.attrs)
print(soup.p.get("class"))
print(soup.p["class"])
#{'class': ['title'], 'name': 'dromouse'}
#['title']
#['title']
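
　　The two styles behave differently when the attribute is missing: get() returns None, while indexing raises KeyError. A short sketch on a made-up single-tag snippet:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title" name="dromouse">text</p>', "html.parser")
p = soup.p

print(p.get("class"))        # class is multi-valued, so a list comes back
print(p["name"])
print(p.get("id"))           # missing attribute: get() returns None

try:
    p["id"]                  # missing attribute: indexing raises KeyError
except KeyError:
    missing = True
else:
    missing = False
print(missing)
```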

　　Attributes can be read, and of course they can also be modified and deleted

　　Modify:

soup = BeautifulSoup(html,"lxml")
print(soup.p)
soup.p["class"]="newclass"
print(soup.p)
#<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
#<p class="newclass" name="dromouse"><b>The Dormouse's story</b></p>

　　Delete:

soup = BeautifulSoup(html,"lxml")
print(soup.p)
del soup.p["class"]
print(soup.p)
#<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
#<p name="dromouse"><b>The Dormouse's story</b></p>

(2) NavigableString

　　1. We have already located a tag by name, but what if we want the content inside it?

soup = BeautifulSoup(html,"lxml")
print(soup.p.string)
#The Dormouse's story
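
　　The value returned by .string is a NavigableString, which is a subclass of str. A small sketch, assuming a one-line sample snippet, showing the type and how to turn it into a plain str:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup("<p><b>The Dormouse's story</b></p>", "html.parser")
text = soup.b.string

print(type(text))
print(isinstance(text, NavigableString), isinstance(text, str))

plain = str(text)     # a plain str, detached from the parse tree
print(plain.upper())
```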

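(3) BeautifulSoup

The BeautifulSoup object represents the whole document. Most of the time it can be treated as a Tag object; it is a special Tag, and we can get its type, name, and attributes.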

soup = BeautifulSoup(html,"lxml")
print(type(soup.name))
print(soup.name)
print(soup.attrs)
#<class 'str'>
#[document]
#{}

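(4) Comment

　　A Comment object is a special type of NavigableString. Its output does not include the comment markers, so if it is not handled properly it can cause unexpected trouble in text processing.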

soup = BeautifulSoup(html,"lxml")
print(soup.a)
print(soup.a.string)
print(type(soup.a.string))

#<a class="sister" href="http://example.com/elsie" ><!-- Elsie --></a>
#Elsie 
#<class 'bs4.element.Comment'>

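　　The content of the a tag is actually a comment, but when we print it with .string the comment markers have been stripped, which can cause us unnecessary trouble.

　　Printing its type shows that it is a Comment, so it is best to check before using it. The check looks like this

from bs4.element import Comment
soup = BeautifulSoup(html,"lxml")
# isinstance is the idiomatic type check and also accepts subclasses of Comment
if isinstance(soup.a.string, Comment):
    print(soup.a.string)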
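
　　In practice the check is often used the other way round, to skip comments while collecting visible text. A sketch under that assumption, on a made-up two-link snippet:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

doc = '<p><a href="#1"><!-- Elsie --></a><a href="#2">Lacie</a></p>'
soup = BeautifulSoup(doc, "html.parser")

visible = []
for a in soup.find_all("a"):
    s = a.string
    if s is None or isinstance(s, Comment):
        continue      # skip empty tags and comment-only tags
    visible.append(str(s))

print(visible)
```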

　　(6) Traversing the document tree

　　The difference between contents and children

　　1. contents

　　A tag's .contents attribute returns the tag's child nodes as a list

soup = BeautifulSoup(html,"lxml")
print(soup.p.contents)
#[<b>The Dormouse's story</b>]

　　Because it is a list, we can pick out items by index

soup = BeautifulSoup(html,"lxml")
print(soup.p.contents[0])
#<b>The Dormouse's story</b>

　　2. children

　　It does not return a list, but we can get all the child nodes by iterating over it.

　　Printing .children shows that it is a list iterator object

soup = BeautifulSoup(html,"lxml")
print(soup.p.children)
#<list_iterator object at 0x01BAE310>

　　The iterator can be walked with a for loop

soup = BeautifulSoup(html,"lxml")
print(soup.p.children)
for line in soup.p.children:
    print(line)
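
　　Side by side, the difference looks roughly like this. The two-child paragraph is a made-up sample, and html.parser is used only so that the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>one</b><i>two</i></p>", "html.parser")
p = soup.p

as_list = p.contents              # a real list: supports len() and indexing
as_iter = p.children              # an iterator: consume it with for or list()

print(len(as_list), as_list[0])
names = [child.name for child in as_iter]
print(names)
```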

　　3. All descendants (.descendants)

.contents and .children contain only a tag's direct children. For example, the <head> tag has a single direct child, <title>

　　.descendants

soup = BeautifulSoup(html,"lxml")
for line in soup.descendants:
    print(line)

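　　children and contents visit each direct child exactly once (children just needs a for loop to walk through them), whereas descendants walks through every tag in the HTML, recursing into children, grandchildren, and so on. If that is still unclear, the output below should make it obvious.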

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie<span>Test<a>TEST</a></span></a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
<head><title>The Dormouse's story</title></head>
<title>The Dormouse's story</title>
The Dormouse's story


<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie<span>Test<a>TEST</a></span></a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>


<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<b>The Dormouse's story</b>
The Dormouse's story


<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie<span>Test<a>TEST</a></span></a>;
and they lived at the bottom of a well.</p>
Once upon a time there were three little sisters; and their names were

<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
 Elsie 
,

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
Lacie
 and

<a class="sister" href="http://example.com/tillie" id="link3">Tillie<span>Test<a>TEST</a></span></a>
Tillie
<span>Test<a>TEST</a></span>
Test
<a>TEST</a>
TEST
;
and they lived at the bottom of a well.


<p class="story">...</p>
...
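
　　The same contrast on a much smaller, made-up document: <head> has one direct child, but two descendants, because the text inside <title> counts as a node too.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><head><title>T</title></head></html>", "html.parser")
head = soup.head

direct = list(head.children)        # only <title>
nested = list(head.descendants)     # <title>, then the text node inside it
kinds = [type(node).__name__ for node in nested]

print(len(direct), len(nested))
print(kinds)
```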


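　　In plain terms: if a tag contains no other tags, .string returns the tag's content. If a tag contains exactly one child tag, .string also returns the innermost content. If a tag contains several pieces of content, .string cannot tell which one is wanted and returns None.

　　Some readers may say that is no problem: when there is more content, just iterate over it with a for loop. Let's try that.

　　Put another way: suppose you find an a tag that has text of its own and also contains another a tag (or some other tag) with its own text. Calling .string on it is guaranteed to return None.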
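
　　Beautiful Soup offers .strings and .stripped_strings for this kind of iteration, and the quoted strings printed in this part of the post look like the result of such loops over the sample document. The sketch below is only an assumption about what those loops looked like, run on a made-up snippet.

```python
from bs4 import BeautifulSoup

doc = "<div><p><b>only child</b></p><p> Hello <b>world</b> </p></div>"
soup = BeautifulSoup(doc, "html.parser")
first, second = soup.find_all("p")

print(first.string)       # a single chain of children: the inner text is returned
print(second.string)      # several children: None

raw = list(second.strings)                 # every text node, whitespace kept
clean = list(second.stripped_strings)      # whitespace stripped, blank nodes dropped
print(raw)
print(clean)
```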

"The Dormouse's story" ' ' ' ' "The Dormouse's story" ' ' 'Once upon a time there were three little sisters; and their names were ' ', ' 'Lacie' ' and ' 'Tillie' 'Test' 'TEST' '; and they lived at the bottom of a well.' ' ' '...' ' '

"The Dormouse's story" "The Dormouse's story" 'Once upon a time there were three little sisters; and their names were' ',' 'Lacie' 'and' 'Tillie' 'Test' 'TEST' '; and they lived at the bottom of a well.' '...'

<body> The Dormouse's story Once upon a time there were three little sisters; and their names were <a class="sister" href="http://example.com/elsie" ></a>, <a class="sister" href="http://example.com/lacie" >Lacie</a> and <a class="sister" href="http://example.com/tillie" >TillieTest<a>TEST</a></a>; and they lived at the bottom of a well. ... </body>

Once upon a time there were three little sisters; and their names were <a class="sister" href="http://example.com/elsie" id="link1"></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and <a class="sister" href="http://example.com/tillie" id="link3">TillieTest<a>TEST</a></a>; and they lived at the bottom of a well.

, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and <a class="sister" href="http://example.com/tillie" id="link3">TillieTest<a>TEST</a></a> ; and they lived at the bottom of a well.

[The Dormouse's story, Once upon a time there were three little sisters; and their names were <a class="sister" href="http://example.com/elsie" id="link1"></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well., <a class="sister" href="http://example.com/elsie" id="link1"></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, ...]

import re
soup = BeautifulSoup(html,"lxml")
print(soup.find_all(text=" Elsie "))
# [' Elsie ']
print(soup.find_all(text=["Tillie", " Elsie ", "Lacie"]))
#[' Elsie ', 'Lacie', 'Tillie']
print(soup.find_all(text=re.compile("Dormouse")))
#["The Dormouse's story", "The Dormouse's story"]
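
　　In newer releases of Beautiful Soup (4.4.0 and later, according to the official documentation) the same argument is spelled string, with text kept as the older name. A sketch assuming such a release, on a made-up snippet:

```python
import re
from bs4 import BeautifulSoup

doc = "<p><b>The Dormouse's story</b></p><p><a>Lacie</a> and <a>Tillie</a></p>"
soup = BeautifulSoup(doc, "html.parser")

exact = soup.find_all(string="Lacie")                     # exact match
any_of = soup.find_all(string=["Tillie", "Lacie"])        # match any item in the list
by_regex = soup.find_all(string=re.compile("Dormouse"))   # regular expression search

print(exact, any_of, by_regex)
```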

soup = BeautifulSoup(html,"lxml")
print(soup.find_all('p'))
print(soup.find_all('p',recursive=False))
#[The Dormouse's story, Once upon a time there were three little sisters; and their names were <a class="sister" href="http://example.com/elsie" ></a>, <a class="sister" href="http://example.com/lacie" >Lacie</a> and <a class="sister" href="http://example.com/tillie" >Tillie</a>; and they lived at the bottom of a well., ...]
#[]
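
　　The empty list comes from recursive=False, which limits the search to direct children of the object it is called on; the only direct child of the soup object is <html>, so no p is found. A small sketch on a made-up document:

```python
from bs4 import BeautifulSoup

doc = "<html><head><title>T</title></head><body><p>a</p></body></html>"
soup = BeautifulSoup(doc, "html.parser")

deep = soup.html.find_all("title")                         # searches all descendants
shallow = soup.html.find_all("title", recursive=False)     # direct children of <html> only
direct_head = soup.html.find_all("head", recursive=False)  # <head> is a direct child

print(len(deep), len(shallow), len(direct_head))
```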

soup = BeautifulSoup(html,"lxml")
print(soup.find_all('a'))
print(soup.find('a'))
#[<a class="sister" href="http://example.com/elsie" >Tillie</a>]
#<a class="sister" href="http://example.com/elsie" ></a>

soup = BeautifulSoup(html, 'lxml')
print(type(soup.select('title')))
#<class 'list'>
print(soup.select('title')[0].get_text())
#The Dormouse's story
for title in soup.select('title'):
    print(title.get_text())
#The Dormouse's story
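
　　select() takes CSS selectors, so lookups by tag name, class name, id, a combination of these, or an attribute value all go through the same call. The sketch below uses a trimmed, made-up version of the sample document with id attributes added, and assumes a Beautiful Soup install with CSS selector support.

```python
from bs4 import BeautifulSoup

doc = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
"""
soup = BeautifulSoup(doc, "html.parser")

by_tag = soup.select("b")                                     # tag name
by_class = soup.select(".sister")                             # class name
by_id = soup.select("#link1")                                 # id
combined = soup.select("p.story #link2")                      # combination
by_attr = soup.select('a[href="http://example.com/lacie"]')   # attribute value

class_text = [t.get_text() for t in by_class]
print(class_text)
```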
