Python Crawler Study Notes


Development Environment

  1. Interpreter version: Python 3.6
  2. 32-bit: python-3.6.2.exe
  3. 64-bit: python-3.6.5.exe
  4. Development tools: PyCharm, Jupyter Notebook
  5. Browser: latest version of Google Chrome

Installation Steps

  1. Python 3.6
  2. PyCharm
  3. Jupyter

Installing Libraries

  1. requests -> install with: pip install requests
  2. beautifulsoup4 -> install with: pip install beautifulsoup4
  3. html5lib -> install with: pip install html5lib
  4. lxml -> install with: pip install lxml
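
After installing, a quick sanity check that all four libraries are importable:

import requests
import bs4
import html5lib
import lxml

# Each import above fails loudly if the library is missing
print(requests.__version__, bs4.__version__)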

The requests Library

Custom Request Headers
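
requests sends a default User-Agent that many sites block; passing a headers dict lets you mimic a browser. A minimal sketch (the User-Agent string here is just an example):

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}  # example browser UA
r = requests.get('http://httpbin.org/get', headers=headers)
print(r.request.headers['User-Agent'])  # the header that was actually sent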

Response Status Codes

r = requests.get('http://httpbin.org/get')
r.status_code

On success, this prints: 200
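
Instead of comparing status_code by hand, requests can also raise an exception on error responses:

r = requests.get('http://httpbin.org/status/404')
print(r.status_code)   # 404
r.raise_for_status()   # raises requests.exceptions.HTTPError for 4xx/5xx responses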


BeautifulSoup4

Introduction

Beautiful Soup is a Python library for extracting data from HTML and XML files. It works with your parser of choice to provide idiomatic ways of navigating, searching, and modifying the document. Beautiful Soup can save you hours or even days of work.

  • Import: from bs4 import BeautifulSoup
  • soup = BeautifulSoup(open("index.html"))
  • soup = BeautifulSoup("data")
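
A runnable sketch of the constructor; the HTML string is made up for illustration, and the parser is named explicitly so bs4 does not have to guess one:

from bs4 import BeautifulSoup

html = "<html><body><p>data</p></body></html>"  # illustrative document
soup = BeautifulSoup(html, 'html5lib')          # 'lxml' or 'html.parser' also work
print(soup.p.string)                            # data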

Kinds of Objects

Tag: a Tag object corresponds to a tag in the original XML or HTML document.

  • soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
  • tag = soup.b
  • type(tag)
  • Output: <class 'bs4.element.Tag'>
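
Besides its type, a Tag exposes its name and attributes; a self-contained version of the example above:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
print(tag.name)      # b
print(tag.attrs)     # {'class': ['boldest']}
print(tag['class'])  # ['boldest']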

String

  • NavigableString: a string contained inside a tag
  • tag.string
  • Output: 'Extremely bold'
  • type(tag.string)
  • Output: <class 'bs4.element.NavigableString'>

find_all,find

  • find_all(): searches for every element that matches the criteria and returns them as a list
  • find(): searches for a single element, the first match, and returns it as a Tag object (see the short sketch below)
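
A short sketch of both search functions on a made-up snippet:

from bs4 import BeautifulSoup

html = '<ul><li class="hot">a</li><li>b</li><li class="hot">c</li></ul>'  # illustrative
soup = BeautifulSoup(html, 'html.parser')

print(soup.find_all('li'))                          # all three <li> tags
print(soup.find('li'))                              # just the first <li>
print(soup.find_all('li', attrs={'class': 'hot'}))  # filter by attribute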

A complete sample script is given below:

# -*- coding: utf-8 -*-
# Author: WX
# Time: 2018-10-30

import requests
from bs4 import BeautifulSoup
import os
from urllib.request import urlretrieve

def get_two_page():
    # 1. Send the request
    # 2. Check the response status
    # 3. Get the page content
    # 4. Parse the content with bs4
    # 5. Extract by rule: 1. name 2. date of birth 3. height 4. measurements 5. details... 6. photos
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
    }
    response = requests.get(url=URL, headers=headers)
    if response.status_code == 200:
        response.encoding = 'utf-8'
        soup = BeautifulSoup(response.content, 'html5lib')
        file = open("xiaohuar_detail_page_data.txt", "w", encoding='utf-8')
        txt = ''

        # Table cells: the first <td> of each row holds the person's info
        for table in soup.find_all('table'):
            for tr in table.find_all('tr'):
                td = tr.find('td')
                if td is None:
                    continue
                name = td.next_element.next_element.string
                txt += "Name: " + str(name) + "\n"

        # Detail text
        div_info = soup.find('div', attrs={'class': 'infocontent'})
        if div_info is not None:
            txt += "Details: " + div_info.get_text() + "\n"

        # Images: collect every <img> src and download it
        div_entry = soup.find('div', attrs={'class': 'post_entry'})
        if div_entry is not None:
            for li in div_entry.find_all('li'):
                img = li.find('img')
                if img is not None:
                    img_path = img['src']
                    txt += "Image: " + img_path + "\n"
                    get_info(img_path)

        # Write out and close
        file.write(txt)
        file.close()
        print("Scraping finished")
    else:
        print("The content you requested has been censored; access failed")

def get_info(img_path):
    download1 = 'download2Pic'
    # Create the download directory if it does not exist
    if not os.path.exists(download1):
        os.mkdir(download1)
    # The last path segment is the file name
    name = img_path.split('/')
    file_name = name[-1]
    try:
        print(file_name + ".jpg downloading....................")
        urlretrieve(img_path, download1 + '/' + file_name + '.jpg')
    except Exception:
        print("Under 18, viewing not allowed; download failed")

if __name__ == '__main__':
    URL = "http://www.xiaohuar.com/p-1-1994.html"
    get_two_page()
