Basic operations with the requests module

1. GET requests with the requests module

Goal: scrape the page data of the Sogou homepage

import requests

url = "https://www.sogou.com/"
response = requests.get(url=url)
# Get the page data as a string
page = response.text
with open("./sogou.html", "w", encoding="utf-8") as fp:
    fp.write(page)

Some other commonly used attributes of the response object:

# Get the page data as bytes
print(response.content)
# Get the response status code
print(response.status_code)
# Get the response headers
print(response.headers)
# Get the requested URL
print(response.url)
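
If response.text comes back garbled, you can override the encoding that requests guessed before reading the text. A minimal sketch (the charset below is an assumption; check the page's actual charset in practice):

import requests

response = requests.get(url="https://www.sogou.com/")
# requests guesses the encoding from the response headers; override it if the guess is wrong
response.encoding = "utf-8"  # assumed charset for this page
print(response.text[:200])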

For a GET request with query parameters, you can either build them into the URL yourself or pass them as a dictionary via the params argument, as in the code below:

import requests

url = "https://www.sogou.com/web"

# Pack the query parameters into a dictionary
params = {
    'query': "周杰伦",
    'ie': "utf-8",
}

response = requests.get(url=url, params=params)

print(response.status_code)
print(response.content)

Request headers are passed the same way, via the headers parameter.
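
For example, a minimal sketch of the same search request carrying a custom User-Agent header (the header value is only an illustration):

import requests

url = "https://www.sogou.com/web"
params = {
    "query": "周杰伦",
    "ie": "utf-8",
}
# Request headers are passed as a dictionary, just like params
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36",
}
response = requests.get(url=url, params=params, headers=headers)
print(response.status_code)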

2. POST requests

Log in to Douban Movies and fetch the data returned after a successful login (Douban has since changed this URL, so this is only an example).

import requests

# This URL is no longer valid; it is kept here only as an example
url = "https://accounts.douban.com/login"

# Pack the POST request parameters
data = {
    "source": "movie",
    "redir": "https://movie.douban.com/",
    "form_email": "1111",  # 你的账号密码
    "form_password": "11111",
    "login": "登录",
}

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36",
}
# Send the POST request
response = requests.post(url=url, data=data, headers=headers)


print(response.status_code)
print(response.text)
with open("./douban.html", "w", encoding="utf-8") as fp:
    fp.write(response.text)

3. Ajax GET requests

Goal: fetch the details of romance films from the Douban Movie chart

import requests

url = "https://movie.douban.com/j/chart/top_list?"

params = {
    "type": "5",
    "interval_id": "100:90",
    "action": "",
    "start": "120",
    "limit": "20",
}

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36",
}

response = requests.get(url=url, params=params, headers=headers)

print(response.text)
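
Since this interface returns JSON, you can also parse it with response.json() instead of reading raw text. A minimal sketch continuing from the response above (the "title" field name is an assumption about the returned structure):

# Parse the JSON body into Python objects (a list of movie dicts here)
movies = response.json()
for movie in movies:
    # "title" is assumed to be one of the returned fields
    print(movie.get("title"))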

4. Ajax POST requests

Goal: scrape the location data of KFC restaurants in a given city

import requests

url = "http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword"

data = {
    "cname": "",
    "pid": "",
    "keyword": "北京",
    "pageIndex": "1",
    "pageSize": "10",
}

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36",
}

response = requests.post(url=url, data=data, headers=headers)

print(response.text)
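
This interface also returns JSON, so you can persist it with the standard json module rather than saving raw text. A minimal sketch continuing from the response above (the output filename is arbitrary):

import json

# Parse the JSON body and write it to a local file, keeping Chinese characters readable
stores = response.json()
with open("./kfc.json", "w", encoding="utf-8") as fp:
    json.dump(stores, fp, ensure_ascii=False, indent=2)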

5. Putting it all together

Goal: scrape several pages of Sogou Zhihu search results for a given keyword

import requests
import os

# Create a folder for the downloaded pages
if not os.path.exists("./pages"):
    os.mkdir("./pages")

url = "https://zhihu.sogou.com/zhihu?"

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36",
}

# Keyword to search for
word = input("please enter your word:")
# Page range to fetch
start_num = int(input("enter start page number:"))
end_num = int(input("enter end page number:"))

for page in range(start_num, end_num+1):
    param = {
        "query": word,
        "page": page,
        "ie": "utf-8",
    }
    response = requests.get(url=url, params=param, headers=headers)
    filename = word + str(page) + ".html"
    # Persist the page data
    with open("pages/%s" % filename, "w", encoding="utf-8") as fp:
        fp.write(response.text)

6. Working with cookies

Workflow: 1. Log in and obtain the cookie. 2. When requesting the personal homepage, the cookie has to be carried along with that request.

Note: use a session object to send the requests; it stores and resends cookies automatically.

import requests

session = requests.Session()

# Send the login request
login_url = "https://accounts.douban.com/passport/login"

data = {
    "source": 'None',
    "redir": "https://movie.douban.com/people/123/",
    "form_email": "123",  # 你的账号密码
    "form_password": "123",
    "login": "登录",
}

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36",
}

session_response = session.post(url=login_url, data=data, headers=headers)

url = 'https://movie.douban.com/people/123/'
response = session.get(url=url, headers=headers)
page = response.text
with open("./doubanlogin.html", "w", encoding="utf-8") as fp:
    fp.write(page)

Note: Douban's API has changed, so the parameters above no longer work; the point is just to understand the workflow. You could also construct the cookie yourself to simulate a login, as sketched below, although doing so is rather tedious.
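
If you do want to carry a cookie by hand, requests accepts a cookies dictionary on each request. A minimal sketch (the cookie names and values below are placeholders; copy the real ones from your logged-in browser's developer tools):

import requests

url = "https://movie.douban.com/people/123/"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36",
}
# Placeholder cookie; replace the key/value pairs with those from your own session
cookies = {
    "session_id": "your_cookie_value_here",
}
response = requests.get(url=url, headers=headers, cookies=cookies)
print(response.status_code)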

7. Working with proxies

import requests

proxies = {
    "http": "http://10.10.1.10:3128",
    "https": "http://10.10.1.10:1080",
}

url = "https://www.taobao.com"

response = requests.get(url=url, proxies=proxies)
print(response.status_code)

Of course this proxy address is not a real one; replace it with a working proxy of your own. requests also supports SOCKS proxies, which requires installing the SOCKS extra (the socks library).
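
A minimal sketch of a SOCKS proxy, assuming you have installed the extra with pip install requests[socks] and have a SOCKS5 proxy running at the placeholder address below:

import requests

# Placeholder address; replace with your own SOCKS5 proxy
proxies = {
    "http": "socks5://127.0.0.1:1080",
    "https": "socks5://127.0.0.1:1080",
}
response = requests.get("https://www.taobao.com", proxies=proxies)
print(response.status_code)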

The requests module offers far more than what is covered here; consult the documentation for anything else you need.