Can't parse a Google search results page with BeautifulSoup
I'm parsing webpages using BeautifulSoup from bs4 in Python. When I inspected the elements of a Google search page, this was the division containing the first result:

and since it had class = 'r', I wrote this code:
import requests
from bs4 import BeautifulSoup

site = requests.get('https://www.google.com/search?client=firefox-b-d&ei=CLtgXt_qO7LH4-EP6LSzuAw&q=%22narendra+modi%22+%22scams%22+%22frauds%22+%22corruption%22+%22modi%22+-lalit+-nirav&oq=%22narendra+modi%22+%22scams%22+%22frauds%22+%22corruption%22+%22modi%22+-lalit+-nirav&gs_l=psy-ab.3...5077.11669..12032...5.0..0.202.2445.1j12j1......0....1..gws-wiz.T_WHav1OCvk&ved=0ahUKEwjfjrfv94LoAhWy4zgGHWjaDMcQ4dUDCAo&uact=5')
page = BeautifulSoup(site.content, 'html.parser')
results = page.find_all('div', class_="r")
print(results)
But the command prompt returned just [].

What could have gone wrong, and how do I correct it?
EDIT 1: I edited my code accordingly by adding a dictionary of headers, yet the result is the same [].

Here's the new code:
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0'
}
site = requests.get('https://www.google.com/search?client=firefox-b-d&ei=CLtgXt_qO7LH4-EP6LSzuAw&q=%22narendra+modi%22+%22scams%22+%22frauds%22+%22corruption%22+%22modi%22+-lalit+-nirav&oq=%22narendra+modi%22+%22scams%22+%22frauds%22+%22corruption%22+%22modi%22+-lalit+-nirav&gs_l=psy-ab.3...5077.11669..12032...5.0..0.202.2445.1j12j1......0....1..gws-wiz.T_WHav1OCvk&ved=0ahUKEwjfjrfv94LoAhWy4zgGHWjaDMcQ4dUDCAo&uact=5', headers=headers)
page = BeautifulSoup(site.content, 'html.parser')
results = page.find_all('div', class_="r")
print(results)
NOTE: When I tell it to print the entire page, there's no problem, and when I take list(page.children), it works fine.
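That behaviour can be reproduced offline: find_all returns [] whenever the parsed document contains no element with the requested class, even though the document itself parses and prints fine. A minimal sketch (the HTML snippet and class names below are made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the server responded with <div class="g">,
# not the <div class="r"> seen in the browser inspector.
served_html = '<div class="g"><a href="#">First result</a></div>'

page = BeautifulSoup(served_html, 'html.parser')

# No <div class="r"> exists in this document, so find_all returns [].
print(page.find_all('div', class_='r'))       # []
# A class that does occur in the served markup matches normally.
print(len(page.find_all('div', class_='g')))  # 1
```

So an empty list means the HTML that requests received simply has no div with class "r", regardless of what the browser's inspector showed.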
Some websites require the User-Agent header to be set in order to block fake requests from non-browser clients. Fortunately, there's a way to pass headers to the request:
# Define a dictionary of HTTP request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0'
}

# Pass the headers in as a keyword argument
requests.get(url, headers=headers)
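To confirm that the header is actually attached before anything is sent over the network, the request can be built with requests.Request and prepared without sending it (the short URL below is a placeholder, not the original search URL):

```python
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0'
}
# Placeholder URL for illustration; substitute the real search URL.
url = 'https://www.google.com/search?q=example'

# Build and prepare the request without sending it,
# then inspect the headers it would carry.
prepared = requests.Request('GET', url, headers=headers).prepare()
print(prepared.headers['User-Agent'])
```

This prints the Firefox User-Agent string, showing that requests will send it instead of its default python-requests identifier.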
Note: A list of user agents can be found here