Can't parse a Google search results page with BeautifulSoup
I'm parsing webpages using BeautifulSoup from bs4 in Python. When I inspected the elements of a Google search page, this was the division containing the first result:

and since it had class = 'r', I wrote this code:
import requests
from bs4 import BeautifulSoup

site = requests.get('https://www.google.com/search?client=firefox-b-d&ei=CLtgXt_qO7LH4-EP6LSzuAw&q=%22narendra+modi%22+%22scams%22+%22frauds%22+%22corruption%22+%22modi%22+-lalit+-nirav&oq=%22narendra+modi%22+%22scams%22+%22frauds%22+%22corruption%22+%22modi%22+-lalit+-nirav&gs_l=psy-ab.3...5077.11669..12032...5.0..0.202.2445.1j12j1......0....1..gws-wiz.T_WHav1OCvk&ved=0ahUKEwjfjrfv94LoAhWy4zgGHWjaDMcQ4dUDCAo&uact=5')
page = BeautifulSoup(site.content, 'html.parser')
results = page.find_all('div', class_="r")
print(results)
But the command prompt returned just [].

What could have gone wrong, and how do I correct it?
EDIT 1: I edited my code accordingly by adding a dictionary of headers, yet the result is the same [].

Here's the new code:
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0'
}
site = requests.get('https://www.google.com/search?client=firefox-b-d&ei=CLtgXt_qO7LH4-EP6LSzuAw&q=%22narendra+modi%22+%22scams%22+%22frauds%22+%22corruption%22+%22modi%22+-lalit+-nirav&oq=%22narendra+modi%22+%22scams%22+%22frauds%22+%22corruption%22+%22modi%22+-lalit+-nirav&gs_l=psy-ab.3...5077.11669..12032...5.0..0.202.2445.1j12j1......0....1..gws-wiz.T_WHav1OCvk&ved=0ahUKEwjfjrfv94LoAhWy4zgGHWjaDMcQ4dUDCAo&uact=5', headers=headers)
page = BeautifulSoup(site.content, 'html.parser')
results = page.find_all('div', class_="r")
print(results)
NOTE: When I tell it to print the entire page, there's no problem, and when I take list(page.children), it works fine.
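That behaviour can be reproduced offline: find_all returns [] whenever the parsed document contains no element with the requested class, even though the document itself parses and prints fine. A minimal sketch (the HTML snippet and class names below are made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the server responded with <div class="g">,
# not the <div class="r"> seen in the browser inspector.
served_html = '<div class="g"><a href="#">First result</a></div>'

page = BeautifulSoup(served_html, 'html.parser')

# No <div class="r"> exists in this document, so find_all returns [].
print(page.find_all('div', class_='r'))       # []
# A class that does occur in the served markup matches normally.
print(len(page.find_all('div', class_='g')))  # 1
```

So an empty list means the HTML that requests received simply has no div with class "r", regardless of what the browser's inspector showed.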
Some websites require the User-Agent header to be set in order to block fake requests from non-browser clients. Fortunately, there's a way to pass headers to the request:
# Define a dictionary of HTTP request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0'
}

# Pass the headers in as a keyword argument
requests.get(url, headers=headers)
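To confirm that the header is actually attached before anything is sent over the network, the request can be built with requests.Request and prepared without sending it (the short URL below is a placeholder, not the original search URL):

```python
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0'
}
# Placeholder URL for illustration; substitute the real search URL.
url = 'https://www.google.com/search?q=example'

# Build and prepare the request without sending it,
# then inspect the headers it would carry.
prepared = requests.Request('GET', url, headers=headers).prepare()
print(prepared.headers['User-Agent'])
```

This prints the Firefox User-Agent string, showing that requests will send it instead of its default python-requests identifier.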
Note: A list of user agents can be found here