Unable to get all links from a webpage

Problem description:

I am working on a web scraping project. The URL for the website I am scraping is https://www.beliani.de/sofas/ledersofa/

I am scraping all of the product links listed on this page. I tried getting the links with both Requests-HTML and Selenium, but I got only 57 and 24 links respectively, even though there are more than 150 products listed on the page. Below are the code blocks I am using.

Using Selenium:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from time import sleep

options = Options()
# no spaces around '=' inside the argument string, otherwise Chrome ignores it
options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36")

# path to chrome driver
DRIVER_PATH = 'C:/chromedriver'
driver = webdriver.Chrome(executable_path=DRIVER_PATH, options=options)

url = 'https://www.beliani.de/sofas/ledersofa/'

driver.get(url)
sleep(20)

# collect the href of each product anchor (printing the element itself
# only shows a WebElement repr, not the link)
links = []
for a in driver.find_elements_by_xpath('//*[@id="offers_div"]/div/div/a'):
    href = a.get_attribute('href')
    print(href)
    links.append(href)
print(len(links))

Using Requests-HTML:

from requests_html import HTMLSession

url = 'https://www.beliani.de/sofas/ledersofa/'

s = HTMLSession()
r = s.get(url)

# render the page with a headless browser so the JavaScript runs
r.html.render(sleep=20)

products = r.html.xpath('//*[@id="offers_div"]', first=True)

# getting 57 links using the block below:
links = []
for link in products.absolute_links:
    print(link)
    links.append(link)

print(len(links))
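
As a side note, if you stay with Requests-HTML, its render() method also accepts a scrolldown parameter that presses Page Down repeatedly before the HTML snapshot is taken, which can coax lazy-loaded items into the DOM. A minimal sketch under that assumption (the scrolldown and sleep values are guesses you would need to tune for this page):

from requests_html import HTMLSession

url = 'https://www.beliani.de/sofas/ledersofa/'

s = HTMLSession()
r = s.get(url)

# scrolldown=15 pages down 15 times, sleeping 1 s between scrolls,
# which may trigger the page's lazy loading (values are assumptions)
r.html.render(scrolldown=15, sleep=1)

products = r.html.xpath('//*[@id="offers_div"]', first=True)
print(len(products.absolute_links))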

I can't figure out which step I am doing wrong or what is missing.

The products on this page are lazy-loaded: you have to scroll through the page and reach the bottom before all of them are added to the DOM. Just opening the site loads only the items needed to display the currently visible section, so when you run your code it can only retrieve the products that have been loaded so far.

This one gave me 160 links:

from selenium import webdriver
from time import sleep

# same driver setup as in the question
driver = webdriver.Chrome(executable_path='C:/chromedriver')

driver.get('https://www.beliani.de/sofas/ledersofa/')
sleep(3)

# get the whole height of the document
height = driver.execute_script('return document.body.scrollHeight')

# break the page into ten parts and scroll through each one,
# pausing so the lazy-loaded products have time to appear
scroll_height = 0
for i in range(10):
    scroll_height += height / 10
    driver.execute_script('window.scrollTo(0, arguments[0]);', scroll_height)
    sleep(2)

# I used the 'itemBox' class to locate the anchors; any locator
# works once the loop above has finished loading the page
a_tags = driver.find_elements_by_class_name('itemBox')
count = 0
for tag in a_tags:
    if tag.get_attribute('href') is not None:
        print(tag.get_attribute('href'))
        count += 1

print(count)
driver.quit()
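
If the fixed ten-step scroll still misses items (the page can grow while you scroll, so document.body.scrollHeight measured up front may be stale), a common variant is to keep scrolling to the bottom until the height stops changing. A rough sketch of that loop, reusing the driver and sleep import from the block above (the 2-second pause is an assumption to tune):

# keep scrolling to the bottom until the page height stops growing,
# i.e. no more lazy-loaded products are being appended
last_height = driver.execute_script('return document.body.scrollHeight')
while True:
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    sleep(2)  # assumed wait for new products to load; adjust as needed
    new_height = driver.execute_script('return document.body.scrollHeight')
    if new_height == last_height:
        break
    last_height = new_height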