Script suddenly stops crawling without errors or exceptions
I'm not sure why, but my script always stops crawling once it hits page 9. There are no errors, exceptions, or warnings, so I'm kind of at a loss.
Can anyone help me?
# Fragment; assumes create_webdriver_instance() and already_scraped_product_titles
# are defined elsewhere in the full script.
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

def initiate_crawl():
    def refresh_page(url):
        ff = create_webdriver_instance()
        ff.get(url)
        ff.find_element(By.XPATH, '//*[@id="FilterItemView_sortOrder_dropdown"]/div/span[2]/span/span/span/span').click()
        ff.find_element(By.XPATH, '//a[contains(text(), "Discount - High to Low")]').click()
        items = WebDriverWait(ff, 15).until(
            EC.visibility_of_all_elements_located((By.XPATH, '//div[contains(@id, "100_dealView_")]'))
        )
        print(len(items))
        for count, item in enumerate(items):
            slashed_price = item.find_elements(By.XPATH, './/span[contains(@class, "a-text-strike")]')
            active_deals = item.find_elements(By.XPATH, './/*[contains(text(), "Add to Cart")]')
            if len(slashed_price) > 0 and len(active_deals) > 0:
                product_title = item.find_element(By.ID, 'dealTitle').text
                if product_title not in already_scraped_product_titles:
                    already_scraped_product_titles.append(product_title)
                    url = ff.current_url
                    ff.quit()
                    refresh_page(url)
                    break
            if count + 1 == len(items):  # 'is' compares identity, not value; use '==' here
                try:
                    next_button = WebDriverWait(ff, 15).until(
                        EC.text_to_be_present_in_element((By.PARTIAL_LINK_TEXT, 'Next→'), 'Next→')
                    )
                    ff.find_element(By.PARTIAL_LINK_TEXT, 'Next→').click()
                    url = ff.current_url
                    ff.quit()
                    refresh_page(url)
                except Exception as error:
                    print(error)
                    ff.quit()

    refresh_page('https://www.amazon.ca/gp/goldbox/ref=gbps_ftr_s-3_4bc8_dct_10-?gb_f_c2xvdC0z=sortOrder:BY_SCORE,discountRanges:10-25%252C25-50%252C50-70%252C70-&pf_rd_p=f5836aee-0969-4c39-9720-4f0cacf64bc8&pf_rd_s=slot-3&pf_rd_t=701&pf_rd_i=gb_main&pf_rd_m=A3DWYIK6Y9EEQB&pf_rd_r=CQ7KBNXT36G95190QJB1&ie=UTF8')

initiate_crawl()
Printing the length of items invokes some strange behaviour too. Instead of always returning 32, which would correspond to the number of items on each page, it prints 32 for the first page, 64 for the second, 96 for the third, and so on. I fixed this by using //div[contains(@id, "100_dealView_")]/div[contains(@class, "dealContainer")] instead of //div[contains(@id, "100_dealView_")] as the XPath for the items variable. I'm hoping this is the reason why it runs into issues on page 9. I'm running tests right now. Update: it is now scraping page 10 and beyond, so the issue is resolved.
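The fix narrows the match from any div whose id contains 100_dealView_ down to only its dealContainer children. ElementTree cannot evaluate contains(), so the snippet below only illustrates the general descendant-versus-child scoping idea on a synthetic snippet (the tag names and classes here are made up, not the real Amazon markup):

```python
import xml.etree.ElementTree as ET

# Synthetic stand-in for a deals grid: two direct deal containers plus a
# nested duplicate that a broad descendant query would also pick up.
doc = """
<root>
  <div id="grid">
    <div class="dealContainer"><span>deal 1</span></div>
    <div class="dealContainer"><span>deal 2</span></div>
    <div class="other">
      <div class="dealContainer"><span>stale copy</span></div>
    </div>
  </div>
</root>
"""
grid = ET.fromstring(doc).find('div')

# './/div[...]' searches at any depth, like the original broad XPath: 3 hits
broad = grid.findall('.//div[@class="dealContainer"]')
# './div[...]' restricts to direct children, like the narrowed XPath: 2 hits
narrow = grid.findall('./div[@class="dealContainer"]')
```

The same principle explains why the narrowed locator stopped the match count from growing page over page.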
As per your 10th revision of this question, the error message...
HTTPConnectionPool(host='127.0.0.1', port=58992): Max retries exceeded with url: /session/e8beed9b-4faa-4e91-a659-56761cb604d7/element (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x0000022D31378A58>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it'))
...implies that the get() method failed, raising an HTTPConnectionPool error with the message Max retries exceeded.
A couple of things:
- As per the discussion max-retries-exceeded exceptions are confusing, the traceback is somewhat misleading. Requests wraps the exception for the user's convenience; the original exception is part of the message displayed.
- Requests never retries (it sets retries=0 for urllib3's HTTPConnectionPool), so the error would have been much more canonical without the MaxRetryError and HTTPConnectionPool keywords. An ideal traceback would therefore have been:
NewConnectionError(<class 'socket.error'>: [Errno 10061] No connection could be made because the target machine actively refused it)
You will find a detailed discussion in MaxRetryError: HTTPConnectionPool: Max retries exceeded (Caused by ProtocolError('Connection aborted.', error(111, 'Connection refused')))
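Incidentally, the "[WinError 10061] ... actively refused it" at the core of that traceback is a plain socket-level connection refusal, independent of Selenium or Requests. A minimal stdlib sketch (probe_refused_port is just an illustrative helper, not part of either library) reproduces the same exception class:

```python
import socket

def probe_refused_port():
    # Grab a free local port, then close it so nothing is listening there.
    s = socket.socket()
    s.bind(('127.0.0.1', 0))
    port = s.getsockname()[1]
    s.close()
    try:
        # Connecting to a port with no listener is refused by the OS,
        # just like the dead WebDriver server port in the traceback.
        socket.create_connection(('127.0.0.1', port), timeout=1)
        return None
    except OSError as exc:
        return type(exc).__name__
```

On Windows this surfaces as WinError 10061, on Linux as errno 111, but Python 3 maps both to ConnectionRefusedError.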
As per the Release Notes of Selenium 3.14.1:
* Fix ability to set timeout for urllib3 (#6286)
The corresponding merge is: repair urllib3 can't set timeout!
Once you upgrade to Selenium 3.14.1 you will be able to set the timeout and see canonical tracebacks, and will be able to take the required action.
A couple of relevant references:
- Adding max_retries as an argument
- Removed the bundled charade and urllib3.
- Third party libraries committed verbatim
I have taken your full script from codepen.io - A PEN BY Anthony. I had to make a few tweaks to your existing code, as follows:
- As you have used:
ua_string = random.choice(ua_strings)
you have to mandatorily import random as:
import random
- You have created the variable next_button but haven't used it, so I have clubbed up the following four lines:
next_button = WebDriverWait(ff, 15).until(
    EC.text_to_be_present_in_element((By.PARTIAL_LINK_TEXT, 'Next→'), 'Next→')
)
ff.find_element(By.PARTIAL_LINK_TEXT, 'Next→').click()
as:
WebDriverWait(ff, 15).until(EC.text_to_be_present_in_element((By.PARTIAL_LINK_TEXT, 'Next→'), 'Next→'))
ff.find_element(By.PARTIAL_LINK_TEXT, 'Next→').click()
Your modified code block will be:
# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
import time
import random

""" Set Global Variables
"""
ua_strings = ['Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36']
already_scraped_product_titles = []

""" Create Instances of WebDriver
"""
def create_webdriver_instance():
    ua_string = random.choice(ua_strings)
    profile = webdriver.FirefoxProfile()
    profile.set_preference('general.useragent.override', ua_string)
    options = Options()
    options.add_argument('--headless')
    # pass the options object so that --headless actually takes effect
    return webdriver.Firefox(firefox_profile=profile, options=options)

""" Construct List of UA Strings
"""
def fetch_ua_strings():
    ff = create_webdriver_instance()
    ff.get('https://techblog.willshouse.com/2012/01/03/most-common-user-agents/')
    ua_strings_ff_eles = ff.find_elements(By.XPATH, '//td[@class="useragent"]')
    for ua_string in ua_strings_ff_eles:
        if 'mobile' not in ua_string.text and 'Trident' not in ua_string.text:
            ua_strings.append(ua_string.text)
    ff.quit()

""" Log in to Amazon to Use SiteStripe in order to Generate Affiliate Links
"""
def log_in(ff):
    ff.find_element(By.XPATH, '//a[@id="nav-link-yourAccount"] | //a[@id="nav-link-accountList"]').click()
    ff.find_element(By.ID, 'ap_email').send_keys('your_email@example.com')  # replace with your own credentials
    ff.find_element(By.ID, 'continue').click()
    ff.find_element(By.ID, 'ap_password').send_keys('your_password')  # replace with your own credentials
    ff.find_element(By.NAME, 'rememberMe').click()
    ff.find_element(By.ID, 'signInSubmit').click()

""" Build Lists of Product Page URLs
"""
def initiate_crawl():
    def refresh_page(url):
        ff = create_webdriver_instance()
        ff.get(url)
        ff.find_element(By.XPATH, '//*[@id="FilterItemView_sortOrder_dropdown"]/div/span[2]/span/span/span/span').click()
        ff.find_element(By.XPATH, '//a[contains(text(), "Discount - High to Low")]').click()
        items = WebDriverWait(ff, 15).until(
            EC.visibility_of_all_elements_located((By.XPATH, '//div[contains(@id, "100_dealView_")]'))
        )
        for count, item in enumerate(items):
            slashed_price = item.find_elements(By.XPATH, './/span[contains(@class, "a-text-strike")]')
            active_deals = item.find_elements(By.XPATH, './/*[contains(text(), "Add to Cart")]')
            # For Groups of Items on Sale
            # active_deals = //*[contains(text(), "Add to Cart") or contains(text(), "View Deal")]
            if len(slashed_price) > 0 and len(active_deals) > 0:
                product_title = item.find_element(By.ID, 'dealTitle').text
                if product_title not in already_scraped_product_titles:
                    already_scraped_product_titles.append(product_title)
                    url = ff.current_url
                    # Scrape Details of Each Deal
                    # extract(ff, item.find_element(By.ID, 'dealImage').get_attribute('href'))
                    print(product_title[:10])
                    ff.quit()
                    refresh_page(url)
                    break
            if count + 1 == len(items):  # 'is' compares identity, not value; use '==' here
                try:
                    print('')
                    print('new page')
                    WebDriverWait(ff, 15).until(EC.text_to_be_present_in_element((By.PARTIAL_LINK_TEXT, 'Next→'), 'Next→'))
                    ff.find_element(By.PARTIAL_LINK_TEXT, 'Next→').click()
                    time.sleep(10)
                    url = ff.current_url
                    print(url)
                    print('')
                    ff.quit()
                    refresh_page(url)
                except Exception as error:
                    """
                    ff.find_element(By.XPATH, '//*[@id="pagination-both-004143081429407891"]/ul/li[9]/a').click()
                    url = ff.current_url
                    ff.quit()
                    refresh_page(url)
                    """
                    print('cannot find ff.find_element(By.PARTIAL_LINK_TEXT, "Next?")')
                    print('Because of... {}'.format(error))
                    ff.quit()

    refresh_page('https://www.amazon.ca/gp/goldbox/ref=gbps_ftr_s-3_4bc8_dct_10-?gb_f_c2xvdC0z=sortOrder:BY_SCORE,discountRanges:10-25%252C25-50%252C50-70%252C70-&pf_rd_p=f5836aee-0969-4c39-9720-4f0cacf64bc8&pf_rd_s=slot-3&pf_rd_t=701&pf_rd_i=gb_main&pf_rd_m=A3DWYIK6Y9EEQB&pf_rd_r=CQ7KBNXT36G95190QJB1&ie=UTF8')

# def extract_info(ff, url):

fetch_ua_strings()
initiate_crawl()
Console Output: With Selenium v3.14.0 and Firefox Quantum v62.0.3, I can extract the following output on the console:
J.Rosée Si
B.Catcher
Bluetooth4
FRAM G4164
Major Crim
20% off Oh
True Blood
Prime-Line
Marathon 3
True Blood
B.Catcher
4 Film Fav
True Blood
Texture Pa
Westinghou
True Blood
ThermoPro
...
...
...
Note: I could have optimized your code and performed the same web scraping operations initializing the Firefox Browser Client only once, traversing through the various products and their details. But to preserve your logic and innovation I have suggested the minimal changes required to get you through.