Script suddenly stops crawling without errors or exceptions

Problem description:

I'm not sure why, but my script always stops crawling once it hits page 9. There are no errors, exceptions, or warnings, so I'm kind of at a loss.

Can someone help me?

P.S. Here is the full script, in case anyone wants to test it themselves!

def initiate_crawl():
    def refresh_page(url):
        ff = create_webdriver_instance()
        ff.get(url)
        ff.find_element(By.XPATH, '//*[@id="FilterItemView_sortOrder_dropdown"]/div/span[2]/span/span/span/span').click()
        ff.find_element(By.XPATH, '//a[contains(text(), "Discount - High to Low")]').click()
        items = WebDriverWait(ff, 15).until(
            EC.visibility_of_all_elements_located((By.XPATH, '//div[contains(@id, "100_dealView_")]'))
        )
        print(len(items))
        for count, item in enumerate(items):
            slashed_price = item.find_elements(By.XPATH, './/span[contains(@class, "a-text-strike")]')
            active_deals = item.find_elements(By.XPATH, './/*[contains(text(), "Add to Cart")]')
            if len(slashed_price) > 0 and len(active_deals) > 0:
                product_title = item.find_element(By.ID, 'dealTitle').text
                if product_title not in already_scraped_product_titles:
                    already_scraped_product_titles.append(product_title)
                    url = ff.current_url
                    ff.quit()
                    refresh_page(url)
                    break
            if count+1 is len(items):
                try:
                    next_button = WebDriverWait(ff, 15).until(
                        EC.text_to_be_present_in_element((By.PARTIAL_LINK_TEXT, 'Next→'), 'Next→')
                    )
                    ff.find_element(By.PARTIAL_LINK_TEXT, 'Next→').click()
                    url = ff.current_url
                    ff.quit()
                    refresh_page(url)
                except Exception as error:
                    print(error)
                    ff.quit()

    refresh_page('https://www.amazon.ca/gp/goldbox/ref=gbps_ftr_s-3_4bc8_dct_10-?gb_f_c2xvdC0z=sortOrder:BY_SCORE,discountRanges:10-25%252C25-50%252C50-70%252C70-&pf_rd_p=f5836aee-0969-4c39-9720-4f0cacf64bc8&pf_rd_s=slot-3&pf_rd_t=701&pf_rd_i=gb_main&pf_rd_m=A3DWYIK6Y9EEQB&pf_rd_r=CQ7KBNXT36G95190QJB1&ie=UTF8')

initiate_crawl()

Printing the length of items invokes some strange behaviour too. Instead of it always returning 32, which would correspond to the number of items on each page, it prints 32 for the first page, 64 for the second, 96 for the third, so on and so forth. I fixed this by using //div[contains(@id, "100_dealView_")]/div[contains(@class, "dealContainer")] instead of //div[contains(@id, "100_dealView_")] as the XPath for the items variable. I'm hoping this is the reason why it runs into issues on page 9. I'm running tests right now. Update: It is now scraping page 10 and beyond, so the issue is resolved.
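
For reference, the narrowed locator in context looks roughly like this (just the changed lines, assuming the rest of refresh_page stays the same):

    # Scoping the locator to each deal's own container keeps len(items) at the
    # per-page count (32) instead of accumulating across pages.
    items = WebDriverWait(ff, 15).until(
        EC.visibility_of_all_elements_located(
            (By.XPATH, '//div[contains(@id, "100_dealView_")]/div[contains(@class, "dealContainer")]')
        )
    )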

As per your 10th revision of this question, the error message...

HTTPConnectionPool(host='127.0.0.1', port=58992): Max retries exceeded with url: /session/e8beed9b-4faa-4e91-a659-56761cb604d7/element (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x0000022D31378A58>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it'))

...implies that the get() method failed, raising an HTTPConnectionPool error with the message Max retries exceeded.
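
One common situation that produces this class of error (mentioned only as an illustration, not as a diagnosis of your script) is sending a WebDriver command after the session has been quit, so the client tries to reach a local port that nothing is listening on anymore. A minimal sketch:

    # Illustration only: after quit(), geckodriver's local HTTP endpoint is gone,
    # so the next command can surface as HTTPConnectionPool(...): Max retries
    # exceeded / NewConnectionError instead of a normal WebDriverException.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    ff = webdriver.Firefox()
    ff.get('https://www.amazon.ca/gp/goldbox')
    ff.quit()                            # the remote end shuts down here
    ff.find_element(By.ID, 'dealTitle')  # -> connection refused on 127.0.0.1:<port>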

A couple of things:

  • As per the discussion in max retries exceeded exceptions are confusing, the traceback is somewhat misleading. Requests wraps the exception for the user's convenience. The original exception is part of the message displayed.
  • Requests never retries (it sets retries=0 for urllib3's HTTPConnectionPool), so the error would have been much more canonical without the MaxRetryError and HTTPConnectionPool keywords. So an ideal traceback would have been:

NewConnectionError(<class 'socket.error'>: [Errno 10061] No connection could be made because the target machine actively refused it)

  • You will find a detailed explanation in MaxRetryError: HTTPConnectionPool: Max retries exceeded (Caused by ProtocolError('Connection aborted.', error(111, 'Connection refused')))

    As per the Selenium 3.14.1 release notes:

    * Fix ability to set timeout for urllib3 (#6286)
    

    The Merge is: repair urllib3 can't set timeout!

    Once you upgrade to Selenium 3.14.1, you will be able to set the timeout, see canonical tracebacks, and take the required action.
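
    If it helps, you can confirm which version you are running before and after the upgrade (a small check, assuming a pip-managed environment):

    # Check the installed Selenium version; if it reports < 3.14.1,
    # upgrade with:  pip install --upgrade selenium
    import selenium
    print(selenium.__version__)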

    A few relevant references:

    • Adding max_retries as an argument
    • Removed the bundled charade and urllib3.
    • Third party libraries committed verbatim
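
    For context on the "Adding max_retries as an argument" reference above: in requests the retry count is exposed through HTTPAdapter and defaults to zero. A small sketch, separate from the Selenium script:

    import requests
    from requests.adapters import HTTPAdapter

    session = requests.Session()
    # requests performs no retries by default; max_retries raises that ceiling.
    session.mount('https://', HTTPAdapter(max_retries=3))
    response = session.get('https://www.amazon.ca/gp/goldbox')
    print(response.status_code)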

    I have taken your full script from codepen.io - A PEN BY Anthony. I had to make a few tweaks to your existing code as follows:

    • As you have used:

    ua_string = random.choice(ua_strings)
    

    You have to import random as:

    import random
    

  • You have created the variable next_button but haven't used it. I have clubbed up the following four lines:

    next_button = WebDriverWait(ff, 15).until(
                    EC.text_to_be_present_in_element((By.PARTIAL_LINK_TEXT, 'Next→'), 'Next→')
                )
    ff.find_element(By.PARTIAL_LINK_TEXT, 'Next→').click()
    

    As:

    WebDriverWait(ff, 15).until(EC.text_to_be_present_in_element((By.PARTIAL_LINK_TEXT, 'Next→'), 'Next→'))
    ff.find_element(By.PARTIAL_LINK_TEXT, 'Next→').click()              
    

  • Your modified code block will be:

    # -*- coding: utf-8 -*-
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait
    import time
    import random
    
    
    """ Set Global Variables
    """
    ua_strings = ['Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36']
    already_scraped_product_titles = []
    
    
    
    """ Create Instances of WebDriver
    """
    def create_webdriver_instance():
        ua_string = random.choice(ua_strings)
        profile = webdriver.FirefoxProfile()
        profile.set_preference('general.useragent.override', ua_string)
        options = Options()
        options.add_argument('--headless')
        return webdriver.Firefox(profile)
    
    
    
    """ Construct List of UA Strings
    """
    def fetch_ua_strings():
        ff = create_webdriver_instance()
        ff.get('https://techblog.willshouse.com/2012/01/03/most-common-user-agents/')
        ua_strings_ff_eles = ff.find_elements_by_xpath('//td[@class="useragent"]')
        for ua_string in ua_strings_ff_eles:
            if 'mobile' not in ua_string.text and 'Trident' not in ua_string.text:
                ua_strings.append(ua_string.text)
        ff.quit()
    
    
    
    """ Log in to Amazon to Use SiteStripe in order to Generate Affiliate Links
    """
    def log_in(ff):
        ff.find_element(By.XPATH, '//a[@id="nav-link-yourAccount"] | //a[@id="nav-link-accountList"]').click()
        ff.find_element(By.ID, 'ap_email').send_keys('anthony_falez@hotmail.com')
        ff.find_element(By.ID, 'continue').click()
        ff.find_element(By.ID, 'ap_password').send_keys('lo0kyLoOkYig0t4h')
        ff.find_element(By.NAME, 'rememberMe').click()
        ff.find_element(By.ID, 'signInSubmit').click()
    
    
    
    """ Build Lists of Product Page URLs
    """
    def initiate_crawl():
        def refresh_page(url):
            ff = create_webdriver_instance()
            ff.get(url)
            ff.find_element(By.XPATH, '//*[@id="FilterItemView_sortOrder_dropdown"]/div/span[2]/span/span/span/span').click()
            ff.find_element(By.XPATH, '//a[contains(text(), "Discount - High to Low")]').click()
            items = WebDriverWait(ff, 15).until(
                EC.visibility_of_all_elements_located((By.XPATH, '//div[contains(@id, "100_dealView_")]'))
            )
            for count, item in enumerate(items):
                slashed_price = item.find_elements(By.XPATH, './/span[contains(@class, "a-text-strike")]')
                active_deals = item.find_elements(By.XPATH, './/*[contains(text(), "Add to Cart")]')
                # For Groups of Items on Sale
                # active_deals = //*[contains(text(), "Add to Cart") or contains(text(), "View Deal")]
                if len(slashed_price) > 0 and len(active_deals) > 0:
                    product_title = item.find_element(By.ID, 'dealTitle').text
                    if product_title not in already_scraped_product_titles:
                        already_scraped_product_titles.append(product_title)
                        url = ff.current_url
                        # Scrape Details of Each Deal
                        #extract(ff, item.find_element(By.ID, 'dealImage').get_attribute('href'))
                        print(product_title[:10])
                        ff.quit()
                        refresh_page(url)
                        break
                if count+1 is len(items):
                    try:
                        print('')
                        print('new page')
                        WebDriverWait(ff, 15).until(EC.text_to_be_present_in_element((By.PARTIAL_LINK_TEXT, 'Next→'), 'Next→'))
                        ff.find_element(By.PARTIAL_LINK_TEXT, 'Next→').click()
                        time.sleep(10)
                        url = ff.current_url
                        print(url)
                        print('')
                        ff.quit()
                        refresh_page(url)
                    except Exception as error:
                        """
                        ff.find_element(By.XPATH, '//*[@id="pagination-both-004143081429407891"]/ul/li[9]/a').click()
                        url = ff.current_url
                        ff.quit()
                        refresh_page(url)
                        """
                        print('cannot find ff.find_element(By.PARTIAL_LINK_TEXT, "Next→")')
                        print('Because of... {}'.format(error))
                        ff.quit()
    
        refresh_page('https://www.amazon.ca/gp/goldbox/ref=gbps_ftr_s-3_4bc8_dct_10-?gb_f_c2xvdC0z=sortOrder:BY_SCORE,discountRanges:10-25%252C25-50%252C50-70%252C70-&pf_rd_p=f5836aee-0969-4c39-9720-4f0cacf64bc8&pf_rd_s=slot-3&pf_rd_t=701&pf_rd_i=gb_main&pf_rd_m=A3DWYIK6Y9EEQB&pf_rd_r=CQ7KBNXT36G95190QJB1&ie=UTF8')
    
    #def extract_info(ff, url):
    fetch_ua_strings()
    initiate_crawl()
    

  • Console Output: With Selenium v3.14.0 and Firefox Quantum v62.0.3, I can extract the following output on the console:

    J.Rosée Si
    B.Catcher 
    Bluetooth4
    FRAM G4164
    Major Crim
    20% off Oh
    True Blood
    Prime-Line
    Marathon 3
    True Blood
    B.Catcher 
    4 Film Fav
    True Blood
    Texture Pa
    Westinghou
    True Blood
    ThermoPro 
    ...
    ...
    ...
    

  • Note: I could have optimized your code and performed the same web scraping operations by initializing the Firefox Browser Client only once and traversing the various products and their details. But to preserve your logic and innovation, I have suggested the minimal changes required to get you through.
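
    For what it's worth, a rough sketch of that single-instance approach (my reading of the idea above, untested against the live page, reusing the helpers and locators from the modified code block):

    def crawl_with_single_instance(start_url, max_pages=20):
        # Keep one browser alive and paginate with the Next link instead of
        # quitting and relaunching Firefox for every deal.
        ff = create_webdriver_instance()
        ff.get(start_url)
        for _ in range(max_pages):
            items = WebDriverWait(ff, 15).until(
                EC.visibility_of_all_elements_located(
                    (By.XPATH, '//div[contains(@id, "100_dealView_")]/div[contains(@class, "dealContainer")]')
                )
            )
            for item in items:
                title = item.find_element(By.ID, 'dealTitle').text
                if title not in already_scraped_product_titles:
                    already_scraped_product_titles.append(title)
            try:
                ff.find_element(By.PARTIAL_LINK_TEXT, 'Next→').click()
                # In practice an extra wait may be needed here for the next page to render.
            except Exception:
                break  # no Next link on the last page
        ff.quit()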