Scrapy spider only crawls the links on the first page
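Right now the spider only crawls the links on the first page and the links found inside those linked pages. In other words, if the pager currently shows pages 1-5, only the first five pages get crawled; even though page 5 shows links to pages 6-9, those are never crawled. My understanding is that the spider is not following the links it extracts from the pages it has already crawled, but I can't work out why... I haven't set a crawl depth limit either. I've been stuck on this for a long time and still don't get it. The code is below; any help from the experts would be much appreciated orz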

# Imports assume Scrapy 1.0 or later. The items module path is a guess
# based on the item name and may differ in the original project.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.loader import ItemLoader
from tutorial.items import TutorialItem


class TestSpider(CrawlSpider):
    name = 'testSpider'
    num = 0
    allow_domain = ['http://wz.sun0769.com/']
    start_urls = ['http://wz.sun0769.com/index.php/question/questionType?type=4']

    rules = {
        Rule(LxmlLinkExtractor(allow='page')),
        Rule(LxmlLinkExtractor(allow='/index\.php/question/questionType\?type=4$')),
        Rule(LxmlLinkExtractor(allow='/html/question/\d+/\d+\.shtml$'), callback='parse_content')
    }

    # XPath expressions used to pull fields out of a question page
    _x_query = {
        'title': '''//div[contains(@class, 'pagecenter p3')]/div/div/div[contains(@class,'cleft')]/strong/text()''',
        'content': '''//div[contains(@class, 'c1 text14_2')]/text()''',
        'content_first': '''//div[contains(@class, 'contentext')]/text()'''
    }

    def parse_content(self, response):
        bbs_item_loader = ItemLoader(item=TutorialItem(), response=response)

        # Try the 'content_first' XPath first; fall back to 'content' if it finds nothing
        content = response.xpath(self._x_query['content_first']).extract()
        if len(content) == 0:
            content = str(response.xpath(self._x_query['content']).extract()[0].encode('utf-8'))
        else:
            content = str(content[0].encode('utf-8'))

        # The question number is the part after the last ':' in the last
        # space-separated token of the title
        title = str(response.xpath(self._x_query['title']).extract()[0].encode('utf-8'))
        title_list = title.split(' ')
        number = title_list[-1]
        number = number.split(':')[-1]
        url = str(response.url)

        bbs_item_loader.add_value('url', url)
        bbs_item_loader.add_value('number', number)
        bbs_item_loader.add_value('title', title)
        bbs_item_loader.add_value('content', content)
        # bbs_item_loader.add_xpath('content', self._x_query['content'])

        return bbs_item_loader.load_item()

------Solution----------------------



Rule(LxmlLinkExtractor(allow='/index\.php/question/questionType\?type=4$')),

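This rule is presumably the one meant to pick up the pagination links. The $ at the end of the regular expression should be removed: with it, the pattern only matches URLs that end exactly at type=4, so any URL with something after that will not match.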
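
To see what the trailing $ does, here is a minimal sketch using only Python's re module, no Scrapy needed. It assumes the pattern is applied as a search over the full URL, which as far as I know is how Scrapy's link extractors treat allow. The "&page=30" suffix is a made-up pagination URL for illustration, and the allowed() helper is hypothetical; neither comes from the original post.

```python
import re

# Pattern copied from the second rule in the question (note the trailing "$").
ANCHORED = re.compile(r'/index\.php/question/questionType\?type=4$')
# Same pattern with the "$" removed, as the answer suggests.
UNANCHORED = re.compile(r'/index\.php/question/questionType\?type=4')

FIRST_PAGE = 'http://wz.sun0769.com/index.php/question/questionType?type=4'
# Hypothetical pagination URL: the "&page=30" suffix is an assumption,
# used only to show a URL that carries something after "type=4".
LATER_PAGE = FIRST_PAGE + '&page=30'


def allowed(pattern, url):
    """Return True if the pattern is found anywhere in the URL."""
    return pattern.search(url) is not None


if __name__ == '__main__':
    for url in (FIRST_PAGE, LATER_PAGE):
        # Columns: URL, matches with "$", matches without "$"
        print(url, allowed(ANCHORED, url), allowed(UNANCHORED, url))
```

The anchored pattern accepts only the start URL, while the unanchored one also accepts the longer URL. This only illustrates the regex behaviour the answer points at; whether it fully explains the crawl stopping depends on what the site's pagination URLs actually look like.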
