Scrapy :: JSON Export Issue

Problem description:

So, I have spent quite a bit of time going through the Scrapy documentation and tutorials, and I have since been plugging away at a very basic crawler. However, I am not able to get the output into a JSON file. I feel like I am missing something obvious, but I haven't been able to turn anything up after looking at a number of other examples, and trying several different things out.

To be thorough, I will include all of the relevant code. What I am trying to get here is some specific items and their associated prices. The prices will change fairly often, and the items will change with much lower frequency.

Here is my items.py:

from scrapy.item import Item, Field

class CartItems(Item):
    url = Field()
    name = Field()
    price = Field()

Here is the spider:

from scrapy.selector import HtmlXPathSelector                                                                                                                                        
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.item import Item, Field

from Example.items import CartItems

class DomainSpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/path/to/desired/page']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        cart = CartItems()
        cart['url'] = hxs.select('//title/text()').extract()
        cart['name'] = hxs.select('//td/text()').extract()[1]
        cart['price'] = hxs.select('//td/text()').extract()[2]
        return cart

If, for example, I run hxs.select('//td/text()').extract()[1] from the Scrapy shell against the URL http://www.example.com/path/to/desired/page, I get the following response:

u'Text field I am trying to download'
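(A side note on what extract() returns, since it matters for the field assignments above: it always yields a list of unicode strings, so cart['url'] ends up holding a whole list, while the [1] indexing picks out a single string. A plain-Python sketch, with a made-up list standing in for the selector results:)

```python
# Made-up values standing in for hxs.select('//td/text()').extract().
extracted = [u'First cell', u'Text field I am trying to download', u'19.99']

whole_list = extracted      # like cart['url']: the entire list is stored
one_string = extracted[1]   # like cart['name']: a single unicode string

print(type(whole_list).__name__)
print(one_string)
```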

Okay, so I wrote a pipeline modeled on one I found in the wiki (a section I somehow missed while digging through the docs over the last few days), altered only to use JSON instead of XML.

from scrapy.xlib.pydispatch import dispatcher
from scrapy import signals
from scrapy.contrib.exporter import JsonItemExporter

class JsonExportPipeline(object):

    def __init__(self):
        dispatcher.connect(self.spider_opened, signals.spider_opened)
        dispatcher.connect(self.spider_closed, signals.spider_closed)
        self.files = {}

    def spider_opened(self, spider):
        file = open('%s_items.json' % spider.name, 'w+b')
        self.files[spider] = file
        self.exporter = JsonItemExporter(file)
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

This does output a file "example.com_items.json", but all it contains is "[]", so something is still not right here. Is the issue with the spider, or is the pipeline not set up correctly? Clearly I am missing something, so if someone could nudge me in the right direction, or link me to any examples that might help, it would be much appreciated.

I copied your JsonExportPipeline code and tested it on my machine; it works fine with my spider.
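(For completeness: that assumes the pipeline is enabled in settings.py. Since your example.com_items.json file does get created, the spider_opened signal clearly fires, so this wiring is evidently already in place on your end. Assuming the pipeline lives in Example/pipelines.py, the entry looks something like:)

```python
# settings.py -- in Scrapy versions from the scrapy.contrib era,
# ITEM_PIPELINES is a list of dotted paths:
ITEM_PIPELINES = ['Example.pipelines.JsonExportPipeline']

# In later Scrapy releases it became a dict mapping path to order:
# ITEM_PIPELINES = {'Example.pipelines.JsonExportPipeline': 300}
```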

So I think you should check the page:

start_urls = ['http://www.example.com/path/to/desired/page']

Maybe something is wrong with how your parse function extracts the content, i.e. this function:

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    cart = CartItems()
    cart['url'] = hxs.select('//title/text()').extract()
    cart['name'] = hxs.select('//td/text()').extract()[1]
    cart['price'] = hxs.select('//td/text()').extract()[2]
    return cart
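One concrete thing to verify in the shell: //td/text() has to return at least three text nodes on that page, or the [1]/[2] indexing raises IndexError, Scrapy logs the failure, and no item ever reaches the exporter -- leaving exactly the empty "[]" you are seeing. A defensive version of the extraction step, sketched in plain Python (extracted stands in for hxs.select('//td/text()').extract()):

```python
def build_cart(extracted):
    # 'extracted' stands in for hxs.select('//td/text()').extract().
    # Return None instead of raising IndexError when the page does not
    # contain the expected table cells.
    if len(extracted) < 3:
        return None
    return {'name': extracted[1], 'price': extracted[2]}

# With the expected cells present, an item comes back:
print(build_cart(['header', 'Widget', '19.99']))
# With an unexpected page layout, None comes back instead of a crash:
print(build_cart([]))
```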