Scrapy :: Issues with JSON export
I have spent quite a bit of time going through the Scrapy documentation and tutorials, and I have since been plugging away at a very basic crawler. However, I cannot get the output into a JSON file. I feel like I am missing something obvious, but I haven't turned anything up after looking at a number of other examples and trying several different things.
To be thorough, I will include all of the relevant code. What I am trying to get here is some specific items and their associated prices. The prices will change fairly often, and the items will change much less frequently.
Here is my items.py:
from scrapy.item import Item, Field

class CartItems(Item):
    url = Field()
    name = Field()
    price = Field()
Here is the spider:
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.item import Item, Field
from Example.items import CartItems

class DomainSpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/path/to/desired/page']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        cart = CartItems()
        # extract() returns a list of matched strings
        cart['url'] = hxs.select('//title/text()').extract()
        # name and price are taken from the second and third <td> text nodes
        cart['name'] = hxs.select('//td/text()').extract()[1]
        cart['price'] = hxs.select('//td/text()').extract()[2]
        return cart
If for example I run hxs.select('//td/text()').extract()[1] from the Scrapy shell on the URL http://www.example.com/path/to/desired/page, then I get the following response:
u'Text field I am trying to download'
Okay, so I wrote a pipeline that follows one I found in the wiki (I somehow missed this section when I was digging through this over the last few days), just altered to use JSON instead of XML.
from scrapy.xlib.pydispatch import dispatcher
from scrapy import signals
from scrapy.contrib.exporter import JsonItemExporter

class JsonExportPipeline(object):

    def __init__(self):
        dispatcher.connect(self.spider_opened, signals.spider_opened)
        dispatcher.connect(self.spider_closed, signals.spider_closed)
        self.files = {}

    def spider_opened(self, spider):
        file = open('%s_items.json' % spider.name, 'w+b')
        self.files[spider] = file
        self.exporter = JsonItemExporter(file)
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
This does output a file "example.com_items.json", but all it contains is "[]". So something is still not right here. Is the issue with the spider, or is the pipeline not done correctly? Clearly I am missing something, so if someone could nudge me in the right direction, or link me to any examples that might help, that would be most appreciated.
I copied your JsonExportPipeline code and tested it on my machine. It works fine with my spider.
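For what it's worth, a file containing only "[]" suggests the exporter was started and finished, but no item was ever handed to process_item in between. Here is a minimal stdlib-only sketch of that start/export/finish sequence. Note that TinyJsonArrayWriter and run_export are hypothetical stand-ins I made up for illustration, not Scrapy's JsonItemExporter, and I am assuming the real exporter brackets its output in a similar way. The sample item is placeholder data.

```python
import io
import json


class TinyJsonArrayWriter:
    """Hypothetical stand-in for an item exporter (not Scrapy's class)."""

    def __init__(self, stream):
        self.stream = stream
        self.first_item = True

    def start_exporting(self):
        # Open the JSON array.
        self.stream.write("[")

    def export_item(self, item):
        # Separate items with commas, then write the item as JSON.
        if not self.first_item:
            self.stream.write(",")
        self.first_item = False
        self.stream.write(json.dumps(item))

    def finish_exporting(self):
        # Close the JSON array.
        self.stream.write("]")


def run_export(items):
    """Run the full start/export/finish sequence and return the text written."""
    stream = io.StringIO()
    writer = TinyJsonArrayWriter(stream)
    writer.start_exporting()
    for item in items:
        writer.export_item(item)
    writer.finish_exporting()
    return stream.getvalue()


# No items exported: only the array brackets are written.
empty_output = run_export([])
# One placeholder item exported.
one_item_output = run_export([{"name": "widget", "price": "9.99"}])
```

If your file looks like empty_output here, the pipeline itself is probably doing its job and the spider is simply not producing items.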
So I think you should check the page:
start_urls = ['http://www.example.com/path/to/desired/page']
Maybe something goes wrong when your parse function extracts the content. This is the function I mean:
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    cart = CartItems()
    cart['url'] = hxs.select('//title/text()').extract()
    cart['name'] = hxs.select('//td/text()').extract()[1]
    cart['price'] = hxs.select('//td/text()').extract()[2]
    return cart
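One thing to look for: if '//td/text()' matches fewer than three nodes on the page the spider actually receives, the [1] or [2] lookup raises IndexError, parse returns nothing, and no item reaches the pipeline. Below is a minimal sketch of a guard, using plain Python lists in place of what extract() returns. The helper name pick is hypothetical (not part of Scrapy), and the price string is placeholder data.

```python
def pick(values, index, default=None):
    """Return values[index], or default when the list is too short."""
    try:
        return values[index]
    except IndexError:
        return default


# Stand-ins for the list that extract() would return.
cells_on_good_page = ["header", "Text field I am trying to download", "9.99"]
cells_on_bad_page = []

name = pick(cells_on_good_page, 1)
price = pick(cells_on_good_page, 2)
missing_name = pick(cells_on_bad_page, 1)
```

With something like this in parse, you could log the response URL whenever a field comes back as None, which would tell you quickly whether the XPath is matching what you expect during the crawl.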