How to sort Scrapy item info in a customized order?
The default field order in Scrapy is alphabetical. I have read some posts that suggest using OrderedDict to output items in a customized order.
I wrote a spider following this page:
How to get order of fields in Scrapy item
My items.py:
import scrapy
import six
from collections import OrderedDict


class OrderedItem(scrapy.Item):
    # Store values in an OrderedDict so fields keep their assignment order.
    def __init__(self, *args, **kwargs):
        self._values = OrderedDict()
        if args or kwargs:
            for k, v in six.iteritems(dict(*args, **kwargs)):
                self[k] = v


class InfoItem(OrderedItem):
    name = scrapy.Field()
    phone = scrapy.Field()
    address = scrapy.Field()
The simple spider file:
import scrapy
from info.items import InfoItem


class InfoSpider(scrapy.Spider):
    name = 'Info'
    allowed_domains = ['quotes.money.163.com']
    start_urls = ["http://quotes.money.163.com/f10/gszl_600023.html"]

    def parse(self, response):
        item = InfoItem()
        # Each XPath selects one cell of the company profile table.
        item["name"] = response.xpath('/html/body/div[2]/div[4]/table/tr[2]/td[2]/text()').extract()
        item["phone"] = response.xpath('/html/body/div[2]/div[4]/table/tr[7]/td[4]/text()').extract()
        item["address"] = response.xpath('/html/body/div[2]/div[4]/table/tr[2]/td[4]/text()').extract()
        item.items()  # return value is unused
        yield item
The Scrapy log output when running the spider:
2019-04-25 13:45:01 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.money.163.com/f10/gszl_600023.html>
{'address': ['浙江省杭州市天目山路152号浙能大厦'],'name': ['浙能电力'],'phone': ['0571-87210223']}
Why can't I get the desired order shown below?
{'name': ['浙能电力'],'phone': ['0571-87210223'],'address': ['浙江省杭州市天目山路152号浙能大厦']}
Thanks to Gallaecio's advice, I added the following to settings.py:
FEED_EXPORT_FIELDS=['name','phone','address']
Run the spider and export to a CSV file:
scrapy crawl info -o info.csv
The fields now appear in my customized order:
cat info.csv
name,phone,address
浙能电力,0571-87210223,浙江省杭州市天目山路152号浙能大
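The effect of an explicit field list on column order can be illustrated without Scrapy. The following is a minimal standard-library sketch, not Scrapy's exporter code: FIELD_ORDER and row are hypothetical names, and csv.DictWriter stands in for the CSV feed export.

```python
import csv
import io

# Hypothetical stand-in for FEED_EXPORT_FIELDS: the desired column order.
FIELD_ORDER = ["name", "phone", "address"]

# The row is deliberately built in a different (alphabetical) key order.
row = {
    "address": "浙江省杭州市天目山路152号浙能大厦",
    "name": "浙能电力",
    "phone": "0571-87210223",
}

buffer = io.StringIO()
# DictWriter emits columns in the order given by fieldnames, not by the dict.
writer = csv.DictWriter(buffer, fieldnames=FIELD_ORDER)
writer.writeheader()
writer.writerow(row)

csv_text = buffer.getvalue()
print(csv_text)
```

The header and the data row both follow FIELD_ORDER regardless of how the row dictionary was built, which is the same kind of behavior seen in info.csv.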
However, look at Scrapy's debug output:
2019-04-26 00:16:38 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.money.163.com/f10/gszl_600023.html>
{'address': ['浙江省杭州市天目山路152号浙能大厦'],
'name': ['浙能电力'],
'phone': ['0571-87210223']}
How can I make the debug output use my customized order? That is, how do I get the following debug output?
2019-04-26 00:16:38 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.money.163.com/f10/gszl_600023.html>
{'name': ['浙能电力'],
'phone': ['0571-87210223'],
'address': ['浙江省杭州市天目山路152号浙能大厦'],}
The problem is in the __repr__ function of Item. Originally its code is:
def __repr__(self):
    return pformat(dict(self))
So even if you store the item's values in an OrderedDict and expect the fields to keep that order, this function converts the item to a plain dict() and pretty-prints it with pformat, so your order is lost.
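The reordering can be reproduced with the standard library alone. This is a small sketch under stated assumptions: values is a hypothetical stand-in for the item's contents, and the sort_dicts argument requires Python 3.8 or later. By default pprint.pformat sorts the keys of a plain dict, which is why the output comes out alphabetical.

```python
from collections import OrderedDict
from pprint import pformat

# Hypothetical stand-in for the item's values, in the desired order.
values = OrderedDict([
    ("name", ["a"]),
    ("phone", ["1"]),
    ("address", ["x"]),
])

# Same call shape as the original __repr__: plain dict, default pformat.
as_sorted = pformat(dict(values))

# Python 3.8+: sort_dicts=False keeps the dict's insertion order.
as_inserted = pformat(dict(values), sort_dicts=False)

print(as_sorted)
print(as_inserted)
```

The first string lists address before name, while the second keeps name, phone, address.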
So I suggest overriding it in whatever way you like, for example:
import json
from collections import OrderedDict

import scrapy
import six


class OrderedItem(scrapy.Item):
    def __init__(self, *args, **kwargs):
        self._values = OrderedDict()
        if args or kwargs:
            for k, v in six.iteritems(dict(*args, **kwargs)):
                self[k] = v

    def __repr__(self):
        # __repr__ must return a string; json.dumps keeps the OrderedDict order.
        return json.dumps(OrderedDict(self), ensure_ascii=False)
Now you can get this output:
2019-04-30 18:56:20 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.money.163.com/f10/gszl_600023.html>
{"name": ["\u6d59\u80fd\u7535\u529b"], "phone": ["0571-87210223"], "address": ["\u6d59\u6c5f\u7701\u676d\u5dde\u5e02\u5929\u76ee\u5c71\u8def152\u53f7\u6d59\u80fd\u5927\u53a6"]}
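Whether the text shows up as \uXXXX escapes or as readable characters is controlled by the ensure_ascii parameter of json.dumps. A minimal standard-library sketch, with record as a hypothetical sample and no Scrapy involved:

```python
import json
from collections import OrderedDict

# Hypothetical sample record, in the desired field order.
record = OrderedDict([
    ("name", ["浙能电力"]),
    ("phone", ["0571-87210223"]),
])

# ensure_ascii defaults to True: non-ASCII characters become \uXXXX escapes.
escaped = json.dumps(record)

# ensure_ascii=False leaves non-ASCII characters as they are.
readable = json.dumps(record, ensure_ascii=False)

print(escaped)
print(readable)
```

Both strings keep the name, phone order of the OrderedDict. Which form ends up in your log may also depend on the Python version and logging setup, so treat this as an illustration rather than a guarantee.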