Scrapy returns empty output when extracting an element from a table using XPath
I have been trying to scrape this website, which has details of oil wells in Colorado: https://cogcc.state.co.us/cogis/FacilityDetail.asp?facid=12307555&type=WELL
Scrapy crawls the site and returns the URL, but when I try to extract an element inside a table using its XPath (the county of the oil well), all I get is an empty output, i.e. [].
This happens for any element I try to access on the page.
Here is my spider:
import scrapy
import json

class coloradoSpider(scrapy.Spider):
    name = "colorado"
    allowed_domains = ["cogcc.state.co.us"]
    start_urls = ["https://cogcc.state.co.us/cogis/ProductionWellMonthly.asp?APICounty=123&APISeq=07555&APIWB=00&Year=All"]

    def parse(self, response):
        url = response.url
        response.selector.remove_namespaces()
        variable = response.xpath("/html/body/blockquote/font/font/table/tbody/tr[3]/th[3]").extract()
        print url, variable
Here is the output:
2015-05-13 20:14:54+0530 [scrapy] INFO: Scrapy 0.24.6 started (bot: tutorial)
2015-05-13 20:14:54+0530 [scrapy] INFO: Optional features available: ssl, http11
2015-05-13 20:14:54+0530 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'SPIDER_MODULES': ['tutorial.spiders'], 'BOT_NAME': 'tutorial'}
2015-05-13 20:14:54+0530 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-05-13 20:14:55+0530 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-05-13 20:14:55+0530 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-05-13 20:14:56+0530 [scrapy] INFO: Enabled item pipelines:
2015-05-13 20:14:56+0530 [colorado] INFO: Spider opened
2015-05-13 20:14:56+0530 [colorado] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-05-13 20:14:56+0530 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-05-13 20:14:56+0530 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2015-05-13 20:15:02+0530 [colorado] DEBUG: Crawled (200) <GET https://cogcc.state.co.us/cogis/ProductionWellMonthly.asp?APICounty=123&APISeq=07555&APIWB=00&Year=All> (referer: None)
https://cogcc.state.co.us/cogis/ProductionWellMonthly.asp?APICounty=123&APISeq=07555&APIWB=00&Year=All []
2015-05-13 20:15:02+0530 [colorado] INFO: Closing spider (finished)
2015-05-13 20:15:02+0530 [colorado] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 292,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 366770,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 5, 13, 14, 45, 2, 349000),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2015, 5, 13, 14, 44, 56, 77000)}
2015-05-13 20:15:02+0530 [colorado] INFO: Spider closed (finished)
If I go back up a couple of nodes in the XPath, I get output where Scrapy returns the table as HTML.
Thanks!
Seems like it's an XPath problem. During development of this site the tbody tag was probably omitted, but the browser automatically inserts it when the page is rendered. You can get more info about this here.
So if you need the county's value (WELD #123) from the given page, then a possible XPath would be:
In [20]: response.xpath('/html/body/font/table/tr[6]/td[2]//text()').extract()
Out[20]: [u'WELD #123']
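The tbody mismatch can be reproduced offline. The sketch below uses lxml (the parser Scrapy's selectors are built on); the HTML snippet is a made-up stand-in for the real page, not the actual COGIS markup. It shows a browser-copied XPath containing tbody matching nothing against the raw HTML, while the tbody-free path finds the cell:

```python
# Illustration of the tbody pitfall (minimal assumed markup, not the real
# COGIS page): browsers insert <tbody> into rendered tables, but lxml's
# HTML parser -- like the raw HTML the server sends -- does not.
from lxml import html

raw = """
<html><body><font><table>
  <tr><th>County</th><td>WELD #123</td></tr>
</table></font></body></html>
"""
doc = html.fromstring(raw)

# XPath copied from browser dev tools (includes tbody): matches nothing.
print(doc.xpath('//table/tbody/tr/td/text()'))   # []

# The same path without tbody finds the cell.
print(doc.xpath('//table/tr/td/text()'))         # ['WELD #123']

# A more change-resistant option: anchor on the header text rather than a
# fixed row/column position, so the query survives layout shuffles.
print(doc.xpath('//td[preceding-sibling::th[contains(., "County")]]/text()'))
```

Position-based absolute paths like `/html/body/font/table/tr[6]/td[2]` break whenever the page layout changes, so anchoring on label text as in the last query tends to be more durable.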