[Heritrix Basics, Part 4] A Full Code Walkthrough of Starting a Crawl    Category: H3_NUTCH    2014-06-04 20:10
After a job has been created, it has to be started. The full start-up flow is as follows:
1. Start the job from the web console.
2. index.jsp — the console page; viewing its source shows how the start request is submitted.
3. action.jsp — receives the submitted request and hands it to the job handler.
4. CrawlJobHandler (a Java class, despite the ".jsp" label in the original post) — its methods then run in the order below.
(1)
    public void startCrawler() {
        running = true;
        if (pendingCrawlJobs.size() > 0 && isCrawling() == false) {
            // Ok, can just start the next job
            startNextJob();
        }
    }
(2)
    protected final void startNextJob() {
        synchronized (this) {
            if (startingNextJob != null) {
                try {
                    startingNextJob.join();
                } catch (InterruptedException e) {
                    e.printStackTrace();
                    return;
                }
            }
            startingNextJob = new Thread(new Runnable() {
                public void run() {
                    startNextJobInternal();
                }
            }, "StartNextJob");
            startingNextJob.start();
        }
    }
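The pattern in startNextJob() is worth isolating: it joins any previous starter thread so that only one job start is in flight at a time, then launches the next start on a fresh, named thread. A minimal standalone sketch of that pattern (not Heritrix code; the class and method names here are illustrative):

```java
// Sketch of the "join the previous starter, then spawn a named
// starter thread" pattern used by CrawlJobHandler.startNextJob().
public class StarterThreadDemo {
    private Thread startingNextJob;

    public synchronized void startNext(Runnable startLogic) {
        if (startingNextJob != null) {
            try {
                // Serialize starts: wait for the previous one to finish.
                startingNextJob.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
        startingNextJob = new Thread(startLogic, "StartNextJob");
        startingNextJob.start();
    }

    public static void main(String[] args) throws Exception {
        StarterThreadDemo demo = new StarterThreadDemo();
        StringBuilder log = new StringBuilder();
        demo.startNext(() -> log.append("job1;"));
        demo.startNext(() -> log.append("job2;")); // joins job1's thread first
        demo.startingNextJob.join();
        System.out.println(log); // job1;job2;
    }
}
```

Because each call joins the previous starter thread before spawning the next, the two appends never interleave, even though each runs on its own thread.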
(3)
    protected void startNextJobInternal() {
        if (pendingCrawlJobs.size() == 0 || isCrawling()) {
            // No job ready or already crawling.
            return;
        }
        this.currentJob = (CrawlJob)pendingCrawlJobs.first();
        assert pendingCrawlJobs.contains(currentJob) :
            "pendingCrawlJobs is in an illegal state";
        pendingCrawlJobs.remove(currentJob);
        try {
            this.currentJob.setupForCrawlStart();
            // This is ugly but needed so I can clear the currentJob
            // reference in the crawlEnding and update the list of completed
            // jobs. Also, crawlEnded can startup next job.
            this.currentJob.getController().addCrawlStatusListener(this);
            // now, actually start
            this.currentJob.getController().requestCrawlStart();
        } catch (InitializationException e) {
            loadJob(getStateJobFile(this.currentJob.getDirectory()));
            this.currentJob = null;
            startNextJobInternal(); // Load the next job if there is one.
        }
    }
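The queue-handling logic in startNextJobInternal() follows a simple recipe: pop the first pending job, try to start it, and on an initialization failure clear the current job and recurse to try the next one. A self-contained sketch of that recipe, with a plain TreeSet of job names standing in for Heritrix's pending-job collection (all names here are hypothetical):

```java
import java.util.TreeSet;

// Sketch of startNextJobInternal()'s queue handling: pop the first
// pending job, attempt to start it, and on failure fall through to
// the next pending job.
public class PendingQueueDemo {
    private final TreeSet<String> pendingCrawlJobs = new TreeSet<>();
    private String currentJob;

    void startNextJobInternal() {
        if (pendingCrawlJobs.isEmpty() || currentJob != null) {
            return; // nothing pending, or already crawling
        }
        currentJob = pendingCrawlJobs.first();
        pendingCrawlJobs.remove(currentJob);
        try {
            startJob(currentJob); // stand-in for setupForCrawlStart() etc.
        } catch (RuntimeException e) {
            currentJob = null;
            startNextJobInternal(); // load the next job if there is one
        }
    }

    // Pretend initialization fails for jobs whose name starts with "bad".
    void startJob(String job) {
        if (job.startsWith("bad")) throw new RuntimeException("init failed");
    }

    public static void main(String[] args) {
        PendingQueueDemo d = new PendingQueueDemo();
        d.pendingCrawlJobs.add("bad-job");   // sorts first, fails to start
        d.pendingCrawlJobs.add("good-job");
        d.startNextJobInternal();
        System.out.println(d.currentJob); // good-job
    }
}
```

The recursion means one broken job cannot stall the whole queue: the handler simply moves on to the next pending job.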
(4) requestCrawlStart() — called on the controller by step (3) above:
    public void requestCrawlStart() {
        runProcessorInitialTasks();
        sendCrawlStateChangeEvent(STARTED, CrawlJob.STATUS_PENDING);
        String jobState;
        state = RUNNING;
        jobState = CrawlJob.STATUS_RUNNING;
        sendCrawlStateChangeEvent(this.state, jobState);
        // A proper exit will change this value.
        this.sExit = CrawlJob.STATUS_FINISHED_ABNORMAL;
        Thread statLogger = new Thread(statistics);
        statLogger.setName("StatLogger");
        statLogger.start();
        frontier.start();
    }
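The core of requestCrawlStart() is a listener-notification pattern: the controller moves through its states (STARTED, then RUNNING) and fires a state-change event to every registered CrawlStatusListener at each step, before kicking off the stats thread and the frontier. A minimal sketch of that pattern under simplified, assumed names (string states instead of Heritrix's constants):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the state-change notification in requestCrawlStart():
// transition through the states in order and notify each registered
// listener of every transition.
public class CrawlStateDemo {
    interface CrawlStatusListener {
        void crawlStateChanged(String newState);
    }

    private final List<CrawlStatusListener> listeners = new ArrayList<>();
    private String state = "PENDING";

    void addCrawlStatusListener(CrawlStatusListener l) {
        listeners.add(l);
    }

    void requestCrawlStart() {
        setState("STARTED");
        setState("RUNNING");
        // Heritrix would now spawn the "StatLogger" thread
        // and call frontier.start().
    }

    private void setState(String newState) {
        state = newState;
        for (CrawlStatusListener l : listeners) {
            l.crawlStateChanged(newState);
        }
    }

    public static void main(String[] args) {
        CrawlStateDemo c = new CrawlStateDemo();
        StringBuilder seen = new StringBuilder();
        c.addCrawlStatusListener(s -> seen.append(s).append(";"));
        c.requestCrawlStart();
        System.out.println(seen); // STARTED;RUNNING;
    }
}
```

This is why step (3) registers the job handler as a listener before calling requestCrawlStart(): the handler relies on these callbacks to clear currentJob when the crawl ends and to start the next pending job.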
Copyright notice: this is an original post by the blog author; do not reproduce without the author's permission.