[Heritrix Basics, Part 4] A Full Code Walkthrough of Starting a Crawl    Category: H3_NUTCH    2014-06-04 20:10
After a job has been created, it has to be started. The full start-up flow is as follows:
1. Start the job from the web console.
2. index.jsp — the console page; viewing its source shows how the start request is submitted.
3. action.jsp — receives the submitted request and hands it to the job handler.
4. CrawlJobHandler (a Java class, despite the ".jsp" label in the original post) — its methods then run in the order below.
(1)
    public void startCrawler() {
        running = true;
        if (pendingCrawlJobs.size() > 0 && isCrawling() == false) {
            // Ok, can just start the next job
            startNextJob();
        }
    }
(2)
    protected final void startNextJob() {
        synchronized (this) {
            if (startingNextJob != null) {
                try {
                    startingNextJob.join();
                } catch (InterruptedException e) {
                    e.printStackTrace();
                    return;
                }
            }
            startingNextJob = new Thread(new Runnable() {
                public void run() {
                    startNextJobInternal();
                }
            }, "StartNextJob");
            startingNextJob.start();
        }
    }
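The pattern in startNextJob() is worth isolating: it joins any previous starter thread so that only one job start is in flight at a time, then launches the next start on a fresh, named thread. A minimal standalone sketch of that pattern (not Heritrix code; the class and method names here are illustrative):

```java
// Sketch of the "join the previous starter, then spawn a named
// starter thread" pattern used by CrawlJobHandler.startNextJob().
public class StarterThreadDemo {
    private Thread startingNextJob;

    public synchronized void startNext(Runnable startLogic) {
        if (startingNextJob != null) {
            try {
                // Serialize starts: wait for the previous one to finish.
                startingNextJob.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
        startingNextJob = new Thread(startLogic, "StartNextJob");
        startingNextJob.start();
    }

    public static void main(String[] args) throws Exception {
        StarterThreadDemo demo = new StarterThreadDemo();
        StringBuilder log = new StringBuilder();
        demo.startNext(() -> log.append("job1;"));
        demo.startNext(() -> log.append("job2;")); // joins job1's thread first
        demo.startingNextJob.join();
        System.out.println(log); // job1;job2;
    }
}
```

Because each call joins the previous starter thread before spawning the next, the two appends never interleave, even though each runs on its own thread.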
(3)
    protected void startNextJobInternal() {
        if (pendingCrawlJobs.size() == 0 || isCrawling()) {
            // No job ready or already crawling.
            return;
        }
        this.currentJob = (CrawlJob)pendingCrawlJobs.first();
        assert pendingCrawlJobs.contains(currentJob) :
            "pendingCrawlJobs is in an illegal state";
        pendingCrawlJobs.remove(currentJob);
        try {
            this.currentJob.setupForCrawlStart();
            // This is ugly but needed so I can clear the currentJob
            // reference in the crawlEnding and update the list of completed
            // jobs. Also, crawlEnded can startup next job.
            this.currentJob.getController().addCrawlStatusListener(this);
            // now, actually start
            this.currentJob.getController().requestCrawlStart();
        } catch (InitializationException e) {
            loadJob(getStateJobFile(this.currentJob.getDirectory()));
            this.currentJob = null;
            startNextJobInternal(); // Load the next job if there is one.
        }
    }
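The queue-handling logic in startNextJobInternal() follows a simple recipe: pop the first pending job, try to start it, and on an initialization failure clear the current job and recurse to try the next one. A self-contained sketch of that recipe, with a plain TreeSet of job names standing in for Heritrix's pending-job collection (all names here are hypothetical):

```java
import java.util.TreeSet;

// Sketch of startNextJobInternal()'s queue handling: pop the first
// pending job, attempt to start it, and on failure fall through to
// the next pending job.
public class PendingQueueDemo {
    private final TreeSet<String> pendingCrawlJobs = new TreeSet<>();
    private String currentJob;

    void startNextJobInternal() {
        if (pendingCrawlJobs.isEmpty() || currentJob != null) {
            return; // nothing pending, or already crawling
        }
        currentJob = pendingCrawlJobs.first();
        pendingCrawlJobs.remove(currentJob);
        try {
            startJob(currentJob); // stand-in for setupForCrawlStart() etc.
        } catch (RuntimeException e) {
            currentJob = null;
            startNextJobInternal(); // load the next job if there is one
        }
    }

    // Pretend initialization fails for jobs whose name starts with "bad".
    void startJob(String job) {
        if (job.startsWith("bad")) throw new RuntimeException("init failed");
    }

    public static void main(String[] args) {
        PendingQueueDemo d = new PendingQueueDemo();
        d.pendingCrawlJobs.add("bad-job");   // sorts first, fails to start
        d.pendingCrawlJobs.add("good-job");
        d.startNextJobInternal();
        System.out.println(d.currentJob); // good-job
    }
}
```

The recursion means one broken job cannot stall the whole queue: the handler simply moves on to the next pending job.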
(4) requestCrawlStart() — called on the controller by step (3) above:
    public void requestCrawlStart() {
        runProcessorInitialTasks();
        sendCrawlStateChangeEvent(STARTED, CrawlJob.STATUS_PENDING);
        String jobState;
        state = RUNNING;
        jobState = CrawlJob.STATUS_RUNNING;
        sendCrawlStateChangeEvent(this.state, jobState);
        // A proper exit will change this value.
        this.sExit = CrawlJob.STATUS_FINISHED_ABNORMAL;
        Thread statLogger = new Thread(statistics);
        statLogger.setName("StatLogger");
        statLogger.start();
        frontier.start();
    }
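The core of requestCrawlStart() is a listener-notification pattern: the controller moves through its states (STARTED, then RUNNING) and fires a state-change event to every registered CrawlStatusListener at each step, before kicking off the stats thread and the frontier. A minimal sketch of that pattern under simplified, assumed names (string states instead of Heritrix's constants):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the state-change notification in requestCrawlStart():
// transition through the states in order and notify each registered
// listener of every transition.
public class CrawlStateDemo {
    interface CrawlStatusListener {
        void crawlStateChanged(String newState);
    }

    private final List<CrawlStatusListener> listeners = new ArrayList<>();
    private String state = "PENDING";

    void addCrawlStatusListener(CrawlStatusListener l) {
        listeners.add(l);
    }

    void requestCrawlStart() {
        setState("STARTED");
        setState("RUNNING");
        // Heritrix would now spawn the "StatLogger" thread
        // and call frontier.start().
    }

    private void setState(String newState) {
        state = newState;
        for (CrawlStatusListener l : listeners) {
            l.crawlStateChanged(newState);
        }
    }

    public static void main(String[] args) {
        CrawlStateDemo c = new CrawlStateDemo();
        StringBuilder seen = new StringBuilder();
        c.addCrawlStatusListener(s -> seen.append(s).append(";"));
        c.requestCrawlStart();
        System.out.println(seen); // STARTED;RUNNING;
    }
}
```

This is why step (3) registers the job handler as a listener before calling requestCrawlStart(): the handler relies on these callbacks to clear currentJob when the crawl ends and to start the next pending job.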
Copyright notice: this is an original post by the blog author; do not reproduce without the author's permission.