Nutch 1.4: Setting Up Scheduled Crawls
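Nutch 1.4 can crawl automatically on a schedule when combined with Linux scheduled tasks. For background on Linux scheduling, see "Linux commands for scheduled tasks: at and crontab".
The steps are as follows: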
1. First, list the current user's crontab entries:
    crontab -l
   If the output is "no crontab for ***", no crontab has been defined for that user yet.
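2. Edit the crontab:
    crontab -e
   and add an entry such as:
    */10 * * * * /home/*/*.sh
   This entry runs the script every 10 minutes. The *.sh file contains the nutch crawl command, e.g. crawl.
   Pay attention to the account the job runs under. Here it is root; if you use another account, change the account name accordingly. Also give the *.sh file execute permission (chmod +x).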
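As an illustration of what the *.sh file in step 2 might contain, here is a minimal sketch of a wrapper script. The install path /opt/nutch-1.4, the log path, and the function name run_crawl are assumptions made for this example and are not taken from the original setup; the -depth and -topN values are placeholders to adjust.

```shell
#!/bin/sh
# Hypothetical wrapper script for cron; every path below is an assumption.
NUTCH_HOME="${NUTCH_HOME:-/opt/nutch-1.4}"      # assumed Nutch install directory
LOG_FILE="${LOG_FILE:-/tmp/nutch-crawl.log}"    # assumed log file location

run_crawl() {
    # cron starts jobs with a minimal environment, so change into the
    # Nutch directory explicitly; the subshell keeps the caller's cwd intact.
    (
        cd "$NUTCH_HOME" || exit 1
        # Nutch 1.x one-step crawl: seed URLs in urls/, results in crawl/
        bin/nutch crawl urls -dir crawl -depth 3 -topN 50 >> "$LOG_FILE" 2>&1
    )
}

# Report a failure on stderr instead of aborting silently.
run_crawl || echo "nutch crawl failed, see $LOG_FILE" >&2
```

Appending the output to a log file is worthwhile because cron does not show it on a terminal; without a redirect the output is mailed to the job owner or dropped.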
3. Run:
    sudo apt-get install libnotify-bin
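4. Restart the cron process and watch the result:
    sudo /etc/init.d/cron restart
   The restart may not succeed. If so, stop and start cron separately, as in the following session: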
15:40:34^O^bin$ sudo /etc/init.d/cron stop
[sudo] password for sniffer:
Rather than invoking init scripts through /etc/init.d, use the service(8)
utility, e.g. service cron stop

Since the script you are attempting to invoke has been converted to an
Upstart job, you may also use the stop(8) utility, e.g. stop cron
cron stop/waiting
15:40:49^O^bin$ ps -A | grep cron
15:40:54^O^bin$ sudo /etc/int.d/cron start
sudo: /etc/int.d/cron: command not found
15:41:11^O^bin$ sudo /etc/init.d/cron start
Rather than invoking init scripts through /etc/init.d, use the service(8)
utility, e.g. service cron start

Since the script you are attempting to invoke has been converted to an
Upstart job, you may also use the start(8) utility, e.g. start cron
cron start/running, process 14362
15:41:19^O^bin$ ps -A | grep cron
14362 ? 00:00:00 cron
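Note: when run from cron, the nutch script may fail to find JAVA_HOME, since cron jobs start with a minimal environment. Editing the following part of the script resolves this: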
if [ "$JAVA_HOME" = "" ]; then
  #echo "Error: JAVA_HOME is not set."
  #exit 1
  JAVA_HOME="***"
fi
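An alternative to hard-coding the path is to derive JAVA_HOME from the java binary found on PATH. The sketch below is only an illustration, not part of the nutch script: the helper name resolve_java_home is hypothetical, and it assumes a Linux system where readlink -f is available and java sits in a bin/ directory directly under the Java home.

```shell
#!/bin/sh
# resolve_java_home (hypothetical helper): print a JAVA_HOME guess
# derived from the java binary on PATH; fails if java is not found.
resolve_java_home() {
    java_bin=$(command -v java) || return 1
    # Follow symlinks such as /usr/bin/java -> /etc/alternatives/java -> ...
    java_real=$(readlink -f "$java_bin") || return 1
    # Strip the trailing /bin/java to get the home directory.
    dirname "$(dirname "$java_real")"
}

# Only fill in JAVA_HOME when the environment did not provide one.
if [ -z "$JAVA_HOME" ]; then
    JAVA_HOME=$(resolve_java_home) && export JAVA_HOME
fi
```

This could be placed in the wrapper script that cron invokes, so the nutch script itself stays unmodified. If java is not on cron's PATH either, the lookup fails and JAVA_HOME must still be set by hand.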