Problems running the Nutch 1.2 crawler in Eclipse

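I have been studying Nutch recently and imported the crawler's source code into Eclipse, configuring the project by following this page on the Apache wiki: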

 

http://wiki.apache.org/nutch/RunNutchInEclipse1.0

 

However, running the unit tests produces the following warnings, errors, and exception:

 

 

2011-05-27 11:15:46,747 WARN  regex.RegexURLNormalizer (RegexURLNormalizer.java:setConf(113)) - Can't load the default config file! regex-normalize.xml
2011-05-27 11:15:46,760 INFO  conf.Configuration (Configuration.java:getConfResourceAsReader(965)) - prefix-urlfilter.txt not found
2011-05-27 11:15:46,773 INFO  conf.Configuration (Configuration.java:getConfResourceAsReader(965)) - suffix-urlfilter.txt not found
2011-05-27 11:15:46,775 WARN  suffix.SuffixURLFilter (SuffixURLFilter.java:readConfigurationFile(175)) - Missing urlfilter.suffix.file, all URLs will be rejected!
2011-05-27 11:15:46,785 INFO  conf.Configuration (Configuration.java:getConfResourceAsReader(965)) - regex-urlfilter.txt not found
2011-05-27 11:15:46,786 ERROR api.RegexURLFilterBase (RegexURLFilterBase.java:setConf(138)) - Can't find resource: regex-urlfilter.txt
2011-05-27 11:15:46,794 INFO  conf.Configuration (Configuration.java:getConfResourceAsReader(965)) - automaton-urlfilter.txt not found
2011-05-27 11:15:46,795 ERROR api.RegexURLFilterBase (RegexURLFilterBase.java:setConf(138)) - Can't find resource: automaton-urlfilter.txt
2011-05-27 11:15:46,800 WARN  domain.DomainURLFilter (DomainURLFilter.java:setConf(135)) - Attribute "file" is not defined in plugin.xml for plugin urlfilter-domain
2011-05-27 11:15:46,801 INFO  conf.Configuration (Configuration.java:getConfResourceAsReader(968)) - found resource domain-urlfilter.txt at file:/boot/wx-zone/nutch_all/bin/domain-urlfilter.txt
2011-05-27 11:15:46,868 WARN  domain.DomainSuffixes (DomainSuffixes.java:<init>(47)) - java.net.MalformedURLException
    at java.net.URL.<init>(URL.java:601)
    at java.net.URL.<init>(URL.java:464)
    at java.net.URL.<init>(URL.java:413)
    at org.apache.xerces.impl.XMLEntityManager.setupCurrentEntity(Unknown Source)
    at org.apache.xerces.impl.XMLVersionDetector.determineDocVersion(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
    at org.apache.nutch.util.domain.DomainSuffixesReader.read(DomainSuffixesReader.java:54)
    at org.apache.nutch.util.domain.DomainSuffixes.<init>(DomainSuffixes.java:44)

 

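The log shows that several configuration files (regex-normalize.xml and the various *-urlfilter.txt files) were not loaded, even though the same code runs fine from the command line.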
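To see which files Eclipse is actually missing, a small check like the one below can help. It is only a sketch. It assumes the files are resolved through the class loader, which is what the "found resource ... at file:/.../bin/domain-urlfilter.txt" line in the log suggests. The class name ConfClasspathCheck and the helper findMissing are made up for this illustration and are not part of Nutch. The file names are the ones that appear in the log above.

```java
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch: reports which resource names
// cannot be found on the classpath of the given class loader.
final class ConfClasspathCheck {

    // File names copied from the log output above.
    static final String[] CONF_FILES = {
        "regex-normalize.xml",
        "prefix-urlfilter.txt",
        "suffix-urlfilter.txt",
        "regex-urlfilter.txt",
        "automaton-urlfilter.txt",
        "domain-urlfilter.txt"
    };

    // Returns the names for which loader.getResource(name) yields null.
    static List<String> findMissing(ClassLoader loader, String... names) {
        List<String> missing = new ArrayList<>();
        for (String name : names) {
            URL url = loader.getResource(name);
            if (url == null) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Assumption: the run uses the same class loader the tests would use.
        ClassLoader loader = ConfClasspathCheck.class.getClassLoader();
        for (String name : CONF_FILES) {
            URL url = loader.getResource(name);
            System.out.println(name + " -> " + (url == null ? "NOT FOUND" : url.toString()));
        }
    }
}
```

Running this class from the same Eclipse project (with the same build path as the tests) prints, for each file, either the location it was loaded from or NOT FOUND. Files reported as NOT FOUND are the ones that still need to be placed on the classpath.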

 

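In the end I solved it by copying all of the configuration files from the crawler's root directory into the src/test package. The likely reason this works is that the files are looked up on the classpath, and Eclipse copies non-Java files from source folders into its output directory (the log shows domain-urlfilter.txt being found under bin/). It seems the Nutch unit tests depend heavily on what is present in the test folder, which is rather messy.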