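Running Nutch on Windows is simple: all you need is to be able to execute the Crawl class. Write an Ant script, put it in the Nutch root directory, and run it. Its contents are as follows: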
<project name="nutch-crawl" default="crawl" basedir=".">
  <property name="lib.dir" location="lib" />
  <property name="conf.dir" location="conf" />

  <path id="project.classpath">
    <fileset dir="." includes="nutch-*.jar" />
    <fileset dir="lib" />
    <pathelement path="." />
    <pathelement path="${conf.dir}" />
  </path>

  <target name="crawl">
    <echo>crawling starting</echo>
    <property name="JVM.extra.args" value="-Xmx512m" />
    <java classname="org.apache.nutch.crawl.Crawl" classpathref="project.classpath" fork="true">
      <jvmarg line="${JVM.extra.args}" />
      <!-- directory containing url.txt -->
      <arg value="C:/dev-tools/nutch-0.9/urls" />
      <!-- directory where the crawl output is stored -->
      <arg value="-dir" />
      <arg value="C:/dev-tools/nutch-0.9/crawl" />
      <arg value="-depth" />
      <arg value="3" />
      <arg value="-threads" />
      <arg value="15" />
    </java>
    <echo>crawling finished</echo>
  </target>
</project>
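The crawl parameters in the script above are hard-coded. As a minimal sketch, assuming you want to change them without editing the file each time, they could be pulled out into Ant properties. The property names `urls.dir`, `crawl.dir`, `crawl.depth` and `crawl.threads` are made up for illustration and are not defined by Nutch or Ant; the default values are the same ones used above.

```xml
<project name="nutch-crawl" default="crawl" basedir=".">
  <property name="conf.dir" location="conf" />

  <!-- Hypothetical property names; defaults copied from the script above -->
  <property name="urls.dir" value="C:/dev-tools/nutch-0.9/urls" />
  <property name="crawl.dir" value="C:/dev-tools/nutch-0.9/crawl" />
  <property name="crawl.depth" value="3" />
  <property name="crawl.threads" value="15" />
  <property name="JVM.extra.args" value="-Xmx512m" />

  <path id="project.classpath">
    <fileset dir="." includes="nutch-*.jar" />
    <fileset dir="lib" />
    <pathelement path="." />
    <pathelement path="${conf.dir}" />
  </path>

  <target name="crawl">
    <java classname="org.apache.nutch.crawl.Crawl" classpathref="project.classpath" fork="true">
      <jvmarg line="${JVM.extra.args}" />
      <!-- Same argument order as before: urls dir, -dir, -depth, -threads -->
      <arg value="${urls.dir}" />
      <arg value="-dir" />
      <arg value="${crawl.dir}" />
      <arg value="-depth" />
      <arg value="${crawl.depth}" />
      <arg value="-threads" />
      <arg value="${crawl.threads}" />
    </java>
  </target>
</project>
```

Since Ant properties given on the command line take precedence over those declared in the build file, something like `ant -Dcrawl.depth=5` should then override the default depth for a single run.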
To launch build.xml, use the batch file run.bat (place it in the Nutch root directory; this assumes the project is on drive E):
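@echo off
rem The project is assumed to be on drive E
cd e:
rem ant is itself a batch script on Windows, so "call" is needed to return here for the pause
call ant
pause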
