
Getting Nutch 1.2 to index the file system

October 12, 2010

1. Changes to conf/nutch-site.xml

  • Add the file protocol plugin to plugin.includes
  • Remove the http protocol plugin if no web spidering is required
Old value of plugin.includes:
   <value>protocol-file|protocol-http|urlfilter-regex|
      parse-(text|html|js|tika)|index-(basic|anchor)|
      query-(basic|site|url)|response-(json|xml)|
      summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
New value:
    <value>protocol-file|urlfilter-regex|parse-(text|html|js|tika)|
      index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|
      summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>

Add a property to stop the crawler from ascending into parent directories:

   <property>
      <name>file.crawl.parent</name>
      <value>false</value>
   </property>
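Putting both changes from this step together, the relevant section of conf/nutch-site.xml looks roughly like this (the plugin list is the "new value" shown above):

```xml
<property>
   <name>plugin.includes</name>
   <value>protocol-file|urlfilter-regex|parse-(text|html|js|tika)|
      index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|
      summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<property>
   <name>file.crawl.parent</name>
   <value>false</value>
</property>
```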

2. Create the seed file. This contains the list of URIs that will be the starting points for the crawler.

  • In the Nutch home directory, create the seed directory and file:
     > cat seedUrls/urls
     file://d:/Sandbox
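The seed file can be created from the Nutch home directory like this (the directory and file names `seedUrls/urls` follow this tutorial's convention; the path `file://d:/Sandbox` is the example seed from above):

```shell
# Create the seed directory and seed file in the Nutch home directory.
mkdir -p seedUrls
# Each line in the seed file is one starting URI for the crawler.
echo "file://d:/Sandbox" > seedUrls/urls
# Verify the contents.
cat seedUrls/urls
```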

3. Update conf/regex-urlfilter.txt

Comment out the skip rule so that file: URLs are allowed:

 # skip file: ftp: and mailto: urls
 #-^(file|ftp|mailto):

 # accept anything else
 +.
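To see why commenting the rule out matters, the skip pattern can be sanity-checked with plain grep: the original `-^(file|ftp|mailto):` rule matches (and would therefore have rejected) our file: seed URI. This is just an illustration of the regex, not how Nutch itself evaluates filters.

```shell
# Count how many of our seed URIs the original skip pattern would match.
# A count of 1 means the file: URI would have been rejected before the fix.
echo 'file://d:/Sandbox' | grep -cE '^(file|ftp|mailto):'   # prints 1
```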

4. Clean data from previous runs

This step is important: every run of Nutch saves the crawled links in its database and will resume indexing those saved links, regardless of whether the new seed URIs still contain them.

> rm -r $NUTCH_HOME/indexDir

5. Run Nutch

>  cd $NUTCH_HOME

>  bin/nutch crawl seedUrl/urls  -dir indexDir


Advertisements