Getting Nutch 1.2 to index the file system

1. Changes to conf/nutch-site.xml

  • add the file protocol
  • remove the http protocol if no web spidering is required
old line (the plugin.includes value):
   <value>protocol-file|protocol-http|urlfilter-regex|
      parse-(text|html|js|tika)|index-(basic|anchor)|
      query-(basic|site|url)|response-(json|xml)|
      summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
new line:
    <value>protocol-file|urlfilter-regex|parse-(text|html|js|tika)|
      index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|
      summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>

Add a property to stop the crawler from climbing into the parent directories of the seed URL:

   <property>
      <name>file.crawl.parent</name>
      <value>false</value>
   </property>
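Putting the two changes together, the relevant section of conf/nutch-site.xml might look like this (a sketch: the plugin.includes value is the trimmed list shown above, and both property names are standard Nutch settings):

```xml
<configuration>
   <property>
      <name>plugin.includes</name>
      <value>protocol-file|urlfilter-regex|parse-(text|html|js|tika)|
         index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|
         summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
   </property>
   <property>
      <name>file.crawl.parent</name>
      <value>false</value>
   </property>
</configuration>
```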

2. Create the seed file. This contains the list of URIs that will be the starting points for the crawler.

  • in the Nutch home directory, create a seedUrls directory containing a urls file:
     > cat seedUrls/urls
     file://d:/Sandbox
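The step above can be scripted as follows (the seedUrls/urls naming is an assumption; the post's own spelling of it varies):

```shell
# create the seed directory and the URL list file in one go
mkdir -p seedUrls
printf 'file://d:/Sandbox\n' > seedUrls/urls
cat seedUrls/urls
```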

3. Update conf/regex-urlfilter.txt

Comment out the rule that skips file: URLs so they are accepted:

 # skip file: ftp: and mailto: urls 
 #-^(file|ftp|mailto):  
 # accept anything else

+.

4. Clean data from previous runs

This step is important: on every run, Nutch saves the crawled links in its database and will start indexing those saved links, regardless of whether the new seed URIs contain the old links.

> rm -r $NUTCH_HOME/indexDir
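An equivalent cleanup that does not error when the directory is absent (assuming NUTCH_HOME is set; it defaults to the current directory here):

```shell
# remove the output of previous crawls; -f avoids an error
# when indexDir does not exist yet
NUTCH_HOME="${NUTCH_HOME:-.}"
rm -rf "$NUTCH_HOME/indexDir"
```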

5. Run Nutch

>  cd $NUTCH_HOME

>  bin/nutch crawl seedUrls/urls -dir indexDir
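For larger directory trees, the Nutch 1.x crawl command also accepts -depth (how many link levels to follow) and -topN (maximum pages fetched per level); the values below are illustrative, not recommendations:

```shell
cd $NUTCH_HOME
bin/nutch crawl seedUrls/urls -dir indexDir -depth 3 -topN 50
```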

