Is WordPress a CMS/ECM killer?

September 26, 2008

With 1.7 million downloads of WordPress version 2.6 and loads of plugins for everything imaginable, is WordPress going to replace traditional CMS/WCM/ECM applications? Drupal and Joomla are equally impressive. Read more at informationzen.org

Getting Nutch 1.2 to index file system

October 12, 2010

1. Changes to conf/nutch-site.xml

  • add protocol-file
  • remove protocol-http if no web spidering is required
old line
   <value>protocol-file|protocol-http|urlfilter-regex|
      parse-(text|html|js|tika)|index-(basic|anchor)|
      query-(basic|site|url)|response-(json|xml)|
      summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
new line
    <value>protocol-file|urlfilter-regex|parse-(text|html|js|tika)|
      index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|
      summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>

Add a property so that the crawler does not climb into parent directories and index from the root

   <property>
      <name>file.crawl.parent</name>
      <value>false</value>
   </property>

2. Create the seed file. This contains the list of URIs that will be the starting points for the crawler

  • in the Nutch home directory, create seedUrl/urls
     > cat seedUrl/urls
     file://d:/Sandbox

3. Update conf/regex-urlfilter.txt

Comment out the line that skips file: URLs so that they are accepted

 # skip file: ftp: and mailto: urls 
 #-^(file|ftp|mailto):  
 # accept anything else

+.
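To see why commenting out that one line is enough, here is a minimal Python sketch of how a rule file like this can be evaluated. It assumes the filter takes the first rule whose pattern is found in the URL and rejects URLs that match no rule; the function names are made up for this illustration and are not part of Nutch.

```python
import re

def load_rules(text):
    """Parse regex-urlfilter style lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments are ignored
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """Return True if the first rule matching the URL is an accept rule."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # assumption: a URL that matches no rule is rejected

# The rules before and after the change made in step 3.
DEFAULT_RULES = "-^(file|ftp|mailto):\n+."
UPDATED_RULES = "#-^(file|ftp|mailto):\n+."

print(accepts(load_rules(DEFAULT_RULES), "file://d:/Sandbox"))  # False
print(accepts(load_rules(UPDATED_RULES), "file://d:/Sandbox"))  # True
```

With the skip rule active, the file: URL hits the reject rule first; once it is commented out, the catch-all "+." rule accepts it.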

4. Clean data from previous runs

This step is important: on every run Nutch saves the crawled links in its database, and the next run will start indexing those saved links regardless of whether the new seed URIs still include them.

> rm -r $NUTCH_HOME/indexDir

5. Run Nutch

>  cd $NUTCH_HOME

>  bin/nutch crawl seedUrl/urls  -dir indexDir
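Steps 4 and 5 are easy to forget as a pair, so they can be wrapped in one small script. The following is only a sketch: the function names are mine, and it assumes the same directory names as the commands above and that bin/nutch can be run from the Nutch home directory.

```python
import shutil
import subprocess
from pathlib import Path

def clean_previous_run(index_dir):
    """Delete the crawl output directory so stale links are not re-indexed."""
    shutil.rmtree(index_dir, ignore_errors=True)

def crawl_command(seed_path, index_dir):
    """Build the argument list for the crawl command used in step 5."""
    return ["bin/nutch", "crawl", str(seed_path), "-dir", str(index_dir)]

def recrawl(nutch_home, seed_path="seedUrl/urls", index_dir="indexDir"):
    """Clean old data (step 4), then run the crawl from the Nutch home (step 5)."""
    home = Path(nutch_home)
    clean_previous_run(home / index_dir)
    # check=True raises CalledProcessError if the crawl exits with an error
    subprocess.run(crawl_command(seed_path, index_dir), cwd=home, check=True)
```

Calling recrawl(os.environ["NUTCH_HOME"]) after an import os would then do the clean-up and the crawl in one go.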


YES to NoSQL

August 13, 2010

What problems are NoSQL databases solving?

  • Ease of use, installation and maintenance
  • Schema free – yet support searching, indexing, CRUD operations
  • Support storing of multiple data formats, including large binary data
  • Small footprint
  • Scalability
  • Cost
  • High performance by way of lesser functionality and dropping

NoSQL is to databases what RISC is to CPU architecture: performance and simplicity over complexity.

NoSQL-style databases have often been termed non-relational databases, yet NoSQL databases do support relationships; if they did not, they would be of no use. NoSQL is a movement towards schema-free databases. They do not implicitly support JOIN; this operation is done within the application, using iteration or hookups.
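A join done within the application can be sketched in a few lines of Python. The two collections below are plain lists of dicts with made-up data, standing in for documents fetched from a schema-free store; no particular database API is implied.

```python
# Two "collections" of schema-free documents (hypothetical sample data).
authors = [
    {"_id": 1, "name": "Ann"},
    {"_id": 2, "name": "Bob", "twitter": "@bob"},  # extra field, no schema change
]
posts = [
    {"_id": 10, "author_id": 1, "title": "YES to NoSQL"},
    {"_id": 11, "author_id": 2, "title": "MongoDb"},
    {"_id": 12, "author_id": 1, "title": "Web 3.0"},
]

def join_posts_to_authors(posts, authors):
    """Application-side join: index one side, then iterate over the other."""
    by_id = {a["_id"]: a for a in authors}
    return [
        {"title": p["title"], "author": by_id[p["author_id"]]["name"]}
        for p in posts
        if p["author_id"] in by_id  # skip posts that point at a missing author
    ]

for row in join_posts_to_authors(posts, authors):
    print(row["title"], "by", row["author"])
```

The relationship lives in the author_id field; the database only stores it, and the application decides how to follow it.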

What's wrong with Relational Databases

  • Cost
  • Setup and configuration effort
  • Supervision and periodic tuning
  • Overkill for small and medium-sized applications
  • Rigid schema; as applications grow, the schema soon becomes less clean
  • Modifying schema is hard
  • Harder to scale
  • Most web applications are geared towards search/insert
  • Normalizing data:
    • was important when storage was costly
    • Joins are costly
    • Supports data integrity where rules are strictly enforced and data is consistent
    • hard to maintain multiple levels of truth
    • Not suited for fuzzy search or non-indexed search
    • De-normalized data is better manipulated by application where rules can be customized depending upon the situation
    • De-normalized data is better approach if application only does inserts and searches
  • noSQL databases need atomicity on a single record level only

Drawbacks of noSQL databases

  • Data Integrity needs to be enforced in application
  • Duplicate and inconsistent data
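What "enforced in application" means in practice can be shown with a small Python sketch. UserStore is a hypothetical in-memory stand-in for a schema-free collection; the required fields and the duplicate rule are assumptions chosen for the example.

```python
class UserStore:
    """In-memory stand-in for a schema-free collection (hypothetical)."""

    REQUIRED = ("email", "name")

    def __init__(self):
        self._docs = {}

    def insert(self, doc):
        # The store would accept any shape, so the application validates first.
        missing = [f for f in self.REQUIRED if f not in doc]
        if missing:
            raise ValueError(f"missing fields: {missing}")
        key = doc["email"].strip().lower()  # normalize so duplicates are caught
        if key in self._docs:
            raise ValueError(f"duplicate email: {key}")
        self._docs[key] = dict(doc)

    def __len__(self):
        return len(self._docs)

store = UserStore()
store.insert({"email": "ann@example.org", "name": "Ann"})
try:
    store.insert({"email": "ANN@example.org", "name": "Ann again"})
except ValueError as err:
    print("rejected:", err)
```

Every code path that writes to the collection has to go through checks like these, which is exactly the cost this drawback describes.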

Advantages

  • Small footprint – mobile devices, web based apps
  • Suited for JSON/Rest based interfaces
  • Scalability
  • Performance
  • Availability
  • Infrastructure – supported in clouds, any OS
  • Fluid schema

Both of these databases are disk-based, document-oriented, open-source databases. They have a rich text-based administration interface.

Useful Links

MongoDB Admin

MongoDB interactive shell

NoSQL Landscape

MongoDb

MongoDb

August 13, 2010
  • Scalable
  • Indexing
  • Geo-spatial indexing, finding objects by location, proximity
  • Auto Sharding (version 1.7 onwards) – scaling by division
    • Queries can be run in parallel across all shards
  • Storage – disk based, using BSON (Binary serialized JSON)
  • Supports binary large objects, images, videos
  • Database replication
    • support of replication clusters
    • Automatic fail-over
    • Master slave(s) configuration
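The idea of "scaling by division" with queries run in parallel across shards can be illustrated with a toy Python model. This is not how MongoDB implements sharding; the class, the range-based split points and the user_id shard key are all assumptions made for the sketch.

```python
from bisect import bisect_right
from concurrent.futures import ThreadPoolExecutor

class ShardedCollection:
    """Toy range-sharded collection; each shard is a plain list of dicts."""

    def __init__(self, split_points):
        # split_points [100, 200] gives 3 shards: <100, 100..199, >=200
        self.split_points = sorted(split_points)
        self.shards = [[] for _ in range(len(self.split_points) + 1)]

    def shard_for(self, key):
        """Pick the shard index whose key range contains the given key."""
        return bisect_right(self.split_points, key)

    def insert(self, doc, shard_key="user_id"):
        self.shards[self.shard_for(doc[shard_key])].append(doc)

    def find(self, predicate):
        """Scatter the query to every shard in parallel, then gather results."""
        def scan(shard):
            return [d for d in shard if predicate(d)]
        with ThreadPoolExecutor(max_workers=len(self.shards)) as pool:
            parts = list(pool.map(scan, self.shards))
        return [doc for part in parts for doc in part]

users = ShardedCollection([100, 200])
for uid in (5, 100, 150, 200, 999):
    users.insert({"user_id": uid})
print([len(s) for s in users.shards])
print(users.find(lambda d: d["user_id"] >= 150))
```

Each insert touches exactly one shard, while a query that cannot be routed by key is sent to all shards and the partial results are merged.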

Concepts/Terms MongoDB==Relational DB

  • Documents == records/objects
  • Collections == tables
    • A collection may have heterogeneous set of documents
    • No need to pre-define columns or fields within a collection
  • query == cursor
  • Index == indexes
  • embedding and linking == Join
  • Queries return a cursor, not a collection (for performance reasons)
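Two of these concepts, heterogeneous collections and cursors, can be mimicked in plain Python. The sketch below uses a generator as a stand-in for a cursor; the find function and the sample documents are made up and are not the MongoDB driver API.

```python
def find(collection, **criteria):
    """Return a lazy cursor (a generator) over the matching documents."""
    for doc in collection:
        if all(doc.get(k) == v for k, v in criteria.items()):
            yield doc

# A heterogeneous collection: documents need not share the same fields.
things = [
    {"kind": "note", "text": "buy milk"},
    {"kind": "photo", "file": "cat.jpg", "width": 640},
    {"kind": "note", "text": "call home", "done": True},
]

cursor = find(things, kind="note")  # nothing has been scanned yet
first = next(cursor)                # scans only as far as the first match
print(first["text"])
for doc in cursor:                  # continues from where the cursor stopped
    print(doc["text"])
```

Because results are produced one at a time, the caller can stop early without the whole result set ever being built, which is the performance argument for returning a cursor.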
Drawbacks

  • No transaction support
  • No data integrity

    MongoDB – Installation and First Steps

    MongoDb Interactive shell – basic commands

    MongoDb Interactive shell – searching records

    Web 3.0 and Semantic Web

    October 1, 2008

    Imagine searching for something on the web and finding it on the first page of search results. How often does this happen? If you happen to be looking for a product, you see result after result selling that product. The Semantic Web, or Web 3.0, could change all this: as web sites start marking up their content, search results are more likely to be relevant. As content (including bad and misleading content) grows exponentially, finding relevant content will become more and more difficult. Read what the Semantic Web is all about and how it will change our web experience.

    Searching is just one of the areas where we will see the influence of Web 3.0; changes will also affect how we are able to mash up information from different sites to get relevant answers. Take, for example, the query ‘list all US Presidents that studied in the Ivy League’. Looking at the query, we instinctively know what we are looking for, but search engines will parse it for keywords and list all web pages that contain the words US, president, Ivy League. Since the term Ivy League applies to certain universities, the results would be all over the place. This scenario is a bit simplistic, because most search engines understand ‘US Presidents’ as a single keyword and not two, but I hope you get my drift on what we can expect from Web 3.0.
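To make the contrast with keyword matching concrete, here is a minimal Python sketch that answers the example query over a handful of facts stored as (subject, predicate, object) triples. The predicate names and the tiny hand-picked data set are assumptions for illustration; a real semantic web stack would use RDF data and a query language rather than this code.

```python
# (subject, predicate, object) triples; a tiny hand-picked sample.
TRIPLES = [
    ("John F. Kennedy", "is_a", "US President"),
    ("John F. Kennedy", "studied_at", "Harvard University"),
    ("Barack Obama", "is_a", "US President"),
    ("Barack Obama", "studied_at", "Columbia University"),
    ("Ronald Reagan", "is_a", "US President"),
    ("Ronald Reagan", "studied_at", "Eureka College"),
    ("Harvard University", "member_of", "Ivy League"),
    ("Columbia University", "member_of", "Ivy League"),
]

def subjects(predicate, obj):
    """All subjects that have the given predicate/object pair."""
    return {s for s, p, o in TRIPLES if p == predicate and o == obj}

def ivy_league_presidents():
    """Combine three kinds of facts that may come from different sites."""
    presidents = subjects("is_a", "US President")
    ivy_schools = subjects("member_of", "Ivy League")
    return sorted({
        s for s, p, o in TRIPLES
        if p == "studied_at" and s in presidents and o in ivy_schools
    })

print(ivy_league_presidents())  # ['Barack Obama', 'John F. Kennedy']
```

The query never looks for the words "Ivy League" on a president's page; it follows the marked-up relationship from person to university to league.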

    The Semantic Web article is also posted at informationzen.org.

    Microformats – first step towards semantic web

    September 26, 2008

    The Semantic Web will have a tough time relying solely on natural language processing to extract semantic knowledge from content. Microformats are useful not only for annotating data on web pages; their format specifications are also a good starting point for designing content entry forms for commonly used entities. Some commonly used microformats are:

    • hCard for user profile
    • hCalendar for calendar or events
    • hReview – for product/recipe reviews
    • geo – geographical location
    • hAtom – RSS entries/blogs/web pages
    • XFN – human relationships
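As an illustration of why such annotations are easy for machines to consume, the Python sketch below pulls hCard properties out of a page using only the standard library. The sample markup and the short property list are assumptions for the example, and the parser is deliberately naive: it does not track where a vcard element ends or handle nested properties.

```python
from html.parser import HTMLParser

class HCardParser(HTMLParser):
    """Collect text from elements whose class is a known hCard property."""

    PROPERTIES = {"fn", "org", "tel"}  # small subset, chosen for the example

    def __init__(self):
        super().__init__()
        self.cards = []       # one dict per class="vcard" element
        self._current = None  # property whose text we are waiting for

    def handle_starttag(self, tag, attrs):
        classes = set((dict(attrs).get("class") or "").split())
        if "vcard" in classes:
            self.cards.append({})
        hit = sorted(classes & self.PROPERTIES)
        self._current = hit[0] if hit and self.cards else None

    def handle_data(self, data):
        if self._current and data.strip():
            self.cards[-1][self._current] = data.strip()
            self._current = None

SAMPLE = """
<div class="vcard">
  <span class="fn">Jane Doe</span>,
  <span class="org">Example Corp</span>,
  <span class="tel">+1-555-0100</span>
</div>
"""

parser = HCardParser()
parser.feed(SAMPLE)
print(parser.cards)
```

No natural language processing is involved: the class names tell the parser which piece of text is the name, the organisation and the phone number.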

    Using microformats is not just useful for the Semantic Web, which is not expected until 2010; embedding microformats is also likely to improve your rankings in search listings. With the amount of data on the web growing exponentially, keyword-based search is becoming less of a solution. Yahoo has already announced that it will process web content for semantic data and that it supports microformats.