Akedar's Weblog

Is WordPress a CMS/ECM killer?

September 26, 2008

With 1.7 million downloads of WordPress version 2.6 and plugins for almost everything imaginable, is WordPress going to replace traditional CMS/WCM/ECM applications? Drupal and Joomla are equally impressive. Read more at informationzen.org

Tags: CMS, drupal, wcm, wordpress
Posted in CMS

Getting Nutch 1.2 to index file system

October 12, 2010

1. Change conf/nutch-site.xml

– add the protocol-file plugin
– remove the protocol-http plugin if no web crawling is required

Old line:
   <value>protocol-file|protocol-http|urlfilter-regex|
      parse-(text|html|js|tika)|index-(basic|anchor)|
      query-(basic|site|url)|response-(json|xml)|
      summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
New line:
   <value>protocol-file|urlfilter-regex|parse-(text|html|js|tika)|
      index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|
      summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>

The value is wrapped here for readability. Keep it on a single line in nutch-site.xml, since it is read as one regular expression.

Add a property to stop the crawler from climbing into parent directories, so the root does not get indexed

   <property>
      <name>file.crawl.parent</name>
      <value>false</value>
   </property>

2. Create the seed file. It contains the list of URIs that are the starting points for the crawler.

In the Nutch home directory, create a seedUrl directory containing a urls file:

     > cat seedUrl/urls
     file://d:/Sandbox

3. Update conf/regex-urlfilter.txt

Comment out the rule that skips file: URLs so that they are accepted:

 # skip file: ftp: and mailto: urls 
 #-^(file|ftp|mailto):  
 # accept anything else
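
To see what this edit changes, here is a minimal Python sketch of how a rule file of this shape is evaluated. It assumes the semantics described in the comments of the default regex-urlfilter.txt: each rule is a + or - followed by a regular expression, the first matching rule decides, and a URL that matches no rule is ignored. The function names, the trimmed-down rule sets and the file name report.txt are illustrative and not part of Nutch.

```python
import re

def parse_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blank lines and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # True for an accept rule ('+'), False for a reject rule ('-').
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules

def accepts(rules, url):
    """The first rule whose pattern is found in the URL decides; no match rejects."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False

DEFAULT = """
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
# accept anything else
+.
"""

EDITED = """
# skip file: ftp: and mailto: urls
#-^(file|ftp|mailto):
# accept anything else
+.
"""

url = "file://d:/Sandbox/report.txt"  # hypothetical file under the seed directory
print(accepts(parse_rules(DEFAULT), url))
print(accepts(parse_rules(EDITED), url))
```

With the skip rule commented out, ftp: and mailto: URLs are let through as well. If that matters, narrowing the rule to -^(ftp|mailto): instead of disabling it is one alternative.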

4. Clean data from previous runs

This step is important: on every run Nutch saves the crawled links in its database, and the next run starts indexing those saved links whether or not the new seed URIs include them.

> rm -r $NUTCH_HOME/indexDir

5. Run Nutch

> cd $NUTCH_HOME

> bin/nutch crawl seedUrl/urls -dir indexDir

Tags: Lucene, Nutch
Posted in Lucene/Nutch/Solr

YES to NoSQL

August 13, 2010

What problems are NoSQL databases solving?

Ease of use, installation and maintenance
Schema free – yet support searching, indexing, CRUD operations
Support for storing multiple data formats, including large binary data
Small footprint
Scalability
Cost
High performance by way of lesser functionality and dropping

NoSQL is to databases what RISC is to CPU architecture: performance and simplicity over complexity.

NoSQL-style databases have often been termed non-relational databases, yet, not surprisingly, they do support relationships. If they did not, they would be of no use. NoSQL is a movement towards schema-free databases. These databases do not implicitly support JOIN; that operation is done within the application using iteration or hookups.
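
As a rough illustration of a join done in the application, here is a minimal Python sketch that uses plain lists of dictionaries in place of a real database. The collection names, the fields and the author_id link are all made up for the example.

```python
# Two "collections" held as plain lists of dicts (illustrative data only).
authors = [
    {"_id": 1, "name": "Ann"},
    {"_id": 2, "name": "Bob"},
]
posts = [
    {"title": "Intro to NoSQL", "author_id": 1},
    {"title": "Schema-free design", "author_id": 1},
    {"title": "Sharding notes", "author_id": 2},
]

def join_posts_with_authors(posts, authors):
    """Application-side equivalent of posts JOIN authors ON author_id = _id."""
    # Build a lookup table once, so each post needs one dict access
    # instead of a scan over all authors.
    by_id = {author["_id"]: author for author in authors}
    joined = []
    for post in posts:
        author = by_id.get(post["author_id"])
        if author is not None:  # inner-join semantics: drop unmatched posts
            joined.append({"title": post["title"], "author": author["name"]})
    return joined

for row in join_posts_with_authors(posts, authors):
    print(row["author"], "-", row["title"])
```

The trade-off is that the application, not the database, now decides what happens to posts whose author is missing.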

What's wrong with Relational Databases

Cost
Setup and configuration effort
Supervision and periodic tuning
Overkill for small and medium-sized applications
Rigid schema: as applications grow, the schema soon gets less clean
Modifying schema is hard
Harder to scale
Most web applications are geared towards search/insert
Normalizing data:
- Was important when storage was costly
- Joins are costly
- Supports data integrity where rules are strictly enforced and data is consistent
- Hard to maintain multiple levels of truth
- Not suited for fuzzy search or non-indexed search
- De-normalized data is better manipulated by the application, where rules can be customized depending upon the situation
- De-normalized data is a better approach if the application only does inserts and searches
NoSQL databases need atomicity at the single-record level only

Drawbacks of noSQL databases

Data integrity needs to be enforced in the application
Duplicate and inconsistent data

Advantages

Small footprint – mobile devices, web based apps
Suited for JSON/REST-based interfaces
Scalability
Performance
Availability
Infrastructure – supported in clouds, any OS
Fluid schema

Both of these databases are disk-based, document-oriented, open-source databases. They have a rich text-based administration interface.

Useful Links

MongoDB Admin

MongoDB interactive shell

NoSQL Landscape

MongoDb

Tags: mongoDb, noSQL
Posted in noSQL

MongoDb

August 13, 2010

Scalable
Indexing
Geo-spatial indexing, finding objects by location, proximity
Auto Sharding (version 1.7 onwards) – scaling by division
- Queries can be run in parallel across all shards
Storage – disk based, using BSON (Binary serialized JSON)
Supports binary large objects, images, videos
Database replication
- support of replication clusters
- Automatic fail-over
- Master slave(s) configuration
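
The idea behind the auto-sharding item above, scaling by division with queries fanned out to every shard, can be sketched as a toy model in Python. This is not MongoDB's implementation: the split points, the user_id shard key and the in-memory lists standing in for shards are all assumptions made for the example.

```python
from bisect import bisect_right
from concurrent.futures import ThreadPoolExecutor

# Hypothetical split points on the shard key "user_id":
# shard 0 holds keys < 100, shard 1 holds 100..199, shard 2 holds keys >= 200.
SPLIT_POINTS = [100, 200]
shards = [[] for _ in range(len(SPLIT_POINTS) + 1)]

def shard_for(key):
    """Pick the shard whose key range contains the given shard key."""
    return bisect_right(SPLIT_POINTS, key)

def insert(doc):
    """Route a document to exactly one shard based on its shard key."""
    shards[shard_for(doc["user_id"])].append(doc)

def query_all(predicate):
    """Scatter the query to every shard in parallel, then gather the results."""
    def scan(shard):
        return [doc for doc in shard if predicate(doc)]
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partial_results = list(pool.map(scan, shards))
    return [doc for part in partial_results for doc in part]

for user_id in (5, 150, 250, 99, 100):
    insert({"user_id": user_id, "active": user_id % 2 == 0})

print([len(shard) for shard in shards])
print(query_all(lambda doc: doc["active"]))
```

A query that filters on the shard key itself could be sent to a single shard via shard_for. Only queries on other fields need the full scatter-gather.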

Concepts/Terms MongoDB==Relational DB

Documents == records/objects

Collections == tables

A collection may have a heterogeneous set of documents
No need to pre-define columns or fields within a collection

query == cursor

Index == indexes

embedding and linking == Join

Queries return a cursor, not a collection (for performance reasons)
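
A cursor can be pictured as a lazy iterator: documents are produced only as the caller asks for them. The Python sketch below models that with a generator over an in-memory list. It is an analogy rather than the MongoDB driver API, and the find function and sample documents are invented for the example.

```python
def find(collection, criteria):
    """Return a lazy cursor (a generator) over documents matching all criteria."""
    for doc in collection:
        if all(doc.get(field) == value for field, value in criteria.items()):
            yield doc

# Documents in one collection need not share the same fields.
people = [
    {"name": "Ann", "city": "Pune"},
    {"name": "Bob", "city": "Pune", "phones": ["555-0100"]},
    {"name": "Cid", "age": 41},
]

cursor = find(people, {"city": "Pune"})  # nothing has been scanned yet
first = next(cursor)                     # scans only until the first match
print(first["name"])
rest = list(cursor)                      # drains the remaining matches
print([doc["name"] for doc in rest])
```

The sample data also shows the heterogeneous collection mentioned above: only one document carries an embedded phones list, and one has no city at all.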

Drawbacks

No Transaction support

No data integrity enforcement (it has to be handled in the application)

MongoDB – Installation and First Steps

MongoDb Interactive shell – basic commands

MongoDb Interactive shell – searching records

Tags: mongoDb, noSQL
Posted in mongoDB, noSQL

Web 3.0 and Semantic Web

October 1, 2008

Imagine searching for something on the web and finding it on the first page of search results. How often does this happen? If you happen to be looking for a product, you see result after result selling it. The Semantic Web, or Web 3.0, could change all this: as web sites start marking up their content, search results are more likely to be relevant. As content (including bad and misleading content) grows exponentially, finding relevant content will become more and more difficult. Read what the Semantic Web is all about and how it will change our web experience.

Searching is just one of the areas where we will see the influence of Web 3.0. It will also affect how we are able to mash up information from different sites to get relevant answers. Take, for example, the query ‘list all US Presidents that studied in the Ivy League’. Looking at the query, we instinctively know what we are looking for, but a search engine will parse it for keywords and list all web pages that contain the words US president and Ivy League. Since the term Ivy League applies to a group of universities, the results would be all over the place. This scenario is a bit simplistic, because most search engines treat ‘US Presidents’ as a single keyword and not two. Anyway, I hope you get my drift on what we can expect from Web 3.0.
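
The difference between the two approaches can be sketched in a few lines of Python. The page texts below are invented, the facts list is a tiny hand-written and deliberately incomplete sample standing in for data that sites would publish as markup, and the field names are assumptions for the example.

```python
IVY_LEAGUE = {
    "Brown", "Columbia", "Cornell", "Dartmouth",
    "Harvard", "Penn", "Princeton", "Yale",
}

# Invented page texts, as a keyword-based engine would see them.
pages = [
    "Quiz night: name a US President and an Ivy League mascot",
    "Barack Obama, US President, studied at Columbia and Harvard",
]

# The same kind of knowledge as structured facts (incomplete sample).
facts = [
    {"name": "John F. Kennedy", "role": "US President", "studied_at": {"Harvard"}},
    {"name": "Ronald Reagan", "role": "US President", "studied_at": {"Eureka College"}},
    {"name": "Barack Obama", "role": "US President", "studied_at": {"Columbia", "Harvard"}},
]

def keyword_search(pages, phrases):
    """Return pages whose text contains every phrase, ignoring case."""
    return [p for p in pages if all(ph.lower() in p.lower() for ph in phrases)]

def ivy_league_presidents(facts):
    """Answer the question from structured data using the meaning of 'Ivy League'."""
    return [
        f["name"]
        for f in facts
        if f["role"] == "US President" and f["studied_at"] & IVY_LEAGUE
    ]

print(keyword_search(pages, ["US President", "Ivy League"]))
print(ivy_league_presidents(facts))
```

The keyword search returns the quiz page and misses the page that actually answers the question, because that page never uses the words Ivy League. The structured query gets there by knowing which universities the term covers.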

The Semantic Web article is also posted at informationzen.org.

Tags: CMS, microformats, rdf, rdfa, SemanticWeb, Web 3.0
Posted in SemanticWeb, Uncategorized, Web 3.0

Microformats – first step towards semantic web

September 26, 2008

The Semantic Web will have a tough time relying solely on natural language processing to extract semantic knowledge from content. Microformats are not only useful for annotating data on web pages; their format specifications are also a good starting point for designing content entry forms for commonly used entities. Some of the commonly used microformats are:

hCard – user profiles
hCalendar – calendars or events
hReview – product/recipe reviews
geo – geographical location
hAtom – RSS entries/blogs/web pages
XFN – human relationships
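
To show why such annotations are easy for software to consume, here is a minimal Python sketch that pulls a few hCard properties out of a page using only the standard library. The class names vcard, fn, org and tel come from hCard. The sample markup and the HCardParser class are made up for the example, and the parser assumes well-nested markup without void elements such as br.

```python
from html.parser import HTMLParser

class HCardParser(HTMLParser):
    """Collect the text of elements whose class is one of a few hCard properties."""
    PROPERTIES = {"fn", "org", "tel"}

    def __init__(self):
        super().__init__()
        self.card = {}
        self._stack = []  # one entry per open element: property name or None

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        prop = next((c for c in classes if c in self.PROPERTIES), None)
        self._stack.append(prop)

    def handle_endtag(self, tag):
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        # Keep text only when the innermost open element is a known property.
        if self._stack and self._stack[-1] and data.strip():
            self.card[self._stack[-1]] = data.strip()

html = """
<div class="vcard">
  <span class="fn">Jane Doe</span>,
  <span class="org">Example Corp</span>,
  <span class="tel">555-0100</span>
</div>
"""

parser = HCardParser()
parser.feed(html)
print(parser.card)
```

No natural language processing is involved: the class attributes alone say which text is the name, the organisation and the phone number.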

Microformats are not just useful for the Semantic Web, which is not expected until 2010; embedding them is also likely to improve your rankings in search listings. With the amount of data on the web growing exponentially, keyword-based search is becoming less of a solution. Yahoo has already announced that it will process web content for semantic data and that it supports microformats.

Tags: CMS, microformat, rdfa, SemanticWeb, yahoo
Posted in CMS, SemanticWeb, Web 3.0
