I decided to take a look at Nutch and see how useful it might be with regard to our overall goal for the project... and didn't think it was quite what we are looking for. Better yet, I think we should write our own functions and web front ends to better suit our needs. Read on for my reasoning...
For our purposes, Nutch would do two jobs: parse multiple document formats so they can be indexed into Lucene, and serve as the front-end search engine that searches the Lucene db and displays the results in some sort of organized fashion. This is all fine and dandy, but it is my thought that Nutch, while snazzy and already written, may be a step sideways rather than forward. Nutch does currently have a PDF parser (although it is not fully supported) along with a few others, but not very many. Via our friends at Google and our good friend parr-t's jGuru, I have found a great resource on using other open source parsers with Lucene, some of which even output a format specifically designed to be indexed by Lucene. The URL of the jGuru Lucene FAQ pages is http://www.jguru.com/faq/subtopic.jsp?topicID=473821. The front-end search engine, while already built, would take some modification to do what we want. Given what we plan to search by, what output we intend to have, and the manner in which we will need to retrieve our entries, I think Nutch may not be quite what we're looking for. This is my standpoint as of right now. Who knows, time may go by and I may decide otherwise, but currently that is what I have found in regards to Nutch.
Posted by cfrasche at September 9, 2004 05:31 PM
A couple of notes on nutch:
+ Nutch is built of plugins. All of its behavior can be changed by appending or replacing plugins. That means your changes are localized, which makes it easy to divide up the work.
+ A content-type switched parser framework is not an insignificant amount of work. Nutch already has this with tested parsers for the main doc types (msword, pdf, with excel on the way).
+ Nutch is i18n; it takes care of character encodings end-to-end and even takes a good shot at language recognition.
+ In general Nutch is thinking big; it's thinking distributed from the get-go and already has a first cut at clustered querying.
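To give a rough idea of what a content-type switched parser framework involves, here is a minimal sketch in Java. It is not Nutch's actual plugin API: the names ParserRegistry, DocumentParser, register and parse are all hypothetical, and the only parser shown handles plain text. A real setup would register PDF or MS Word parsers the same way and hand the extracted text to the indexer.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch; none of these names come from Nutch or Lucene. */
public class ParserRegistry {

    /** Turns the raw bytes of one document format into plain text for indexing. */
    public interface DocumentParser {
        String extractText(byte[] content);
    }

    // One parser per normalized content type, e.g. "application/pdf".
    private final Map<String, DocumentParser> parsersByType = new HashMap<>();

    public void register(String contentType, DocumentParser parser) {
        parsersByType.put(normalize(contentType), parser);
    }

    /** Picks the parser by content type and returns the extracted text. */
    public String parse(String contentType, byte[] content) {
        DocumentParser parser = parsersByType.get(normalize(contentType));
        if (parser == null) {
            throw new IllegalArgumentException("No parser registered for " + contentType);
        }
        return parser.extractText(content);
    }

    // Strip parameters such as "; charset=utf-8" and lower-case the type.
    private static String normalize(String contentType) {
        int semicolon = contentType.indexOf(';');
        String bare = semicolon >= 0 ? contentType.substring(0, semicolon) : contentType;
        return bare.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        ParserRegistry registry = new ParserRegistry();
        registry.register("text/plain", bytes -> new String(bytes, StandardCharsets.UTF_8));
        String text = registry.parse("Text/Plain; charset=utf-8",
                "hello lucene".getBytes(StandardCharsets.UTF_8));
        System.out.println(text);
    }
}
```

Even this toy version has to deal with type normalization and the unknown-type case, which hints at why the full thing, with tested parsers and character encodings handled end-to-end, is a real chunk of work.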
If you go with nutch you can avoid time spent messing with lucene read/write lock cleanup and too-many-open-files issues.
The nutch webapp is i18n but it's awkward to work with, given its mix of jsp, static pages and xsl-produced pages. I could see jettisoning this part of nutch.
Posted by: St.Ack at September 13, 2004 03:38 PM