Release 0.1 of Chronica is now active....
We have now completed the following components of Chronica:
== ARCRecordReader: will read the records from arc files in a given directory and
index the content by calling ChronicaInterface to create or add to a Lucene index.
== ChronicaInterface: Manages the Lucene index and the queries from the Web Interface.
== ChronicaParsers: Contains the parsers used to parse the content of different document types, html, pdf, etc. Currently limited to html, plain text and pdf.
== WebInterface: Provides the user interface to enter keyword queries that search the Lucene index and display the returned results. Also handles opening a clicked link by calling ARCRecordReader to request a specific record from a given arc file.
There are other subcomponents, but these are the main parts.
The next phase includes adding more parsers for additional document formats, integrating more link and temporal evaluation into the query results, and linking the search results to the Wayback Machine.
After taking the time to clean up the summaries and titles of the html parser output, I moved on to adding more functionality. PDFBox was the solution for pdf files, so the current index, built off a small crawl from Rudd, has several pdf files whose content is searchable. Good times.
Yeah, so check your email, I fixed the weird character bug in the ARR.extractText()....
My bad, I was collecting the read data wrong.... I think we are good to go for Thursday....
Okay more work on the interface/related code...Chris and I have gotten the search time and total results showing, and I've written a cache (on-disk) for the results, populated on demand. Having some slight issues with header info showing up when you look at a result, and currently working on multi-page navigation for the results page. Also...still no images/embedded content for the results, but that'll be the next hurdle.
The ARCRecordReader is updated ....
Post updated on 9.26.04 @ 14:20: extractRecord now returns a File object, and the file name is of the form arcfilename_recordoffset.mimetype-extension
FYI,
The hash.dat file that is generated is a result of Jason's hash of the files indexed that is updated upon completion of an indexing run. You have to delete this file if you want to reindex any arc file in the same dir, else the ARCHash will skip it when it passes the list of files to index to the ARCRecordReader.
I added an extractRecord(File file, int offset) for Deniz that will extract a record, write it to a file in a ./tmp dir, and return the file object. This is for the cache that Deniz started. The file name is of the form arcfilename_recordoffset.mimetype-extension, but this will have to change as it's not unique. Perhaps a random number on the end or some hash of the url...
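Here is a rough sketch of the "hash of the url" idea for the cache file name. It is only an illustration: the class name, the method names, and the choice of MD5 are all assumptions, not what is in the Chronica source.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of the Chronica source.
class CacheNames {

    // Builds a name of the form arcfilename_recordoffset_urlhash.extension
    static String cacheFileName(String arcFileName, long offset, String url, String extension) {
        return arcFileName + "_" + offset + "_" + md5Hex(url) + "." + extension;
    }

    // Returns the MD5 digest of s as 32 lowercase hex digits.
    static String md5Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
            // Left-pad with zeros so every hash has the same length.
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 should be available on every Java platform", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(cacheFileName("crawl-001.arc", 4096L, "http://www.usfca.edu/", "html"));
    }
}
```

A hash keeps the name stable between runs, so the cache can find a file it wrote earlier. A random number on the end would not allow that.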
The ARR will now index all records as well, but only reads the body content of text/html mime types currently.
Today I successfully received an arc file from Rudd, and using the code of Rudd, Jason, Deniz and myself... we can now read, parse, index, and search on data from arc files.
Current functionality is limited to plain text, but as we all know, that is simply a matter of reading mime-types and parsing data. We still need to work on displaying the actual page stored in the arc file, but again, that's already proven (wayback machine), and it's simply a matter of time before we implement that and begin indexing on a larger scale. We also have a few ideas going around for what sort of statistics we should try to gather with the data we will have; any comments or ideas would be welcome.
The interface is almost done...everything on it so far works, and as soon as getting the data out of the archives is completely done, the links to the results will work. The lucene ranks for each result are displayed as a score bar, and I've set it up so that the results page can be bookmarked (the form data is now placed in the url, so you can jump directly to it).
FYI, if you want to view sample results, use the search queries "currently" or "does nothing" with the default date range to get mockup results. In order to see how the score changes based on the input date range, you can change it based on the results' dates.
I added the field long offset to the ChronicaRecord object as we need that to pull the record from the file during a search. The latest is in /chronica/src/chronica/
As per some requests, I've redone the main page. Currently setting up the search results page - right now it'll populate the search fields with what you just searched for, and print out debug text for each variable. Since we have the search result object created now, displaying results should be a snap.
Read below for outline of ARCRecordData object that will be passed from the ARCRecordReader to the ChronicaInterface (Chris: this is what will be coming to you). Let me know what fields need to be added etc.
ARCRecordData:
// Contains the data associated with one record from an arc file.
Data members:
String bodyText; // string of text from the body of the record
String indexString; // index string for Lucene to allow reference, consists of full ARC file path/name
// and offset of record into arc file of the form: fullpath??offset
ARCMetaData metadata; // metadata from the record.
(apache) Header[] headers; // http headers from record, taken from the arc file.
NOTE: This is for text records only at the moment.
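To make the outline above a bit more concrete, here is a minimal sketch of what the object could look like in Java. Two things are assumptions for the sake of a self-contained example: the headers and metadata are simplified to plain string maps (the outline uses the apache Header[] and an ARCMetaData object), and the arcPath() and offset() helpers are hypothetical additions that split the fullpath??offset index string back apart.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: field types are simplified stand-ins for Header[] and ARCMetaData.
class ARCRecordData {
    static final String SEPARATOR = "??";

    final String bodyText;               // text from the body of the record
    final String indexString;            // fullpath??offset, used as the Lucene reference
    final Map<String, String> metadata;  // metadata from the record (simplified)
    final Map<String, String> headers;   // http headers from the record (simplified)

    ARCRecordData(String arcPath, long offset, String bodyText,
                  Map<String, String> metadata, Map<String, String> headers) {
        this.bodyText = bodyText;
        this.indexString = arcPath + SEPARATOR + offset;
        this.metadata = metadata;
        this.headers = headers;
    }

    // Hypothetical helper: split on the LAST separator so a path containing "??" still parses.
    String arcPath() {
        return indexString.substring(0, indexString.lastIndexOf(SEPARATOR));
    }

    // Hypothetical helper: the record offset encoded after the separator.
    long offset() {
        int start = indexString.lastIndexOf(SEPARATOR) + SEPARATOR.length();
        return Long.parseLong(indexString.substring(start));
    }

    public static void main(String[] args) {
        Map<String, String> headers = new LinkedHashMap<String, String>();
        headers.put("Content-Type", "text/html");
        ARCRecordData rec = new ARCRecordData("/data/arcs/crawl-001.arc", 4096L,
                "hello world", new LinkedHashMap<String, String>(), headers);
        System.out.println(rec.indexString);
        System.out.println(rec.arcPath() + " @ " + rec.offset());
    }
}
```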
One other issue is how to link embedded jpgs to the html document. How does the wayback machine do it?
See link here for a rough diagram of the Chronica parts and their interactions.
The ARCRecordReader is finally functional....
It will loop through all the files in a directory, pull the (text) records from each arc file, and put the text, the Http headers, an indexString consisting of the full arc file path and the absolute offset of that record into the file, and the record's metadata into an ARCRecordData object. This will be passed to the ChronicaInterface indexer to be indexed into Lucene.
TODO: How to link embedded images to html pages returned? For the search and results part....
I've had success on custom indexes that i've built with searchable content and dates... our current implementation will use the unix epoch as the stored date, which will make for easy filtering..
the only problem I ran into, which took me a bit to figure out, is that lucene treats all range bounds as strings... therefore 1,2,3,4,5,6,7,8,9,10 are not compared as numbers, and zero-padded values like 01,02,03,04,05,06,07,08,09,10 should be used instead... just a little FYI for you all.
Deniz has coded up a nice front end and is looking into a nice popup calendar, Rudd and Jason are almost ready to combine their code with mine, we're just having a little issue with reading the raw data from an arc file. Once that gets straightened out, we'll be good to go.
I've set up chronica.cs.usfca.edu with resin and apache to serve up our project.. all is going well.
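A quick sketch of the padding fix in plain Java, with no Lucene calls involved. The class and method names are made up for illustration. The same trick matters for the epoch dates: epoch seconds only reached 10 digits in September 2001, so earlier crawl dates have 9 digits and would compare wrongly as strings unless padded to a common width.

```java
// Hypothetical helper, shown only to illustrate string vs numeric ordering.
class PaddedNumbers {

    // Pads a non-negative number with leading zeros so string order matches numeric order.
    static String pad(long value, int width) {
        if (value < 0) {
            throw new IllegalArgumentException("negative values need a different encoding");
        }
        String s = Long.toString(value);
        if (s.length() > width) {
            throw new IllegalArgumentException("value does not fit in width " + width);
        }
        StringBuilder sb = new StringBuilder();
        for (int i = s.length(); i < width; i++) {
            sb.append('0');
        }
        return sb.append(s).toString();
    }

    public static void main(String[] args) {
        // Unpadded: "10" sorts before "9" because '1' < '9'.
        System.out.println("10".compareTo("9") < 0);
        // Padded: "10" sorts after "09", as the numbers do.
        System.out.println(pad(10, 2).compareTo(pad(9, 2)) > 0);
    }
}
```

The width has to be chosen up front and used for every value at both index time and query time, otherwise the range bounds and the stored terms won't line up.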
The interface is coming along...the jsp page is here, and the interface for searching the index and displaying results is underway. More later.
I have been looking at subversion as the source control instead of CVS, see below for more info.
Go to this link here to read a rather informative user guide/book on use and operation. Looks to be a nice combination of CVS and some features of perforce. Worth a look.
Subversion homepage is here
We hashed out some more stuff today and allocated the first parts to the team members...
Deniz is looking into how to integrate the Nutch indexer and interface into our ARC file record reader. Jason is working on keeping track of indexed vs. non-indexed ARC files in conjunction with the record reader, which Rudd is working on. Chris is experimenting with Lucene and collecting info on how best to configure that whole scheme.... Must get back to coding again....
I'm not sure if this will be possible in the time given this semester, but I think it would be a really good idea to use more than just the crawl date when indexing the arc files. It wouldn't be too much of a stretch to rip dates out of the documents themselves and use them in the indexing process. For example, obviously there won't be any crawl dates from dec 7, 1941, but that date can be found very frequently in articles and stories about pearl harbor. We could separate the crawl context from the document context. We need to be careful when searching this, though, because it will effectively create two instances of a certain date for a document... Perhaps when searching temporally we can have a normal text entry for the search string, and some select menus to select the crawl time context...
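As a very rough sketch of ripping dates out of document text, something like the regex below would catch dates written like "Dec 7, 1941". Everything here is an assumption made for illustration (the class name, the single date format handled, the yyyyMMdd output), and the pattern is deliberately loose, so words that merely start with a month abbreviation can produce false hits.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: finds "Month day, year" style dates in free text.
class DocumentDates {
    private static final String[] MONTHS =
            {"jan", "feb", "mar", "apr", "may", "jun", "jul", "aug", "sep", "oct", "nov", "dec"};

    // Matches e.g. "Dec 7, 1941", "December 7 1941", "Sept. 11, 2001".
    private static final Pattern DATE = Pattern.compile(
            "\\b(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]*\\.?\\s+(\\d{1,2}),?\\s+(\\d{4})\\b",
            Pattern.CASE_INSENSITIVE);

    // Returns every date found, in document order, as a yyyyMMdd string.
    static List<String> extract(String text) {
        List<String> found = new ArrayList<String>();
        Matcher m = DATE.matcher(text);
        while (m.find()) {
            int month = monthNumber(m.group(1));
            int day = Integer.parseInt(m.group(2));
            int year = Integer.parseInt(m.group(3));
            if (day >= 1 && day <= 31) {   // crude sanity check, no per-month validation
                found.add(String.format("%04d%02d%02d", year, month, day));
            }
        }
        return found;
    }

    private static int monthNumber(String prefix) {
        String p = prefix.toLowerCase(Locale.ROOT);
        for (int i = 0; i < MONTHS.length; i++) {
            if (MONTHS[i].equals(p)) {
                return i + 1;
            }
        }
        throw new IllegalArgumentException("not a month prefix: " + prefix);
    }

    public static void main(String[] args) {
        System.out.println(extract("Pearl Harbor was attacked on Dec 7, 1941."));
    }
}
```

Dates found this way could go into their own index field, separate from the crawl date, which keeps the crawl context and the document context apart as described above.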
chronica@cs.usfca.edu
has been created to mail both the USF and IA associated folks.
We now have a simple Lucene indexer and search interface.
Currently only the document path (a string) and the document contents (string) are indexed. The ChronicaInterface has a resetIndex() function, closeIndex(), indexDocument(String filename, String conents), and chronicSearch(String search_term) ...
I've written a chronicaTest which indexes a few strings and labels for them and then searches.
Currently chronicSearch returns a LinkedList of the filenames found.
java 1.5 gave me a bit of pain when doing LinkedLists... apparently it's a bit pickier these days about casting, types, etc etc...
For simply storing a linked list of strings....
LinkedList filenames = new LinkedList(); in java <= 1.4
LinkedList<String> filenames = new LinkedList<String>(); in java 1.5
strangely enough, when catching the return where filenames is sent, you simply use LinkedList results = chronicSearch("some_term"); no <String>
eh, it works.
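For the curious, here is a small sketch of why leaving off the <String> still works: java 1.5 lets you assign a parameterized list to a plain (raw) LinkedList for backwards compatibility, you just lose the type information and have to cast on the way out. The chronicSearch below is a stand-in that returns canned data, and the assumption that it is declared to return LinkedList<String> is mine.

```java
import java.util.LinkedList;

class GenericsDemo {

    // Stand-in for chronicSearch; the real one queries the Lucene index.
    static LinkedList<String> chronicSearch(String searchTerm) {
        LinkedList<String> filenames = new LinkedList<String>();
        filenames.add("docs/" + searchTerm + ".txt");
        return filenames;
    }

    public static void main(String[] args) {
        // Raw type: compiles, but elements come back as Object and need a cast.
        LinkedList raw = chronicSearch("some_term");
        String first = (String) raw.getFirst();

        // Parameterized type: the compiler knows the elements are Strings, no cast needed.
        LinkedList<String> typed = chronicSearch("some_term");
        String alsoFirst = typed.getFirst();

        System.out.println(first.equals(alsoFirst));
    }
}
```

Keeping the <String> on the receiving side is the safer habit, since the compiler will then catch a wrong element type instead of a ClassCastException showing up at run time.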
Yesterday the team met and began to discuss a bit further exactly what needs to happen in order to meet our first release.. yada yada yada ... more importantly... our machine, chronica.cs.usfca.edu is almost up and running, we spent the day yesterday setting it up.
There are still a few kinks in the networking aspect we're working out, but i suspect our russian amigo will hook us up with the support we need soon enough. We have available about 150gb of space dedicated to crawls on that box, so we should be good to go for our testing purposes.
Here are a few questions that we need to answer: See expanded section...
-> How to return a page from an ARC file to the user, i.e. extraction and display. How does IA do it? Is the page cached and then displayed or is it dynamically created straight from the ARC file?
-> (Thinking ahead to D2) What language should the Web interface be in? Java? php?
-> Should we create a completely separate application to index the ARC files after they are created, or should Chronica be linked to Heritrix (i.e. as an extra filter/processor that indexes each ARC file after it is created by Heritrix)?
-> Should the links (including offsets into the ARC file) or the actual ARC records be passed back to the user interface when the Lucene results are collected?
I decided to take a look at nutch and see how useful it might be in regards to what our overall goal is for the project... and didn't think it was quite what we are looking for. Better yet, I think we should write our own functions and web front ends to better suit our needs .... Read on for my reasoning...
For our purposes, Nutch would be the parser of multiple document formats so that they can be indexed into lucene, and it would be the front end search engine to search the lucene db and display the results in some sort of organized fashion. This is all fine and dandy, but it is my thought that Nutch, while snazzy and already written, may be a step sideways vs forward. Nutch does currently (although not fully supported) have a pdf parser along with a few others, but not very many. I have found, via our friends at google and our good friend parr-t!'s jGuru, a great resource for using other open source parsers with lucene, some of which even output a format specifically designed to be indexed by lucene. The url of the jGuru lucene faq pages is http://www.jguru.com/faq/subtopic.jsp?topicID=473821. The front end search engine, while already built, would take some modification in order to do what we want. Given what we plan to search by, what output we intend to have, and the manner in which we will need to retrieve our entries, I think Nutch may not be quite what we're looking for. This is my standpoint as of right now. Who knows, time may go by and I may decide otherwise, but currently that is what I have found in regards to Nutch.
Below is an outline for the projected timeline of the Chronica project, including goals and components for each deliverable .... Click to see the outline...
Chronica Project: Internet Archive Search Engine for Archive Files.
Overview:
Create a search engine that will work in conjunction with a web crawler
to index and create a Lucene index base of archive (ARC) files
created by the web crawler. Also develop a web interface to query
the ARC file index for searching by keyword and content. In addition,
create a time based search and display for queries into the index.
Initial application composed from components of Nutch and Heritrix web
crawlers.
Goals:
Deliverable 1:
Adapt either Nutch or Heritrix to do the following:
Output ARC files composed of web data collected and
index ARC files using Lucene to create an index that will allow
searching by keyword parameters. Content of index
will be limited to HTML and plain text content of ARC files at this time.
Deliverable 2:
Create a web browser interface to conduct search of ARC file Lucene
database. Add pdf, MS Word document, image search support to indexing
of ARC files.
Deliverable 3:
Create a time and date based search query for the Web interface
developed above to display a text description of search term significance and trends
on the results page of the web interface.
Deliverable 4:
Further develop the time and date based search query for the Web interface
developed above to display a graph of search term significance and trends
on the results page of the web interface, in addition to keyword links.
Deliverable 5 (Final):
Total system described above including search engine/crawler
with web interface.
This is the Official Blog for the Chronica Project.
Regular updates will be posted as the project progresses.
Click to view details about the project...
Chronica is the Latin word for chronicle, meaning a "usually continuous historical account of events arranged in order of time without analysis or interpretation", which fits the idea of a temporal search engine in conjunction with the Internet Archive.
Team members are Chris Fraschetti, Deniz Efendioglu, Jason Endo and Rudd Stevens.
Sponsors from Internet Archive include Igor Ranitovic and Brewster Kahle.
Sponsors from USF include Prof. Terence Parr and Prof. Oliver Grillmeyer.