Here is the API documentation for Chronica so far. Some comments still need updating, and the docs will be updated as we go.
The current searchable index contains 2,194,824 documents. That's the combination of the UK docs from the IA and the local crawls done here at USF.
The total indexing took 57 hours to process 1,473 arc files totaling 104GB of data, creating a local Lucene index of 1.5GB.
104GB to 1.5GB ... not bad I would say.
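For a quick sanity check of those figures, here is a minimal back-of-the-envelope sketch. It only divides the numbers quoted above; the class and method names are made up for illustration. It works out to roughly 69:1 raw data to index, about 1.8GB of arc data per hour, and about 38,500 documents per hour.

```java
// Back-of-the-envelope arithmetic on the indexing figures quoted above.
// IndexStatsSketch and its method names are hypothetical.
class IndexStatsSketch {
    // How many units of raw data went into one unit of index.
    static double ratio(double rawGb, double indexGb) {
        return rawGb / indexGb;
    }

    // Average amount processed per hour.
    static double perHour(double amount, double hours) {
        return amount / hours;
    }

    public static void main(String[] args) {
        System.out.printf("raw : index  = %.1f : 1%n", ratio(104, 1.5));
        System.out.printf("GB per hour   = %.2f%n", perHour(104, 57));
        System.out.printf("docs per hour = %.0f%n", perHour(2_194_824, 57));
        System.out.printf("arcs per hour = %.1f%n", perHour(1_473, 57));
    }
}
```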
Due to some serious performance issues, this index was built without PDF parsing enabled. I can't imagine the overall index would be much bigger with it enabled, though, due to the fact that we only use the raw text.
We're still working steadily.. so stay tuned for more updates!
I added the XML config reader to the repository and added support for it to ARCRecordReader...
ARCRecordReader can now function without cmd line args... assuming the config file is configured and in the same directory. The reader will still work with cmd line args. I still need to do more thorough testing on it. Ask me if something is not clear. To use the ChronicaConfigReader, first call ChronicaConfigReader.getConfigReader(), then call getParameter(ChronicaConfigReader.parameterKey) on the reader it returns.
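To make the lookup pattern concrete, here is a minimal sketch of how such a reader might work. This is not the actual ChronicaConfigReader: the class name ConfigReaderSketch, the XML layout (parameter elements with name and value attributes), the arcDirectory key, and the resolve helper are all assumptions for illustration. It parses from a string so the example is self-contained, whereas the real reader loads a file from the working directory.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical stand-in for ChronicaConfigReader; the XML layout is assumed.
class ConfigReaderSketch {
    // Hypothetical key, playing the role of ChronicaConfigReader.parameterKey.
    static final String ARC_DIR = "arcDirectory";

    private final Map<String, String> parameters = new HashMap<>();

    // Builds a reader from XML text shaped like:
    // <config><parameter name="..." value="..."/></config>
    static ConfigReaderSketch fromString(String xml) {
        ConfigReaderSketch reader = new ConfigReaderSketch();
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("parameter");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element e = (Element) nodes.item(i);
                reader.parameters.put(e.getAttribute("name"), e.getAttribute("value"));
            }
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse config", e);
        }
        return reader;
    }

    // Returns the configured value, or null if the key is absent.
    String getParameter(String key) {
        return parameters.get(key);
    }

    // A command line argument, when present, wins over the config file value.
    static String resolve(String[] args, int index, ConfigReaderSketch config, String key) {
        return index < args.length ? args[index] : config.getParameter(key);
    }

    public static void main(String[] args) {
        ConfigReaderSketch config = fromString(
                "<config><parameter name=\"arcDirectory\" value=\"/data/arcs\"/></config>");
        System.out.println(resolve(args, 0, config, ARC_DIR));
    }
}
```

The resolve helper is one possible way to keep both behaviours described above: run with no arguments and the config value is used, pass an argument and it takes precedence.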
Should we still be checking for record type in ARCRecordReader? Should this be read from the config file? Is there anything else that can be cleaned up in there that we don't need any more?
Since we're now able to index real data that's already in the wayback machine, I set it up to link straight to it...and it works. As long as the page is in the wayback machine, the links will work. Feel free to test it out. Other than that...I've done some refactoring and aesthetic work...StringTemplate has proven quite useful...the entire search field is now a template, and I've set up the links for search results with multiple pages...the regular expression I had to use for that was giving me trouble, but I finally got it. Fixed up the passing of the search terms string, so that strings with quotes in them won't get cut off when they're passed from page to page...plus signs are still causing issues though, because of the way query strings are automatically encoded and decoded. Other than that though, everything seems to be running smoothly...now to work on some of the real time-based searching features Chris and I have been discussing.
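The plus sign trouble is most likely the standard form-encoding rule, where a space is written as + and a literal plus has to be written as %2B. Here is a minimal sketch using the JDK's URLEncoder and URLDecoder. The class name, the search.jsp path, and the q and page parameter names are assumptions for illustration, not the actual Chronica UI code.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical helper for carrying search terms from one results page to the next.
class SearchTermEncoding {
    // Encodes terms for use in a query string: space becomes '+', '+' becomes %2B,
    // and a double quote becomes %22.
    static String encode(String terms) {
        try {
            return URLEncoder.encode(terms, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // Reverses encode(); applied to text that was never encoded, it turns '+' into a space.
    static String decode(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    // Builds a link to another page of results (path and parameter names are assumed).
    static String pageLink(String terms, int page) {
        return "search.jsp?q=" + encode(terms) + "&page=" + page;
    }

    public static void main(String[] args) {
        String terms = "c++ \"time travel\"";
        System.out.println(pageLink(terms, 2));
        System.out.println(decode(encode(terms)));
        System.out.println(decode("a+b"));
    }
}
```

Encoding exactly once when the link is built, and decoding exactly once when the request is read, keeps both the quotes and the plus signs intact. Decoding a string that was never encoded is what turns a+b into "a b".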
Well, I can't find the cd with the backup of my inward links code... crap. Anyway, it's not as bad as it sounds because I was probably going to rewrite the whole thing anyway. The code had a lot of memory holes in it and would probably have made the indexing a lot slower. My code is basically going to be a plugin to rudd's that'll take the contents of the page (String), create a page object with some index, walk through the string looking for a hrefs, and then push them into the hashtable of links.
From an email I sent earlier today:
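Here is a rough sketch of the shape that plugin could take, assuming a regular expression is good enough to find the a hrefs. The class name, the regex, and the choice of a map from target URL to the pages linking to it are my own assumptions for illustration. The real code will use a page object and index as described above, and a proper HTML parser would handle messy markup better.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of inward link collection; not the actual plugin.
class InwardLinksSketch {
    // Matches a quoted href value inside an <a ...> tag. Simplistic on purpose:
    // unquoted attribute values and links inside comments or scripts are not handled.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Target URL -> URLs of the pages that link to it.
    private final Map<String, List<String>> inward = new HashMap<>();

    // Walks the page contents and records every link target found in it.
    void addPage(String pageUrl, String html) {
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            inward.computeIfAbsent(m.group(2), k -> new ArrayList<>()).add(pageUrl);
        }
    }

    // Pages known to link to the given target; empty if none were seen.
    List<String> linksTo(String target) {
        return inward.getOrDefault(target, new ArrayList<>());
    }

    public static void main(String[] args) {
        InwardLinksSketch links = new InwardLinksSketch();
        links.addPage("http://a.example/", "<a href=\"http://c.example/\">c</a>");
        links.addPage("http://b.example/", "<a href='http://c.example/'>c again</a>");
        System.out.println(links.linksTo("http://c.example/"));
    }
}
```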
Fellas,
http://chronica.cs.usfca.edu/chronica/src/
that address will point you to the src that the web ui is currently running off of... it's essentially our 'stable' build. There are some jsp pages in another directory that take care of a lot of the html end; let me know if you'd like those accessible too and I'll make them available.
The bulk of the code is in three files...
ARCRecordReader is where all the arc file io is done....
ChronicaInterface is where all the indexing and searching happens
WebInterface is where the ui hooks in to make all this craziness work.
ARCRecordReader isn't really used on the ui end, except in our current case of viewing the content of a file, which will soon switch from our viewing to the wayback machine.
-Chris Fraschetti
I ran one of our larger arc files with 215 pdf and 1287 html formatted docs...
I timed each parse and calculated the average parse time for each ..
PDF: 3.89 seconds
HTML: 0.019 seconds
Next I found a text-heavy pdf and, from a combination of many
pages and web search results, created a very text- and tag-intensive
html doc....the html doc was 5.81MB and the pdf was 4.57MB
HTML: 2.46 seconds
PDF: 3 minutes, 24 seconds
... excluding I/O time...
so yeah, definitely going to need some threading here... :)
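Going by the averages above, the 215 PDFs account for roughly 836 seconds of parse time against about 24 seconds for the 1,287 HTML docs, so the PDFs are where threading would pay off. Below is a minimal sketch of one way to do it with a fixed thread pool from java.util.concurrent. The class name and the parse method are placeholders, not the actual Chronica parsers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: parse documents in parallel, keep results in input order.
class ParallelParseSketch {
    // Placeholder for the real (slow) PDF or HTML text extraction.
    static String parse(String rawDoc) {
        return rawDoc.trim().toLowerCase();
    }

    // Submits one task per document and collects the parsed text in the original order.
    static List<String> parseAll(List<String> rawDocs, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (String doc : rawDocs) {
                tasks.add(() -> parse(doc));
            }
            List<String> parsed = new ArrayList<>();
            // invokeAll returns futures in the same order as the submitted tasks.
            for (Future<String> result : pool.invokeAll(tasks)) {
                parsed.add(result.get());
            }
            return parsed;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("parsing interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a parse task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(parseAll(Arrays.asList(" First Doc ", "SECOND doc"), 2));
    }
}
```

Keeping the results in input order means the indexing step that follows can stay single-threaded and unchanged, with only the slow parsing farmed out to the pool.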
So I finally had enough iterations through indexing the large set of arc files we've gathered, cleaning up bugs and errors in the code/files along the way, to where I was comfortable letting the index finish and trying a search on it.
There was a small hiccup. I had originally indexed the date field in the seconds since epoch format, which worked fine on a small index... BUT... on a large index it ran into a limitation in Lucene. No worries.. I took care of it.
Apparently adding +date_field[SOMEDATE TO ANOTHERDATE] to the query evaluates as a sequence of boolean clauses... which, on the scale needed for comparing the epoch timestamps, exceeded the default 1024 clauses allotted to a search. I cranked that up to Integer.MAX_VALUE just to test the search... but .... that just gave me java heap issues...
The reason for keeping the epoch timestamp was to keep the granularity the arc files allowed, but for a date-based search, the hour of the day is pretty useless, so I switched all the code over to the yyyyMMdd format... which required me to completely reindex... but alas... success.. I can search out some 71000+ 'docs' in no time at all.
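To show why the coarser format helps, here is a minimal sketch that converts an epoch timestamp to a yyyyMMdd term and compares the upper bound on distinct terms a range could touch at each granularity. It uses java.time rather than the Lucene API, assumes UTC, and the class and method names are made up for illustration. A one-year range covers at most 365 day terms, comfortably under the 1024 clause limit, while at one-second granularity the same range could in principle hold up to 31,536,000 distinct values (in practice, however many distinct timestamps are actually in the index).

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;

// Hypothetical sketch of the date field change; not the actual Chronica code.
class DateFieldSketch {
    private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("yyyyMMdd");

    // Converts seconds since the epoch to a yyyyMMdd term (UTC is an assumption).
    static String toDayTerm(long epochSeconds) {
        return Instant.ofEpochSecond(epochSeconds).atZone(ZoneOffset.UTC).toLocalDate().format(DAY);
    }

    // Upper bound on distinct day terms in an inclusive yyyyMMdd range.
    static long maxDayTerms(String from, String to) {
        return ChronoUnit.DAYS.between(LocalDate.parse(from, DAY), LocalDate.parse(to, DAY)) + 1;
    }

    // Upper bound on distinct one-second terms over the same range.
    static long maxSecondTerms(String from, String to) {
        return maxDayTerms(from, to) * 86_400L;
    }

    public static void main(String[] args) {
        System.out.println(toDayTerm(1104537600L));
        System.out.println(maxDayTerms("20050101", "20051231"));
        System.out.println(maxSecondTerms("20050101", "20051231"));
    }
}
```

A side benefit of yyyyMMdd is that the terms sort the same way as strings and as dates, which is what a term-based range comparison relies on.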
On a side note.. indexing approx 5.5GB took a little over 3 hours ... a nice chunk of time was tacked on when I added PDF parsing... (mainly because a lot of our crawls include MANY MANY professors' lectures in pdf format) ... perhaps once we get larger and more diverse crawls, I can get better statistics on our indexing.