October 20, 2004

Documentation, first go.

Here is the API documentation for Chronica so far. Some comments still need to be updated, etc. Will be updated.

Posted by rstevens at 04:51 PM | Comments (0)

October 19, 2004

Behold..

Time-plot search
The time-plot search is working. I set the images up using a GPL chart package I found on SourceForge, and it'll graph any query or combination of queries that you give it (see the "need help?" button for more info). Only a few other minor changes besides that, but I'm really excited I was able to get the chart working. If you give it a generic term over all the data, you can see where our data clumps are...refine the search and you can get a clearer picture of the term frequency. Right now it'll only count the number of results, but we're working on a better way to score queries.
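
Counting results per time bucket is the part that feeds the chart. A minimal sketch of that step, assuming each hit carries a yyyyMMdd date string (the format described in the October 01 post) and that hits are grouped by month; the class and method names are made up for illustration and the actual bucketing in Chronica may differ:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of the Chronica source.
class TimePlotSketch {

    // Groups yyyyMMdd date strings by their yyyyMM prefix and counts them.
    // A TreeMap keeps the months in chronological order for plotting.
    static Map<String, Integer> countByMonth(List<String> dayTerms) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String day : dayTerms) {
            String month = day.substring(0, 6); // "20041015" -> "200410"
            counts.merge(month, 1, Integer::sum);
        }
        return counts;
    }
}
```

Each map entry becomes one point on the plot: the key is the position on the time axis and the value is the number of results.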
Posted by defendio at 03:57 AM | Comments (0)

October 18, 2004

Chronica index update

The current searchable index contains 2,194,824 documents. That's the combination of the UK docs from the IA, as well as the local crawls done here at USF.

Indexing took 57 hours in total, covering 1,473 arc files (104 GB of data) and producing a local Lucene index of 1.5 GB.

104GB to 1.5GB ... not bad I would say.
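
For a sense of scale, the figures above work out to an index that is roughly 1.4% of the input size, about 1.8 GB indexed per hour, and about 2.3 minutes per arc file. A small sketch of that arithmetic (the class and method names are only for illustration):

```java
// Hypothetical helper that just restates the arithmetic on the posted numbers.
class IndexStatsSketch {

    // Index size as a percentage of the raw input size.
    static double ratioPercent(double indexGb, double inputGb) {
        return 100.0 * indexGb / inputGb;
    }

    // Average amount of input processed per hour.
    static double gbPerHour(double inputGb, double hours) {
        return inputGb / hours;
    }

    // Average wall-clock minutes spent per arc file.
    static double minutesPerFile(double hours, int files) {
        return hours * 60.0 / files;
    }

    public static void main(String[] args) {
        System.out.println("index % of input: " + ratioPercent(1.5, 104));
        System.out.println("GB per hour:      " + gbPerHour(104, 57));
        System.out.println("minutes per file: " + minutesPerFile(57, 1473));
    }
}
```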

Due to some serious performance issues, this index was built without PDF parsing enabled, so I can't imagine the overall index size could possibly be much bigger than what it currently is, due to the fact that we only use the raw text.


We're still working steadily.. so stay tuned for more updates!

Posted by cfrasche at 06:05 PM | Comments (0)

October 16, 2004

Added XML config reader

I added the XML config reader to the repository and added support to ARCRecordReader...

ARCRecordReader can now function without cmd line args... assuming the config file is configured and in the same directory. The reader will still work with cmd line args. I still need to do more thorough testing on it. Ask me if something is not clear. To use the ChronicaConfigReader, first call ChronicaConfigReader.getConfigReader(), then call getPrameter(ChronicaConfigReader.parameterKey) on it.
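
To make the "cmd line args first, config file otherwise" behavior concrete, here is a rough sketch. It is not the real ChronicaConfigReader: the class, the method names, the `<param name="..." value="..."/>` layout, and the `arcDir` key are all assumptions made up for illustration, and it parses XML from a string rather than a file to stay self-contained.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical stand-in for a config reader; names and XML layout are assumed.
class ConfigReaderSketch {
    private final Map<String, String> params = new HashMap<>();

    // Reads every <param name="..." value="..."/> element into a map.
    static ConfigReaderSketch fromXml(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            ConfigReaderSketch reader = new ConfigReaderSketch();
            NodeList nodes = doc.getElementsByTagName("param");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element e = (Element) nodes.item(i);
                reader.params.put(e.getAttribute("name"), e.getAttribute("value"));
            }
            return reader;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not read config XML", e);
        }
    }

    // Returns the configured value, or null if the key is not present.
    String getParameter(String key) {
        return params.get(key);
    }

    // A command-line argument wins when present; otherwise use the config value.
    static String resolve(String[] args, int argIndex, ConfigReaderSketch cfg, String key) {
        return (args.length > argIndex) ? args[argIndex] : cfg.getParameter(key);
    }
}
```

With this shape, a reader started with no arguments picks everything up from the config, while any argument passed explicitly still overrides it.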

Should we still be checking for record type in ARCRecordReader? Should this be read from the config file? Is there anything else that can be cleaned up in there that we don't need any more?

Posted by rstevens at 12:45 AM | Comments (0)

October 15, 2004

Now (almost) a real search engine

Since we're now able to index real data that's already in the wayback machine, I set it up to link straight to it...and it works. As long as the page is in the wayback machine, the links will work. Feel free to test it out. Other than that...I've done some refactoring and aesthetic work...StringTemplate has proven quite useful...the entire search field is now a template, and I've set up the links for search results with multiple pages...the regular expression I had to use for that was giving me trouble, but I finally got it. Fixed up the passing of the search terms string, so that strings with quotes in them won't get cut off when they're passed from page to page...plus signs are causing issues though, because of the way stuff is automatically encoded and decoded. Other than that though, everything seems to be running smoothly...now to work on some of the real time-based searching features Chris and I have been discussing.
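
The plus-sign trouble is consistent with how form-style URL encoding works: a literal "+" in a query string is decoded as a space, so a plus in the search terms has to travel as %2B. A minimal sketch using the standard java.net classes, assuming the terms are passed as a query-string parameter; the wrapper class and method names are made up for illustration:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical helper, not part of the Chronica source.
class QueryParamSketch {

    // Encodes search terms for use in a link: '+' becomes %2B, '"' becomes %22,
    // and spaces become '+'.
    static String encode(String terms) {
        try {
            return URLEncoder.encode(terms, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // Decodes a raw query-string value: '+' becomes a space.
    static String decode(String raw) {
        try {
            return URLDecoder.decode(raw, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

If the terms are written into a "next page" link without encoding, the decode step on the next request silently turns "+java" into " java". Encoding once every time a link is built, and decoding exactly once per request, keeps both quotes and plus signs intact.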

Posted by defendio at 12:09 PM | Comments (0)

October 12, 2004

Inwardlinks code

Well, I can't find the CD with the backup of my inward links code... crap. Anyway, it's not as bad as it sounds because I was probably going to rewrite the whole thing anyway. The code had a lot of memory holes in it and would probably have made the indexing a lot slower. My code is basically going to be a plugin to Rudd's that'll take the contents of the page (String), create a page object with some index, walk through the string looking for a hrefs, and then push them into the hashtable of links.
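
The plan above could look something like the following. This is only a sketch under assumptions: the class and method names are invented, pages are identified by an int index, and the links are pulled out with a simple regular expression that handles quoted href values but is not a full HTML parser.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of an inward-links table, not the original lost code.
class InwardLinksSketch {
    // Matches <a ... href="..."> or <a ... href='...'>, case-insensitively.
    private static final Pattern HREF = Pattern.compile(
            "<a\\s[^>]*?href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    // Target URL -> indexes of the pages that link to it.
    private final Map<String, Set<Integer>> inward = new HashMap<>();

    // Walks the page contents and records each link target against this page.
    void addPage(int pageIndex, String html) {
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            inward.computeIfAbsent(m.group(1), k -> new TreeSet<>()).add(pageIndex);
        }
    }

    // Pages known to link to the given URL (empty if none were seen).
    Set<Integer> linkersTo(String url) {
        return inward.getOrDefault(url, Collections.<Integer>emptySet());
    }
}
```

Storing only page indexes in the table, rather than page contents, keeps the memory cost per link small.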

Posted by jendo at 12:29 AM | Comments (0)

October 06, 2004

An e-mail regarding the chronica source

From an email I sent earlier today:

Fellas,

http://chronica.cs.usfca.edu/chronica/src/

that address will point you to the src that the web ui is currently running off of... it's essentially our 'stable' build. There are some JSP pages in another directory that take care of a lot of the HTML end; let me know if you'd like those accessible too and I'll make them available.

The bulk of the code is in three files...

ARCRecordReader is where all the arc file io is done....
ChronicaInterface is where all the indexing and searching happens
WebInterface is where the ui hooks in to make all this craziness work.

ARCRecordReader isn't really used on the ui end, except in our current case of viewing the content of a file, which will soon switch from our viewing to the wayback machine.


-Chris Fraschetti

Posted by cfrasche at 02:43 PM | Comments (0)

October 02, 2004

Some interesting stats on parsing...

I ran one of our larger arc files with 215 pdf and 1287 html formatted docs...
I timed each parse and calculated the average parse time for each ..

PDF: 3.89 seconds
HTML: 0.019 seconds

Next I found a text-heavy PDF and, from a combination of many
pages and web search results, built a very text- and tag-intensive
HTML doc....the HTML doc was 5.81 MB and the PDF was 4.57 MB
HTML: 2.46 seconds
PDF: 3 minutes, 24 seconds
... excluding I/O time...

so yeah, definitely going to need some threading here... :)
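
One way the threading could go is a fixed pool of worker threads, so a single slow PDF doesn't hold up all the quick HTML parses. A minimal sketch with java.util.concurrent; the class name is invented, and the parser is passed in as a plain function because the real parsing code isn't shown here:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical sketch, not part of the Chronica source.
class ParallelParseSketch {

    // Parses every document on a fixed-size thread pool and returns the
    // results in the same order as the input list.
    static List<String> parseAll(List<String> docs, Function<String, String> parser, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String doc : docs) {
                Callable<String> task = () -> parser.apply(doc);
                futures.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) {
                results.add(f.get()); // blocks until that document is done
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Threads only hide the slow parses behind the fast ones; the total CPU time spent on PDFs stays the same, so the pool size is best matched to the number of cores available.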

Posted by cfrasche at 06:47 PM | Comments (0)

October 01, 2004

First Test of Large Index a Success!

After enough passes at indexing the large set of arc files we've gathered, cleaning up bugs and errors in the code and files along the way, I was finally comfortable letting the index finish and trying a search on it.

There was a small hiccup. I had originally indexed the date field in the seconds-since-epoch format, which on a small index worked fine... BUT... a large index exposed a limitation in Lucene, but no worries.. I took care of it.

Apparently adding +date_field[SOMEDATE TO ANOTHERDATE]

to the query expands into a sequence of boolean clauses... which, at the scale needed for comparing the epoch timestamps, exceeded the default limit of 1024 clauses per search. I cranked that up to Integer.MAX_VALUE just to test the search... but .... that just gave me Java heap issues...

The reason for keeping the epoch timestamp was to keep the granularity the arc files allowed, but for a date-based search, the hour of the day is pretty useless, so I switched all the code over to the yyyyMMdd format... which required me to completely reindex... but alas... success.. I can search some 71,000+ 'docs' in no time at all.
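
The conversion itself is small. A sketch using java.time, assuming the timestamps are seconds since the epoch and that days are cut in UTC (the post doesn't say which time zone was used); the class and method names are made up for illustration:

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of the Chronica source.
class DateFieldSketch {
    // UTC is an assumption; any fixed zone works as long as indexing and
    // searching agree on it.
    private static final DateTimeFormatter DAY =
            DateTimeFormatter.ofPattern("yyyyMMdd").withZone(ZoneOffset.UTC);

    // Collapses a seconds-since-epoch timestamp to its yyyyMMdd day string.
    static String toDayTerm(long epochSeconds) {
        return DAY.format(Instant.ofEpochSecond(epochSeconds));
    }
}
```

Every second within a day now maps to the same value, so a one-month date range can touch at most 31 distinct values instead of up to one per second, which stays far below the 1024-clause default. The strings also sort in chronological order, which is what a [SOMEDATE TO ANOTHERDATE] range relies on.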


On a side note.. indexing approx 5.5 GB took a little over 3 hours ... a nice chunk of time was tacked on when I added PDF parsing... (mainly because a lot of our crawls include MANY MANY professors' lectures in PDF format) ... perhaps once we get larger and more diverse crawls, I can get better statistics on our indexing.

Posted by cfrasche at 01:42 AM | Comments (0)