October 01, 2004

First Test of Large Index a Success!

So I finally got through enough iterations of indexing the large set of arc files we've gathered, and of cleaning up bugs and errors in the code/files, that I was comfortable letting the index finish and trying some searches on it.

There was a small hiccup. I had originally indexed the date field in seconds-since-epoch format, which worked fine on a small index... BUT... a large index ran into a limitation in Lucene. No worries, though.. I took care of it.

Apparently adding +date_field:[SOMEDATE TO ANOTHERDATE]

to the query gets evaluated as a sequence of boolean clauses, one per distinct term in the range... which, at the scale needed for comparing the epoch timestamps, exceeded the default limit of 1024 clauses for a search. I cranked that up to Integer.MAX_VALUE just to test the search... but that just gave me Java heap issues...
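To make the clause blow-up concrete, here is a rough sketch in plain Java (no Lucene calls; this is not the actual indexing code). It assumes a range query costs one clause per distinct indexed term that falls inside the range. The class name, the method name, and the one-capture-every-30-seconds sample data are all made up for illustration.

```java
import java.util.TreeSet;

class RangeClauseEstimate {
    // Default clause limit mentioned in the post.
    static final int DEFAULT_MAX_CLAUSES = 1024;

    // Number of distinct indexed terms in [lower, upper], inclusive.
    // Assumption: the range query expands to one boolean clause per such term.
    public static int clausesForRange(TreeSet<String> indexedTerms, String lower, String upper) {
        return indexedTerms.subSet(lower, true, upper, true).size();
    }

    public static void main(String[] args) {
        TreeSet<String> secondTerms = new TreeSet<>();
        TreeSet<String> dayTerms = new TreeSet<>();
        long start = 1096588800L; // 2004-10-01 00:00:00 UTC

        // Hypothetical crawl: one document captured every 30 seconds for one day.
        for (long t = start; t < start + 86400; t += 30) {
            secondTerms.add(Long.toString(t)); // epoch-seconds term, always 10 digits here
            dayTerms.add("20041001");          // the same captures keyed by day
        }

        int bySecond = clausesForRange(secondTerms, "1096588800", "1096675199");
        int byDay = clausesForRange(dayTerms, "20041001", "20041001");

        System.out.println("epoch-second terms in range: " + bySecond
                + (bySecond > DEFAULT_MAX_CLAUSES ? " (over the limit)" : ""));
        System.out.println("yyyyMMdd terms in range: " + byDay);
    }
}
```

With these made-up numbers, a single day of captures already produces 2880 distinct second-resolution terms, while the same day collapses to one term at day resolution.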

The reason for keeping the epoch timestamp was to preserve the granularity the arc files allowed, but for a date-based search the hour of the day is pretty useless, so I switched all the code over to the yyyyMMdd format... which required me to completely reindex... but at last... success.. I can search across some 71,000+ 'docs' in no time at all.
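For anyone wanting to do the same conversion, a minimal sketch of turning an epoch-seconds timestamp into a yyyyMMdd key might look like the following. It uses java.time from current JDKs rather than whatever the indexer used at the time, and it assumes the timestamps should be read as UTC; the class and method names are hypothetical.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

class DayKey {
    // Assumption: timestamps are interpreted in UTC.
    private static final DateTimeFormatter DAY =
            DateTimeFormatter.ofPattern("yyyyMMdd").withZone(ZoneOffset.UTC);

    // Drops the time of day and keeps only the calendar date as a sortable string.
    public static String toDayKey(long epochSeconds) {
        return DAY.format(Instant.ofEpochSecond(epochSeconds));
    }

    public static void main(String[] args) {
        System.out.println(toDayKey(1096588800L)); // 20041001
    }
}
```

Because the key is fixed-width with the year first, plain string ordering matches date ordering, which is what a term-based range search relies on.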


On a side note.. indexing approx. 5.5 GB took a little over 3 hours... a nice chunk of that time was tacked on when I added PDF parsing (mainly because a lot of our crawls include MANY, MANY professors' lectures in PDF format)... perhaps once we get larger and more diverse crawls, I can get better statistics on our indexing.

Posted by cfrasche at October 1, 2004 01:42 AM