October 18, 2004

chroninca index update

The current searchable index contains 2,194,824 documents. That's the combination of the UK docs from the IA and the local crawls done here at USF.

Indexing took 57 hours in total, covering 1,473 ARC files (104GB of data) and producing a local Lucene index of 1.5GB.

104GB to 1.5GB ... not bad I would say.
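
For anyone who wants the numbers spelled out, here is a quick back-of-the-envelope sketch in Python. It only uses the figures quoted above; the function name and the dictionary keys are made up for illustration and are not part of our indexing code. It also assumes the 57 hours covers all 1,473 files.

```python
def index_stats(hours, arc_files, input_gb, index_gb):
    """Derive rough throughput and size figures from an indexing run.

    All names here are hypothetical; the inputs are the numbers
    reported in this post.
    """
    return {
        # how many times smaller the index is than the raw input
        "reduction_ratio": input_gb / index_gb,
        # index size as a percentage of the raw input
        "index_percent": index_gb / input_gb * 100,
        # average ARC files processed per hour
        "files_per_hour": arc_files / hours,
        # average raw data processed per hour, in GB
        "gb_per_hour": input_gb / hours,
    }


stats = index_stats(hours=57, arc_files=1473, input_gb=104, index_gb=1.5)
for name, value in stats.items():
    print(f"{name}: {value:.1f}")
```

That works out to an index roughly 69 times smaller than the input, at a bit under 2GB of raw data per hour.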

Due to some serious performance issues, this index was built without PDF parsing enabled. Even with it turned on, I can't imagine the overall index size would be much bigger than it currently is, due to the fact that we only use the raw text.


We're still working steadily, so stay tuned for more updates!

Posted by cfrasche at October 18, 2004 06:05 PM