about
RCrawler is a web crawler and indexer bundle designed
for internal website searching. It was originally developed
as part of a research project called RiVeR (Reliable
Virtual Resources), where it started as a single-threaded
web crawler and indexer.
Both the crawler and the indexer can be distributed across multiple virtual or physical machines, which is expected to improve speed. All three components were written in Python and are currently single-threaded, with reasonable performance.
Strider, the crawler portion of the RCrawler bundle, was written by me, while Castor, the indexer portion, was written by Tony Ngo.
Strider features
- Keep-alive support: For better performance through the reuse of connections
- Full robots.txt support: To reduce drain of server bandwidth
- Full support of HTTP redirects (301 and 302)
- Fully configurable: Using a config file, Strider lets the user limit crawls to specific domains, file types, and file sizes. A list of exclude words may also be provided to limit the URLs that Strider will visit.
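
As a rough illustration of the configurable limits listed above, here is a minimal Python sketch of how a crawler might decide whether to visit a URL. It is not Strider's actual code or config format: the `CrawlLimits` class, its field names, and the `should_visit` function are all hypothetical, and the rule that URLs without a file extension are treated as ordinary pages is an assumption.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse
import posixpath


@dataclass
class CrawlLimits:
    """Hypothetical stand-in for the limits a crawler config file might define."""
    allowed_domains: set = field(default_factory=set)    # e.g. {"example.org"}
    allowed_filetypes: set = field(default_factory=set)  # e.g. {".html", ".pdf"}
    max_filesize: int = 1_000_000                        # bytes
    exclude_words: list = field(default_factory=list)    # e.g. ["logout"]


def should_visit(url, limits, content_length=None):
    """Return True if the URL passes every configured limit."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()

    # Domain limit: accept an exact match or any subdomain of an allowed domain.
    if limits.allowed_domains and not any(
        host == d or host.endswith("." + d) for d in limits.allowed_domains
    ):
        return False

    # File type limit: compare the path's extension. Paths with no extension
    # are assumed to be ordinary pages and are let through.
    ext = posixpath.splitext(parsed.path)[1].lower()
    if ext and limits.allowed_filetypes and ext not in limits.allowed_filetypes:
        return False

    # File size limit: only checked when the size is already known,
    # for example from a Content-Length header.
    if content_length is not None and content_length > limits.max_filesize:
        return False

    # Exclude words: skip any URL containing one of the words (case-insensitive).
    lowered = url.lower()
    return not any(word.lower() in lowered for word in limits.exclude_words)
```

A crawler loop could call `should_visit(url, limits)` before queueing each discovered link, and call it again with `content_length` once the response headers arrive, so that oversized files are dropped before the body is downloaded.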
screenshots
- Click here for screenshots