I’ve started looking into search engines for indexing large amounts of data. Although the primary reason for this is to start organizing my personal data collection, it will also come in extremely handy on the file servers at work. I’ve started to reach critical mass when it comes to the amount of random garbage I have floating around on miscellaneous file servers and I now need a much better indexing scheme.
Excite was the first search engine I ever set up and integrated with a Web site. The first versions worked quite well, but as they improved the interface it stopped working on newer versions of Linux, to the point where it wasn't even installable. This was 1996 or 1997 and Slackware was the distribution. I just mention it now for nostalgic value.
ht://Dig works quite well for smaller sites but we've had some issues with it at my employer. We're indexing tens of thousands of documents and the software seems to get stupid when counting that many items: it commonly displays the wrong number of pages, the wrong number of hits per page, or duplicate hits spread out over multiple pages. Also, it's an HTTP indexer that tends to do the wrong thing with dynamic content (WebGUI and Anthill), and the nightly index run is putting an increasing load on our servers, so we need to move over to a local indexer.
Namazu is a wet dream when it comes to ease of use and getting up and running quickly. Working as a local indexer, it recognizes dozens of MIME types and parses accordingly. If I were forced to make a choice right now, Namazu would get installed and I’d call it a day. It’s fast, it’s lightweight, it’s simple and it supports storing/searching multiple indexes and virtual hosts via directory aliasing (local directory to virtual host URL mapping).
The indexer can be run out of cron and it only updates files that have changed since the last run. Checkpoints are also available, so an aborted run can be picked up later. My only beef is that the displayed results are unintelligent: it shows the first paragraph of a hit instead of jumping to the sentence containing your search term (this may be a configuration option that I haven't found yet).
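The incremental-update idea is simple enough to sketch. This is not Namazu's actual implementation, just an illustration of the general technique: compare each file's modification time against the timestamp of the last indexing run and only reprocess the newer ones. The function name and arguments here are my own invention.

```python
import os

def files_needing_reindex(root, last_run):
    """Return paths under root modified since the last indexing run.

    This sketches the general incremental-update approach (mtime
    comparison); Namazu's own bookkeeping may differ.
    """
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Only files touched after the last run need re-indexing.
            if os.path.getmtime(path) > last_run:
                changed.append(path)
    return changed
```

Run nightly from cron with the previous run's timestamp stashed somewhere, and the indexer only ever chews on what actually changed.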
I just started looking into this package and, at first glance, it appears to be quite robust. The immediate thing that stands out is that it uses MySQL as its backend instead of flat files. It supports local, HTTP and SQL indexing, so it can be used to search multiple resources in multiple formats. The SQL feature is quite cool and could become a deciding factor if I decide to archive all my data in MySQL tables.
Time to do some more reading…