Large-scale indexed and searchable personal archives?

Like most geeks, I’m a complete packrat when it comes to collecting copious amounts of seemingly useless information. Years of saved email, text files, Web pages, ZIPs, PDFs, Word documents, etc. Underneath the surface of that raw data is the occasional gold nugget of priceless information that, although not yet of value, may some day shed light on a future conundrum. It is for this reason that I have been religiously archiving all raw inbound and outbound data processed by my collection of machines since the early nineties. And it is this massive collection of information that has saved my butt on more than one occasion as I pulled an answer to a problem out of thin air because I remembered that someone emailed me about a similar issue four and a half years ago.

My collection started out as a few megabytes and now spans a hundred or so CD-Rs and the majority of an 80 GB hard drive. As more files are collected, the archive grows and becomes even more unwieldy. Unfortunately, this has turned the archival process into more of a frustration than anything else. Find and grep only go so far, and a misplaced answer is worse than no answer at all. I won’t even touch on the subject of how many times I’ve searched for a piece of information that I know I have but can’t find due to my lack of a proper index and search capability.

So, dear friends, my question is this: how do you store your data?

I started doing research on the topic a few weeks ago and have found that there are very few operational open source solutions available for personal archiving, indexing and searching. And I’m not just talking about dumping a bunch of files into a Web tree and smacking it with ht://Dig. I need a full cross-referenced database of every single piece of information that one can collect. Text, HTML, images, audio, binaries, etc. If there’s no indexable text, there needs to be a keywords field to label the binaries. The front-end doesn’t matter…as long as a piece of information can be found once it goes in.
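As a rough illustration of that requirement, here is a minimal sketch in Python using the standard-library sqlite3 module. It is not an existing tool and not a finished design: the table layout and the names `open_index`, `add_item` and `search` are all hypothetical. The idea is only that every item gets a row, text items are indexed by their words, and items with no indexable text are indexed by a hand-assigned keywords field instead.

```python
import re
import sqlite3

TOKEN = re.compile(r"[a-z0-9]+")

def tokenize(text):
    """Lowercase the text and split it into a set of alphanumeric terms."""
    return set(TOKEN.findall(text.lower()))

def open_index(db_path=":memory:"):
    """Open (or create) the index database. The schema is an assumption."""
    conn = sqlite3.connect(db_path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS items (
            id       INTEGER PRIMARY KEY,
            path     TEXT UNIQUE NOT NULL,
            kind     TEXT NOT NULL,            -- e.g. 'text', 'image', 'binary'
            keywords TEXT NOT NULL DEFAULT ''  -- manual labels for binaries
        );
        CREATE TABLE IF NOT EXISTS terms (
            term    TEXT NOT NULL,
            item_id INTEGER NOT NULL REFERENCES items(id),
            PRIMARY KEY (term, item_id)
        );
    """)
    return conn

def add_item(conn, path, kind, text="", keywords=()):
    """Register one archived item and index its text and/or keywords."""
    cur = conn.execute(
        "INSERT INTO items (path, kind, keywords) VALUES (?, ?, ?)",
        (path, kind, " ".join(keywords)),
    )
    item_id = cur.lastrowid
    terms = tokenize(text) | tokenize(" ".join(keywords))
    conn.executemany(
        "INSERT OR IGNORE INTO terms (term, item_id) VALUES (?, ?)",
        [(term, item_id) for term in terms],
    )
    conn.commit()
    return item_id

def search(conn, query):
    """Return the paths of items that contain every term in the query."""
    terms = sorted(tokenize(query))
    if not terms:
        return []
    placeholders = ",".join("?" for _ in terms)
    rows = conn.execute(
        f"""SELECT items.path FROM terms
            JOIN items ON items.id = terms.item_id
            WHERE terms.term IN ({placeholders})
            GROUP BY items.id
            HAVING COUNT(DISTINCT terms.term) = ?
            ORDER BY items.path""",
        (*terms, len(terms)),
    ).fetchall()
    return [row[0] for row in rows]
```

A text file would be added with something like `add_item(conn, "mail/1998/raid.txt", "text", text=body)`, and a photo with `add_item(conn, "photos/rack.jpg", "image", keywords=["server", "rack"])`. Both then turn up through the same `search` call. Whether a scheme this simple holds up at hundreds of thousands of documents is exactly the open question.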

If no obvious solutions creep into the foreground, I’ll just keep everything archived in a directory hierarchy and hit the entire tree with Namazu. Email will be pre-processed with archmbox and MHonArc before getting hit with the search engine. This will make things easier to find, but it will offer none of the advanced capabilities that commercial content management systems provide.

If anyone has any new solutions, please pass them along. If anyone has any specific experience with Namazu or MHonArc indexing of hundreds of thousands of documents, that’d be quite helpful as well.