Ever had that feeling that you know you have something…but just can’t find it! – Crawl, index, search

Well, this is what I get all the time on computers holding enormous amounts of data: work stuff, PDF libraries, drawings, pictures, you name it. I am fairly careful about organising things and fairly consistent about backing up, but there comes a point where you need some form of effective search. I am not talking about the search crap that comes with any OS; those take days to get through PDFs and documents. Vista’s search has not convinced me yet either! So I went looking for decent open source, or at least free, searching and indexing software. I am not fussed whether it runs on a desktop or off a server through a Java applet or some other means.

I was looking for something relatively simple to install and configure. I don’t mind tweaking things and spending a bit of time to get them working, but taking days and days to install and sort out problems is not good enough! Ideally I wanted something that could search through MS format files, OpenOffice format files, PDFs, generic text files and CAD files. It also had to return results based on file name and path. Not really much to ask. It is worth noting that g$$gle provides some SOHO options for this kind of thing, in particular the g$$gle Mini. This little box would do most of what I need, but it is limited to 200,000 docs and costs shitloads of money. I probably would not reach 200,000 docs, but the US$6k is a killer… so off to the alternatives I went 🙂
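
As a rough illustration of the file name and path part of that wish list, here is a minimal Python sketch. It is not taken from any of the tools below. The extension list and the function names crawl and search_paths are made up for the example, and it only matches on names and paths; it does not look inside the files.

```python
import os

# Hypothetical extension list (an assumption for this example);
# adjust it to whatever formats you care about.
WANTED = {".pdf", ".doc", ".xls", ".odt", ".ods", ".txt", ".dwg", ".dxf"}


def crawl(root, extensions=WANTED):
    """Walk root and return the full path of every file with a wanted extension."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Compare extensions case-insensitively so "REPORT.PDF" is kept too.
            if os.path.splitext(name)[1].lower() in extensions:
                found.append(os.path.join(dirpath, name))
    return found


def search_paths(paths, query):
    """Return the paths that contain every word of the query, ignoring case."""
    words = query.lower().split()
    return [p for p in paths if all(w in p.lower() for w in words)]
```

Something like search_paths(crawl("/some/share"), "pump manual") would then list every wanted file whose path mentions both words. A linear scan like this is fine for file names, but it is exactly what becomes too slow once you want to search file contents, which is where a real index comes in.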

After some searching and downloading I came across several workable solutions, mostly based on Java. They are listed in order of my preference, based on my needs and level of expertise!

IBM Omnifind Yahoo Edition – http://omnifind.ibm.yahoo.net/

This was by far the easiest and the most advanced (in terms of development), and it gave the best results of all the software I tested and looked at. The only issue when installing was a missing Java RHEL compatibility package; once this was yummed onto my test server the install went very smoothly.

The software has a web interface for configuring and searching, served on its own port by its own Java web server, Jetty, I think. The download package includes its own Java runtime environment, which alleviates the pain of trying to get the right version or, for that matter, a working version of Java.

The crawling process is pretty resource hungry but seems very quick for what it is doing. The results are even more surprising: lots of them, and fairly relevant ones at that. Out of all the software I tried, this picked up the most file types and searched the most files. Sometimes the crawler does not index every file, but that is something I am working on. I currently have it indexing over 200,000 files and the resulting index is only 4-5 GB, and that’s without caching the files…

The catch with this software? Well, it is not entirely open source. It uses the Lucene package but also incorporates some fairly heavy stuff from Yahoo and IBM. They have also stated that they do not plan to make this particular version a paid one. They have an enterprise version for more than 500,000 files. They seem to be trying to get a foot in the door of the search world by providing a free version to entice people and companies in. Probably not such a bad idea; g$$gle really needs some proper competition.

Regain – http://regain.sourceforge.net/

I had a good go at this one and tried both the desktop and the server versions. It is relatively easy to get going, although the configuration files can be a little confusing. It can search through several different file formats, but I think you need to tweak your config files to get the best results. The crawler/indexer seemed to take forever and even locked up on very large repositories of complex PDFs. This one gets number two on my list. The desktop search is pretty neat for something that can be installed in a matter of minutes! Quoted: “regain is a search engine similar to web search engines like Google, with the difference that you don’t search the web, but your own files and documents. Using regain you can search through large portions of data (several gigabytes!) in split seconds!”

Terrier – http://ir.dcs.gla.ac.uk/terrier/

I did not spend a huge amount of time on this one. It was hard to get working and seems to be at an early stage of development? Quoted: “Terrier is a highly flexible, efficient, effective, and robust search engine, readily deployable on large-scale collections of documents. Terrier implements state-of-the-art indexing and retrieval functionalities. Terrier provides an ideal platform for the rapid development of large-scale retrieval applications”

Egothor – http://www.egothor.org/

I did not try it, but it seems to be fairly highly regarded.

Lucene – http://lucene.apache.org/java/docs/index.html

Basically, most of the search tools I looked at use this software package. Quoted from their website: “Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform”
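
To give a feel for what a full-text search library does at its core, here is a toy inverted index in Python. To be clear, this is not Lucene’s API and has nothing like its features (no ranking, no file format parsers, nothing stored on disk); the names tokenize, build_index and search are invented for the sketch, and the documents are assumed to be plain text already.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """docs maps a document name to its text. Returns word -> set of names."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in tokenize(text):
            index[word].add(name)
    return index


def search(index, query):
    """Return the names of documents that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    # Start from the documents holding the first word, then narrow down.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result
```

The point of the structure is that the slow part (reading every file) happens once, at indexing time. A query afterwards only looks up a few words and intersects the sets, which is why indexed search can answer in a split second where a plain file scan takes ages.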

avernus
