Local Search for the Masses
I have been rambling on about local search for several months, recently focusing on a local Java web server. Today, I reinvented the wheel for the second time, so here is some documentation. What follows is a “HOWTO Build Your Own Local Search Engine”.
I would be happy to patiently help anyone willing to embark on this recipe.
You will need the Java SDK (JDK) and JRE, the Jetty webserver, Apache Lucene Nutch, something worth searching, and about two hours (though most of the ‘baking’ time can be spent reading a book).
Jetty
Get Jetty 4.2 (6-11 MB; 4.2.24 or later, but not 5.x).
Jetty 5.1 requires JDK 1.4 (not predictably available for the Mac), and I have not been able to get it to work with Nutch. I am considering other web servers (such as Resin), but for now Jetty 4.2 is excellent.
Nutch
Get Nutch 0.7 (44 MB).
(I’ve downloaded nutch-0.7.tar.gz, and know it works on Windows ME, Windows XP, Mac OS X 10.3, and many Linux distributions.)
Java
We will need both the JDK and JRE. For distribution purposes, we should target 1.3, but for your own local use, the higher the better (currently Java 1.5, also called Java 5). From the command prompt:
javac -version
Should say something like:
javac 1.5.0 or javac 1.4
javac: no source files
[...etc...]
(the output looks different on the Mac)
If you do not have Java installed, you’ll need to pick it up from Sun (J2SE 5.0 mucho MB).
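The version check above can be wrapped in a small script. This is only a sketch: `java_minor_version` is a helper name I made up, and it assumes the version string looks like the `javac 1.5.0` example above.

```shell
#!/bin/sh
# Extract "major.minor" (e.g. 1.5) from a version string such as "javac 1.5.0".
# java_minor_version is a hypothetical helper name, not part of the JDK.
java_minor_version() {
  printf '%s\n' "$1" |
    sed -n 's/^[^0-9]*\([0-9][0-9]*\.[0-9][0-9]*\).*$/\1/p' |
    head -n 1
}

# Only probe javac if it is actually on the PATH.
if command -v javac >/dev/null 2>&1; then
  # Some compilers print the version on stderr, so capture both streams.
  found=$(java_minor_version "$(javac -version 2>&1)")
  echo "javac reports version: ${found:-unknown}"
else
  echo "javac not found: install the JDK first"
fi
```

If the script prints "unknown", the output format probably differs on your platform, and reading the raw `javac -version` output by eye is the safer check.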
Corpus
In this case, I’ve downloaded the latest (as of mid-September 2005) ATI Bulk (30 MB).
Extract
- Ensure compatible/latest Java JDK and JRE. Extract Jetty, Nutch, and the corpus.
- Move the corpus to jetty-x/webapps (I’ve renamed ati_website/html to ati and placed ati into webapps).
Start Jetty
java -jar start.jar
Wait a few seconds (through lots of log output) until you see:
XX> Started...
Localhost test
Point your browser to:
http://localhost:8080/CORPUS/
(replace CORPUS with perhaps ati)
http://localhost:8080/ati/
Indexing
Reference: http://lucene.apache.org/nutch/tutorial.html
- Move to the nutch directory.
- Create a file called atiurl.txt with one line (such as):
http://localhost:8080/ati/index.html
- Edit the file conf/crawl-urlfilter.txt and replace MY.DOMAIN.NAME with localhost.
# +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
+^http://localhost:8080/
- Run the indexer:
bin/nutch crawl atiurl.txt -dir ati.crawl -depth 15 >& ati.crawl.log
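The indexing steps can also be scripted. What follows is a rough sketch, not a tested installer: the helper names `write_seed_file` and `allow_localhost` are my own, the script assumes it is run from inside the Nutch directory, and it overwrites the MY.DOMAIN.NAME rule instead of commenting it out as shown above. The crawl command is the same one, with the redirect spelled the plain sh way.

```shell
#!/bin/sh
# Write the one-line seed file that the crawl starts from.
# (write_seed_file and allow_localhost are hypothetical helper names.)
write_seed_file() {
  # $1 = seed file path, $2 = start URL
  printf '%s\n' "$2" > "$1"
}

# Replace the active MY.DOMAIN.NAME accept rule with a localhost rule.
# Comment lines (starting with '#') are left alone.
allow_localhost() {
  # $1 = path to the URL filter file
  sed 's|^+.*MY\.DOMAIN\.NAME.*$|+^http://localhost:8080/|' "$1" > "$1.tmp" &&
    mv "$1.tmp" "$1"
}

# Do the real work only when run from inside the Nutch directory.
if [ -f conf/crawl-urlfilter.txt ]; then
  write_seed_file atiurl.txt http://localhost:8080/ati/index.html
  allow_localhost conf/crawl-urlfilter.txt
  bin/nutch crawl atiurl.txt -dir ati.crawl -depth 15 > ati.crawl.log 2>&1
fi
```

Keeping a backup copy of conf/crawl-urlfilter.txt before running this is a good idea, since the edit is done in place.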
Server Setup
- Copy nutch-0.7.war into jetty-4.2.x/webapps.
- Copy ati.crawl/segments into jetty-4.2.x/.
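The two copy steps can be sketched as one function. `deploy_search` is a made-up name, both directories are passed in as arguments, and the script assumes that nutch-0.7.war sits at the top of the Nutch directory and that the crawl was run from there.

```shell
#!/bin/sh
# deploy_search is a hypothetical helper: copy the Nutch web app and the
# crawl segments into a Jetty install.
deploy_search() {
  # $1 = Nutch directory (assumed to hold nutch-0.7.war and ati.crawl)
  # $2 = Jetty directory (assumed to contain webapps/)
  cp "$1/nutch-0.7.war" "$2/webapps/" &&
    cp -R "$1/ati.crawl/segments" "$2/"
}

# Usage (both paths are placeholders):
#   deploy_search /path/to/nutch /path/to/jetty
```

If a segments directory from an earlier crawl is already present in the Jetty directory, remove it first, since cp -R would otherwise nest the new copy inside the old one.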
Restart Jetty
^C (press Ctrl-C in the window where Jetty is running)
java -jar start.jar
Test deployment
http://localhost:8080/nutch-0.7/search.jsp?query=Buddha
The first time will be quite slow (because on the first hit, we’ll be extracting and compiling the entire deployment). Try another query, and the results should be nearly instantaneous.
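For trying several queries, a tiny helper can build the test URL. This is a sketch under simple assumptions: `search_url` is a name I invented, and it only turns spaces into `+`, so queries with other special characters would need proper URL encoding.

```shell
#!/bin/sh
# search_url is a hypothetical helper: print the search.jsp URL for a query.
# Only spaces are encoded (as '+'); everything else passes through unchanged.
search_url() {
  # $1 = query text
  printf 'http://localhost:8080/nutch-0.7/search.jsp?query=%s\n' \
    "$(printf '%s' "$1" | tr ' ' '+')"
}

# Example: print the URL from the test step above.
search_url "Buddha"
```

Paste the printed URL into the browser, or hand it to whatever command-line fetch tool you have around.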
Done
Future mods
It should be possible to precompile the web application such that the pre-load time will be faster and so that the JDK will not be required for end users (only the JRE).
I’d like to set it up so that a user could download one bundle containing the corpus and index, drop a single directory into a specific location, and resume where they left off (perhaps without even restarting the local web server). Asking the end user to copy this here and that there and then restart might be a tall order for some.
I need to test this on a CD-ROM containing JREs for multiple platforms, on computers lacking everything.