Vicaya Javascript Search
A year ago, I took it upon myself to help make a searchable CDROM of a website. The corpus consists of entirely (or mostly) plain vanilla HTML, in the area of 5000 documents, weighing less than 40 megabytes. Ideally, the search could be performed in any web browser, on any platform, with no installation, running entirely from a CDROM. Thus the Vicaya project was born.
Initially, I wrote the entire thing as a Java applet using the Apache Lucene library. I had limited success communicating between the browser and the application while maintaining reasonable performance. As I was doing no better than existing solutions, we started examaning a stand-alone Java application called DocSearcher.
Then I was introduced to KSearch Client. I had to hack it a bit to get it to work, but it definitely grabbed my attention. The product doesn’t scale well as the size of the corpus increases, and the code was rough around the edges. While KSearch was inspiring, if a javascript solution is to work, it will require a complete re-write.
First, the indexer will need to rip out common words (a, the, and) (stop words). Second, the indexer will need to quickly render, renders, rendering, rendered, render words via stemmer such as the Porter Algorithm. Third, store the terms in a modified associative collision array, recording term-document-frequency tuples (with at least the term as key) and a docid-url index.
The searcher will parse the query, stripping the same stop words, filter via the stemmer, and use some fast search algorithm such as vector spaces or a modified disjunctive search. My version so far is as follows:
for each query term (sigma): doc-term-weight =
term frequency * log(number of documents/documents containing term)
Though I’m open to suggestions! For example, how do I normalize the weight (Norm(d))?
I’m glossing over a few points: applying some boolean filtering (AND, OR, NOT). I wonder if I could fudge some of the steps a bit before the sigma. For example, suppose two query terms result in 500 documents each. I wonder if it makes much difference if I consider only the first 50 each. Anyway…
I am considering writing this as XUL… or not. Another consideration is the hit page generation, which will likely need to read a set of pages to generate a summary. Fortunately, I’ve already written this in Java Vicaya.