In the previous article, we designed a web server protocol for searching and
updating small chunks of information, called Brain Entries, that are stored in
BrainFeeds. The sample client is a JSP program that displays the entries in a
web browser. Now it would be nice to have a really good thick client that would
let us do real-time searches, local data caching, and properly render the
entries in the client itself instead of in a web browser.
In this article, we are going to build a desktop application to read and
post to BrainFeeds. Since it's a real application, we will also be able to cache
the feeds and do incremental updates to disk. This lets us do fast real-time
searching through the local cache. Also, since we won't have access to a browser
anymore, we will customize the HTMLEditorKit to render each entry as
HTML directly in our application.
The Base Program
Like most desktop applications, we will start with a simple base. Our
application has one frame with three sections (as seen in Figure 1). The top is
a search box, the middle displays the results in a list, and the bottom renders
the selected entry as HTML. The bottom two buttons are for editing the currently
selected entry and for adding a new one. You can download the source code for this application here: brainfeed.zip.
Figure 1. The base application
Real-Time Incremental Searching
The first feature we'll add to make our application really nice to use is real-time
incremental searching. This is a method of searching most prominently featured
in iTunes, though you can also find
it in text editors (like the venerable XEmacs),
file managers, and even the combo boxes of some applications. The two key
points of real-time incremental searching are that the search is run over again
on each keystroke, and that the user can search for substrings. This means that
a search for "ten" would match "ten," "tent," and "forgotten." These two
techniques combine to create a great user experience, but at the cost of
processor time and disk space for an index. Fortunately, we live in the age of
cheap and powerful computers that waste most of their resources waiting in a
loop for a mouse click. Incremental searching can be slow, but for the datasets we
will be dealing with (say, less than 20MB of pure text), on modern
computers it should be nearly instantaneous.
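To make the idea concrete, here is a minimal sketch of the technique without any search engine involved. It assumes the search box is backed by a Swing text document and that the entries are just a list of titles held in memory. The class name IncrementalFilterSketch and its methods are hypothetical and are not part of the BrainFeed source.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.swing.event.DocumentEvent;
import javax.swing.event.DocumentListener;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.PlainDocument;

// Hypothetical sketch: re-filters an in-memory list of titles every time
// the text document behind a search box changes.
class IncrementalFilterSketch implements DocumentListener {
    private final List<String> titles;
    private List<String> matches = new ArrayList<String>();

    IncrementalFilterSketch(List<String> titles) {
        this.titles = titles;
    }

    // Case-insensitive substring match against every title.
    static List<String> filter(List<String> titles, String query) {
        List<String> result = new ArrayList<String>();
        String needle = query.toLowerCase();
        for (String title : titles) {
            if (title.toLowerCase().contains(needle)) {
                result.add(title);
            }
        }
        return result;
    }

    List<String> getMatches() {
        return matches;
    }

    // Typing or deleting a character reruns the whole search.
    public void insertUpdate(DocumentEvent e) { rerun(e.getDocument()); }
    public void removeUpdate(DocumentEvent e) { rerun(e.getDocument()); }
    public void changedUpdate(DocumentEvent e) { /* attribute changes only */ }

    private void rerun(Document doc) {
        try {
            matches = filter(titles, doc.getText(0, doc.getLength()));
        } catch (BadLocationException ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) throws BadLocationException {
        IncrementalFilterSketch sketch = new IncrementalFilterSketch(
                Arrays.asList("ten", "tent", "forgotten", "java"));
        PlainDocument doc = new PlainDocument();
        doc.addDocumentListener(sketch);
        // Simulate typing "t", "e", "n" one keystroke at a time.
        for (char c : "ten".toCharArray()) {
            doc.insertString(doc.getLength(), String.valueOf(c), null);
            System.out.println(doc.getText(0, doc.getLength())
                    + " -> " + sketch.getMatches());
        }
    }
}
```

A linear scan like this is fine for a handful of titles, but it rereads every entry on every keystroke, which is why a real index is worth the effort.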
So how do we do it? First we need a powerful database with support for
wildcard searching. Lucene is a 100% Java, open source
search engine that supports almost everything we need. It was written by the
author of Apple's VTwin search engine, and supports both full-text and wildcard
searching. Now adopted by the Apache Jakarta project, it provides top-notch
searching for any Java application. We just need to hook it up.
Creating an Index with Lucene
First we need to create an index on the client side to store all of our
entries. The index contains all of the words that we can search on, presorted to
make searching faster. It also lets us set some options about how to deal with
spaces, plural words, and other language issues.
File indexDir = new File("braindir");
// the stop analyzer breaks the text on word boundaries,
// converting it all to lower case and stripping out the stop
// words (like "the" and "a")
Analyzer analyzer = new StopAnalyzer();
if(writer == null) {
    try {
        // try to open the existing index.
        // the false means it won't overwrite the old index
        writer = new IndexWriter(indexDir, analyzer, false);
    } catch (IOException ex) {
        // no index exists yet, so create a new one.
        // the true means create (or overwrite) the index
        writer = new IndexWriter(indexDir, analyzer, true);
    }
    writer.close();
}
The code above will create an index in the braindir directory.
The first call to new IndexWriter() will open the index without
creating it. If the call fails because the index doesn't already exist, the catch
block makes the call again with true for the last argument to create a new index.
The Analyzer is a set of rules about how to preprocess the data
before putting it into the database. The StopAnalyzer, one of the
default Analyzers that comes with Lucene, will convert all text to
lowercase and remove stop words. Stop words are short words like "the" and "a"
that convey little or no meaning and are not useful for searching. We can leave
them out to speed up processing and make the search more targeted.
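To see what this kind of preprocessing does to a piece of text, here is a rough stand-alone sketch in plain Java. It is not Lucene's implementation: the class name StopWordSketch is hypothetical, and the stop list is a short illustrative one rather than the list that ships with Lucene.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of stop-word analysis; not Lucene code.
class StopWordSketch {
    // Illustrative stop list only (an assumption, not Lucene's list).
    private static final Set<String> STOP_WORDS = new HashSet<String>(
            Arrays.asList("a", "an", "and", "the", "of", "to"));

    // Breaks text on non-letters, lowercases each word,
    // and drops any word found in the stop list.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String raw : text.split("[^\\p{L}]+")) {
            String token = raw.toLowerCase();
            if (token.length() > 0 && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [art, java, swing, toolkit]
        System.out.println(analyze("The Art of Java and the Swing Toolkit"));
    }
}
```

Whatever rules are applied when the text goes into the index must also be applied to the query, or the two will never match.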
Now that we have an index, we need to put the entries into it. Each entry has
already been parsed into a BrainEntry object (reused from the JSP
version), which has accessors for each field we will need. Lucene stores text in
Document objects, so we will create one Document for
each BrainEntry.
private static void addToIndex(File indexDir,
                               BrainEntry be,
                               boolean create)
                        throws Exception {
    IndexWriter writer = getWriter();
    // create a new document for the brain entry
    Document doc = new Document();
    // pull out all of the fields and put them
    // in the document
    String id = be.getId();
    doc.add(Field.Keyword("id",id));
    doc.add(Field.Keyword("uri",be.getURI()));
    doc.add(Field.Keyword("iduri",id +
        ":"+be.getURI()));
    doc.add(Field.Text("title", be.getTitle()));
    doc.add(Field.UnIndexed("content",
        be.getContentString()));
    // add each keyword
    Iterator it = be.getKeywordList().iterator();
    while(it.hasNext()) {
        String keyword = (String)it.next();
        doc.add(Field.Text("keyword",keyword));
    }
    // add the document and close
    writer.addDocument(doc);
    writer.close();
}
First we add searchable fields to the Document and then we add the
content. Lucene has different types of fields depending on how they should be
included in the index. We want the id and source uri
to be Keyword fields, and the title to be a Text field. A Keyword field
is a string that will be stored and indexed but not tokenized, meaning it won't
be modified in any way. Since we need the id and uri external to the program, we
don't want them to be changed at all. A Text field is also stored and
indexed, but it will also be tokenized, which in our case will make it lowercase
and remove the stop words. All of the fields that we would like our users to
search on will be stored as text. For the content (the body text of the entry),
we don't actually want to index it for searching, since that would make queries
slower. Instead, we just want to use the database as a convenient storage
mechanism, so it gets stuffed into an UnIndexed field. Once our
Document is set up, we add it to the index.
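The three field types differ only in whether the value is stored, indexed, and tokenized. The following sketch summarizes those properties as described above. FieldKindSketch is a hypothetical name used for illustration and is not a Lucene class. The field-name mapping follows the addToIndex code.

```java
// Hypothetical summary of the three field kinds; not part of Lucene.
enum FieldKindSketch {
    KEYWORD(true, true, false),    // stored and indexed exactly as given
    TEXT(true, true, true),        // stored, indexed, and run through the analyzer
    UNINDEXED(true, false, false); // stored only, so it cannot be searched

    final boolean stored;
    final boolean indexed;
    final boolean tokenized;

    FieldKindSketch(boolean stored, boolean indexed, boolean tokenized) {
        this.stored = stored;
        this.indexed = indexed;
        this.tokenized = tokenized;
    }

    // A field can be matched by a query only if it was indexed.
    boolean searchable() {
        return indexed;
    }

    // The kind used for each BrainEntry field in addToIndex.
    static FieldKindSketch forField(String name) {
        if (name.equals("id") || name.equals("uri") || name.equals("iduri")) {
            return KEYWORD;
        }
        if (name.equals("title") || name.equals("keyword")) {
            return TEXT;
        }
        if (name.equals("content")) {
            return UNINDEXED;
        }
        throw new IllegalArgumentException("unknown field: " + name);
    }

    public static void main(String[] args) {
        String[] names = {"id", "uri", "iduri", "title", "keyword", "content"};
        for (String name : names) {
            FieldKindSketch kind = forField(name);
            System.out.println(name + ": " + kind
                    + " searchable=" + kind.searchable()
                    + " tokenized=" + kind.tokenized);
        }
    }
}
```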
Implementing a Real-Time Search
As we saw above, we write to the index with an IndexWriter. To
search through the index, we will use, not surprisingly, an
IndexSearcher. The query itself is derived from the
QueryParser, which takes our query string, the name of the field we
want to search, and the analyzer. We will use the same Analyzer that
we used when we originally put the entry into the index: the StopAnalyzer.
Finally, we execute the search and loop through the results.
private static List luceneSearch(String q,
                                 File indexDir)
                          throws Exception {
    init();
    List list = new ArrayList();
    // create an index searcher
    Directory fsDir =
        FSDirectory.getDirectory(indexDir, false);
    IndexSearcher is = new IndexSearcher(fsDir);
    try {
        // create a new query based on the
        // query string passed in
        Query query =
            QueryParser.parse(q, "keyword",
                new StopAnalyzer());
        // do the search
        Hits hits = is.search(query);
        // rebuild a BrainEntry from each matching document
        for (int i = 0; i < hits.length(); i++) {
            Document doc = hits.doc(i);
            BrainEntry be = new BrainEntry();
            be.setId(doc.get("id"));
            be.setURI(doc.get("uri"));
            be.setTitle(doc.get("title"));
            be.setContentString(doc.get("content"));
            Field[] keywords = doc.getFields("keyword");
            for(int j=0; j<keywords.length; j++) {
                be.addKeyword(keywords[j].stringValue());
            }
            list.add(be);
        }
    } finally {
        // release the index files held by the searcher
        is.close();
    }
    return list;
}
To create an incremental search, we need to modify the query. Lucene doesn't
support complete substring search (where a search for "oo" would return "noon"),
but it does support prefix substrings, meaning a search for "jav" will return
both "java" and "javascript." This is done by appending a wildcard ("*") to each
term. Years of Googling have conditioned people to continue typing words to
narrow down a search, so we will just AND the search terms together into our
final query string.
public List search(String[] terms) throws Exception {
    // return an empty list if the query is empty
    if(terms.length == 0) return new ArrayList();
    StringBuffer query = new StringBuffer();
    // add the first term with a wildcard (*)
    query.append(terms[0]+"*");
    // AND all of the additional terms
    // with *'s after them
    for(int i=1; i<terms.length; i++) {
        query.append(" AND " + terms[i] +"*");
    }
    return luceneSearch(query.toString(),
        this.indexdir);
}
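The search method expects the terms to arrive as an array, so the raw text of the search box has to be split first. The sketch below shows one way that step might look, along with the query string that results. QueryStringSketch, toTerms, and toQuery are hypothetical names. The assumption here is that terms are separated by whitespace and that empty pieces should be dropped so that a bare "*" never reaches the query parser.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: turns raw search-box text into the wildcard
// query string built by the search method above.
class QueryStringSketch {
    // Splits the raw text on whitespace and drops empty pieces.
    static String[] toTerms(String rawText) {
        List<String> terms = new ArrayList<String>();
        for (String piece : rawText.trim().split("\\s+")) {
            if (piece.length() > 0) {
                terms.add(piece);
            }
        }
        return terms.toArray(new String[0]);
    }

    // Appends a wildcard to each term and ANDs the terms together.
    static String toQuery(String[] terms) {
        StringBuilder query = new StringBuilder();
        for (int i = 0; i < terms.length; i++) {
            if (i > 0) {
                query.append(" AND ");
            }
            query.append(terms[i]).append('*');
        }
        return query.toString();
    }

    public static void main(String[] args) {
        // prints jav* AND swing*
        System.out.println(toQuery(toTerms("  jav   swing ")));
    }
}
```

Calling toTerms on every keystroke and passing the result to search is enough to drive the result list as the user types.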
Building a Better Brain, Part 1: The Protocol
Joshua Marinacci wants to build a distributed system for storing, searching, and updating small pieces of information. In this article, he shows how Java-friendly standards like XML and HTTP will make up the foundation of his BrainFeed web application.