Lucene lecture

November 23, 2004 by Doug Cutting

I am presently giving a lecture on Lucene at the University of Pisa. At the end of the presentation I’ll search for this blog entry, hoping to find it on Technorati, which uses Lucene, thus demonstrating live incremental indexing.

Lucene lecture

November 23, 2004 by Doug Cutting

I am presently giving a lecture on Lucene at the University of Pisa. At the end of the presentation I’ll search for this blog entry, hoping to find it on Technorati, which uses Lucene, thus demonstrating live incremental indexing.

Dynamization and Lucene

November 20, 2004 by Doug Cutting

Paolo Ferragina, my host in Pisa, pointed out to me some theoretical underpinnings for Lucene’s indexing method.

Internally, Lucene keeps a stack of segment indexes. Their size increases logarithmically down the stack. New documents are added as tiny indexes pushed onto the top of the stack. When there are b indexes on the top of the stack that are the same size, they are all popped off the stack and merged into a single index which is pushed back onto the stack. Thus the number of segments of each size resembles the digits in the base-b representation of the total number of documents. For example, if b=10 and there are 234 documents indexed, then there would be a four one-document indexes on the top of the stack, followed by three ten-document indexes and finally two one-hundred document indexes.

This keeps the incremental cost of adding a document fairly low, while keeping the number of indexes to be searched fairly small. It’s also still optimal for batch indexing, since adding n tokens still only requires order n*log_b(n) comparisons, as good as any other sorting algorithm.

Apparently this is all theoretically justified by Mark Overmars‘ dynamization work, done around 1980. This paper might be the appropriate reference.

Dynamization and Lucene

November 20, 2004 by Doug Cutting

Paolo Ferragina, my host in Pisa, pointed out to me some theoretical underpinnings for Lucene’s indexing method.

Internally, Lucene keeps a stack of segment indexes. Their size increases logarithmically down the stack. New documents are added as tiny indexes pushed onto the top of the stack. When there are b indexes on the top of the stack that are the same size, they are all popped off the stack and merged into a single index which is pushed back onto the stack. Thus the number of segments of each size resembles the digits in the base-b representation of the total number of documents. For example, if b=10 and there are 234 documents indexed, then there would be a four one-document indexes on the top of the stack, followed by three ten-document indexes and finally two one-hundred document indexes.

This keeps the incremental cost of adding a document fairly low, while keeping the number of indexes to be searched fairly small. It’s also still optimal for batch indexing, since adding n tokens still only requires order n*log_b(n) comparisons, as good as any other sorting algorithm.

Apparently this is all theoretically justified by Mark Overmars‘ dynamization work, done around 1980. This paper might be the appropriate reference.

personal web platform

October 1, 2004 by Doug Cutting

Battelle’s Web 2.0 Conference is next week. The theme is “the web as platform”, the trend of applications moving from stand-alone, with local data, to networked, with shared, remote data. As this happens, the details of one’s local operating system become less relevant. All you’ll need is a web browser, and perhaps a select few other web-based applications. With less stuff required of local applications, this forms a threat to Microsoft’s desktop dominance. My concern is whether another company might replace Microsoft’s desktop monopoly with a “web OS” monopoly.

In this web-based world, I’d like to keep all my personal data remotely, so that I can access it equally well from a Linux workstation, an Apple laptop, a Palm phone and a Windows-based internet-access terminal. Still, I’d like to leverage my local resources. For example, my laptop and handheld should be able to access my data while offline, and my workstation should be able to search it quickly using a local database.

Another big advantage of storing data remotely is that, if my laptop hard drive fails, or I get a new workstation, I don’t have to worry about copying all my stuff. I just log into my personal web and, voila, all my stuff is right there.

As implied above, I should be able to search all my data. Apple’s Spotlight will search my local private data on a Mac, as will Gnome’s Beagle and Longhorn on Windows. But synchronizing my data across these platforms is still a pain, so these don’t really solve the problem.

What’s needed to make this possible? We already have standards for accessing most remote data. Email can be accessed with IMAP, Address books with LDAP, files with WebDAV, chat with Jabber, etc. There are even providers for most of these services, and each has clients for most platforms. So what’s missing? A little glue, I think.

We need a standard way to store bookmarks, history and other personal meta data (like the names of your mail, instant messaging, and WebDav accounts) and a standard way to intercept personal data so that it can be indexed and searched, either locally or remotely. But who will build the glue?

I think this has to be a cross-platform open-source project. It has too much potential as a chokepoint to be entrusted to a commercial party. Inspired by Rohit Khare and Adam Rifkin’s ideas around Fisher, I think much of this could be achieved with a proxy server. It could proxy HTTP, IMAP, POP, LDAP, Jabber, etc., transparently indexing and caching things. One could connect to it as a http://localhost/ web server to search and configure things. Applications can contact it through HTTP to get their configuration data. It can expose web services APIs for all this too, so that native applications can be built for search, etc. If we could, e.g., get Mozilla and other desktop applications to look for the daemon on install, and, when it’s present, configure themselves through it by default, then all that one should need to do on a new machine is tell the daemon where on the web your personal configuration lives, and you’re good to go. With one step, your files, address book, bookmarks, cookies, logins, email configuration, etc. would all be there.

The daemon would mostly be a framework for plugins. For example, search needn’t be hard-wired into it, it should be a plugin. Different vendors might provide different personal search applications. Similarly, a spam-detectors could easily be plugged into the email processing pipeline, etc.

Does this sound plausable? Should we build it?