I am presently giving a lecture on Lucene at the University of Pisa. At the end of the presentation I’ll search for this blog entry, hoping to find it on Technorati, which uses Lucene, thus demonstrating live incremental indexing.
Archive for November, 2004
Lucene lecture
November 23, 2004Lucene lecture
November 23, 2004I am presently giving a lecture on Lucene at the University of Pisa. At the end of the presentation I’ll search for this blog entry, hoping to find it on Technorati, which uses Lucene, thus demonstrating live incremental indexing.
Dynamization and Lucene
November 20, 2004Paolo Ferragina, my host in Pisa, pointed out to me some theoretical underpinnings for Lucene’s indexing method.
Internally, Lucene keeps a stack of segment indexes. Their size increases logarithmically down the stack. New documents are added as tiny indexes pushed onto the top of the stack. When there are b indexes on the top of the stack that are the same size, they are all popped off the stack and merged into a single index which is pushed back onto the stack. Thus the number of segments of each size resembles the digits in the base-b representation of the total number of documents. For example, if b=10 and there are 234 documents indexed, then there would be a four one-document indexes on the top of the stack, followed by three ten-document indexes and finally two one-hundred document indexes.
This keeps the incremental cost of adding a document fairly low, while keeping the number of indexes to be searched fairly small. It’s also still optimal for batch indexing, since adding n tokens still only requires order n*log_b(n) comparisons, as good as any other sorting algorithm.
Apparently this is all theoretically justified by Mark Overmars‘ dynamization work, done around 1980. This paper might be the appropriate reference.
Dynamization and Lucene
November 20, 2004Paolo Ferragina, my host in Pisa, pointed out to me some theoretical underpinnings for Lucene’s indexing method.
Internally, Lucene keeps a stack of segment indexes. Their size increases logarithmically down the stack. New documents are added as tiny indexes pushed onto the top of the stack. When there are b indexes on the top of the stack that are the same size, they are all popped off the stack and merged into a single index which is pushed back onto the stack. Thus the number of segments of each size resembles the digits in the base-b representation of the total number of documents. For example, if b=10 and there are 234 documents indexed, then there would be a four one-document indexes on the top of the stack, followed by three ten-document indexes and finally two one-hundred document indexes.
This keeps the incremental cost of adding a document fairly low, while keeping the number of indexes to be searched fairly small. It’s also still optimal for batch indexing, since adding n tokens still only requires order n*log_b(n) comparisons, as good as any other sorting algorithm.
Apparently this is all theoretically justified by Mark Overmars‘ dynamization work, done around 1980. This paper might be the appropriate reference.
