Archive for May, 2004

Study questions Google’s long-term dominance

May 25, 2004

More evidence of the commodification of search. If search engines all have the same quality of results, then why waste effort competing on quality? Why not collaborate on that, and compete elsewhere?

Wizards of OS, Berlin

May 25, 2004

I’ll be at Wizards of OS 3 in Berlin, 10-12 June. I’m speaking about Nutch on 10 June, but will be in Berlin through the 12th. Anyone care to meet me for beers or dinner?

www 2004

May 22, 2004

Met lots of folks at WWW 2004, including:

  • Torsten Suel, who has done some great work on search optimization that Nutch/Lucene should adopt, among other things;
  • Giuseppe Attardi, who showed me some impressive benchmarks of a fetcher that uses async I/O;
  • Marc Najork, who wrote Mercator, a very extensible crawler that Nutch can learn from;
  • … and lots of other folks whose names I cannot recall.

Someone suggested that Nutch should look at Lustre for our robust, distributed filesystem needs. Does anyone have any experience with Lustre?

Thanks to Rohit Khare for inviting me!

beagle, gcj

May 21, 2004

Nat Friedman reports:

Jon Trowbridge’s Lucene-based indexer, beagle, is really starting to kick ass. With a single dialog you can now search your files, your mail, your addressbook, and Google, and all the results get returned in an aggregated view. We have filters for text, html and OpenOffice files so far; adding PDF and Microsoft Office filters would be a nice project for someone.

Note that most of the Lucene-based personal search thingys are not using a JVM. Several are using the C# port of Lucene (Lookout, beagle), and Seruku translates Java to native code. I find this latter approach pretty interesting. GCJ in GCC 3.4 runs Lucene very quickly, but I don’t know how small a Win32 app you can build with GCJ. Has anyone played with that?

web search is a commodity

May 20, 2004

Word has it that huge advances in web search are just around the corner, and that the big web search engines have lots of fancy secrets that make their search better. Folks speak of web search as a specialty product, not a commodity.

Bah! Web search is a commodity. Google hasn’t fundamentally altered its search engine in over five years. If companies have big new web search tricks up their sleeves, don’t you think we’d have seen them by now? And Yahoo! was able to more-or-less clone Google’s web search in fairly short order.

Google is more like Coca-Cola than like Eminem. Coca-Cola has a strong brand because of great marketing and operations. Like Google, Coke would like you to think that its secret recipe is its key to success, but fans of RC-Cola know better. Eminem can’t be so easily cloned. Yes, he has good marketing and operations, but he also has a dark secret in his soul that makes him harder to replace.