<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>Free Search</title>
	<atom:link href="http://blog.lucene.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.lucene.com</link>
	<description>Ramblings about Lucene, Nutch, Hadoop &#38; other stuff</description>
	<lastBuildDate>Tue, 12 May 2009 20:25:48 +0000</lastBuildDate>
	<generator>http://wordpress.com/</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<image>
		<url>http://www.gravatar.com/blavatar/5ec00bd7ebb1927f420d45acf7758d81?s=96&#038;d=http://s.wordpress.com/i/buttonw-com.png</url>
		<title>Free Search</title>
		<link>http://blog.lucene.com</link>
	</image>
			<item>
		<title>Some early Avro benchmarks</title>
		<link>http://blog.lucene.com/2009/05/12/some-early-avro-bencharks/</link>
		<comments>http://blog.lucene.com/2009/05/12/some-early-avro-bencharks/#comments</comments>
		<pubDate>Tue, 12 May 2009 20:00:54 +0000</pubDate>
		<dc:creator>Doug Cutting</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[avro]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[protobuf]]></category>
		<category><![CDATA[thrift]]></category>

		<guid isPermaLink="false">http://blog.lucene.com/?p=77</guid>
		<description><![CDATA[Avro is my current project.  It&#8217;s a slightly different take on data serialization.
Most data serialization systems, like Thrift and Protocol Buffers, rely on code generation, which can be awkward with dynamic languages and datasets.  For example, many folks write MapReduce programs in languages like Pig and Python, and generate datasets whose schema is [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p><a href="http://hadoop.apache.org/avro/">Avro</a> is my current project.  It&#8217;s a slightly different take on data serialization.</p>
<p>Most data serialization systems, like Thrift and Protocol Buffers, rely on code generation, which can be awkward with dynamic languages and datasets.  For example, many folks write MapReduce programs in languages like <a href="http://hadoop.apache.org/pig/">Pig</a> and <a href="http://blog.last.fm/2008/05/29/python-hadoop-flying-circus-elephant">Python</a>, and generate datasets whose schema is determined by the script that generates them.  One of the goals for Avro is to permit such applications to achieve high performance without forcing them to run external compilers.</p>
<p>A few early Avro benchmarks are now in.  A month ago, Johan Oskarsson (of Last.fm) <a href="http://blog.oskarsson.nu/2009/04/avro-serialization-follow-up.html">ran his serialization size benchmark</a> using Avro.  And today, Sharad Agarwal (my Avro collaborator) ran an <a href="http://code.google.com/p/thrift-protobuf-compare/">existing Java serialization benchmark</a> using Avro, and the <a href="http://mail-archives.apache.org/mod_mbox/hadoop-avro-dev/200905.mbox/%3C2C52DBBEC4855C438BB330CB0D3B46590131C93D@SNV-EXVS01.ds.corp.yahoo.com%3E">initial results</a> look decent.  Curiously, Avro&#8217;s generic (no code generation) and specific (generated classes) APIs diverged significantly and unexpectedly despite sharing much of their implementation.  This suggests that both might be easily improved.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://blog.lucene.com/2009/05/12/some-early-avro-bencharks/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/47a355bf58b0f6f57136bf90802bf333?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">cutting</media:title>
		</media:content>
	</item>
		<item>
		<title>Hadoop Sorts a Petabyte</title>
		<link>http://blog.lucene.com/2009/05/12/hadoop-sorts-a-petabyte/</link>
		<comments>http://blog.lucene.com/2009/05/12/hadoop-sorts-a-petabyte/#comments</comments>
		<pubDate>Tue, 12 May 2009 17:45:51 +0000</pubDate>
		<dc:creator>Doug Cutting</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.lucene.com/?p=72</guid>
		<description><![CDATA[Woot!  Owen and Arun have posted new Hadoop sort benchmark results.  This is a great milestone for both throughput (a petabyte in ~16 hours) and latency (a terabyte in ~1 minute).
]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>Woot!  Owen and Arun have posted <a href="http://developer.yahoo.net/blogs/hadoop/2009/05/hadoop_sorts_a_petabyte_in_162.html">new Hadoop sort benchmark results</a>.  This is a great milestone for both throughput (a petabyte in ~16 hours) and latency (a terabyte in ~1 minute).</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://blog.lucene.com/2009/05/12/hadoop-sorts-a-petabyte/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/47a355bf58b0f6f57136bf90802bf333?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">cutting</media:title>
		</media:content>
	</item>
		<item>
		<title>Cloud: commodity or proprietary?</title>
		<link>http://blog.lucene.com/2008/04/09/cloud-commodity-or-proprietary/</link>
		<comments>http://blog.lucene.com/2008/04/09/cloud-commodity-or-proprietary/#comments</comments>
		<pubDate>Wed, 09 Apr 2008 16:45:04 +0000</pubDate>
		<dc:creator>Doug Cutting</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[aws]]></category>
		<category><![CDATA[cloud]]></category>
		<category><![CDATA[hadoop]]></category>

		<guid isPermaLink="false">http://cutting.wordpress.com/?p=71</guid>
		<description><![CDATA[A few days ago Google announced its App Engine, which lets folks build applications that run in Google&#8217;s cloud.  Amazon has for a while had a number of services to let folks run applications in Amazon&#8217;s cloud.  But in both of these cases, one must use their proprietary APIs.
For example, Google provides a [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>A few days ago Google announced its <a href="http://code.google.com/appengine/">App Engine</a>, which lets folks build applications that run in Google&#8217;s cloud.  Amazon has for a while had a number of <a href="http://aws.amazon.com/">services</a> to let folks run applications in Amazon&#8217;s cloud.  But in both of these cases, one must use their proprietary APIs.</p>
<p>For example, Google provides a <a href="http://code.google.com/appengine/docs/datastore/">datastore</a> API that applications must use to persist state, while Amazon similarly provides a <a href="http://aws.amazon.com/simpledb">SimpleDB</a> API.  Amazon&#8217;s services are generally lower-level and easier to adopt &#224; la carte, while Google provides one-stop shopping.  Either way, one&#8217;s application code becomes dependent on a particular vendor.  This is in contrast to most web applications today, where, with things like the <a href="http://en.wikipedia.org/wiki/LAMP_%28software_bundle%29">LAMP stack</a>, folks can build vendor-neutral applications from free (as in beer) parts and select from a competitive, commodity hosting market.</p>
<p>As we shift applications to the cloud, do we want our code to remain vendor-neutral?  Or would we rather work in silos, where some folks build things to run in the Google cloud, some for the Amazon cloud, and others for the Microsoft cloud?  Once an application becomes sufficiently complex, moving it from one cloud to another becomes difficult, placing folks at the mercy of their cloud provider.</p>
<p>I think most would prefer not to be locked in, and would rather that cloud providers sold commodity services.  But how can we ensure that?</p>
<p>If we develop standard, non-proprietary cloud APIs with open-source implementations, then cloud providers can deploy these and compete on price, availability, performance, etc., giving developers usable alternatives.  But such APIs won&#8217;t be developed by the cloud providers.  They have every incentive to develop proprietary APIs in order to lock folks into their services.  Good open-source implementations will only come about if the community makes them a priority and builds them.</p>
<p><a href="http://hadoop.apache.org/">Hadoop</a> is a big initial step in this direction.  Its current focus is on batch computing, but several of its components are also key to cloud hosting.  <a href="http://hadoop.apache.org/core/docs/current/hdfs_design.html">HDFS</a> provides a scalable, distributed filesystem.  It doesn&#8217;t yet meet the high-availability requirements of cloud hosting, but once folks who need that help to build it, it will.  <a href="http://hadoop.apache.org/hbase/">HBase</a> provides a database comparable to Amazon&#8217;s SimpleDB and Google&#8217;s Datastore API.  It&#8217;s still young, but, if folks want, it could become a solid competitor to these.</p>
<p>Moral: if you want commodity cloud hosting, pitch in now.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://blog.lucene.com/2008/04/09/cloud-commodity-or-proprietary/feed/</wfw:commentRss>
		<slash:comments>17</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/47a355bf58b0f6f57136bf90802bf333?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">cutting</media:title>
		</media:content>
	</item>
		<item>
		<title>MapReduce cookbook for machine learning</title>
		<link>http://blog.lucene.com/2007/07/30/mapreduce-cookbook-for-machine-learning/</link>
		<comments>http://blog.lucene.com/2007/07/30/mapreduce-cookbook-for-machine-learning/#comments</comments>
		<pubDate>Mon, 30 Jul 2007 22:15:31 +0000</pubDate>
		<dc:creator>Doug Cutting</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.lucene.com/2007/07/30/mapreduce-cookbook-for-machine-learning/</guid>
		<description><![CDATA[Here&#8217;s a paper from Stanford showing how to use MapReduce to scalably implement ten different machine learning algorithms!
]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>Here&#8217;s a <a href="http://www.cs.stanford.edu/people/ang//papers/nips06-mapreducemulticore.pdf">paper from Stanford</a> showing how to use MapReduce to scalably implement <strong>ten</strong> different machine learning algorithms!</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://blog.lucene.com/2007/07/30/mapreduce-cookbook-for-machine-learning/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/47a355bf58b0f6f57136bf90802bf333?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">cutting</media:title>
		</media:content>
	</item>
		<item>
		<title>Scale-up versus Scale-out</title>
		<link>http://blog.lucene.com/2007/07/30/scale-up-versus-scale-out/</link>
		<comments>http://blog.lucene.com/2007/07/30/scale-up-versus-scale-out/#comments</comments>
		<pubDate>Mon, 30 Jul 2007 21:39:22 +0000</pubDate>
		<dc:creator>Doug Cutting</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.lucene.com/2007/07/30/scale-up-versus-scale-out/</guid>
		<description><![CDATA[I just ran across a paper from IBM comparing scaling-up (using bigger boxes) to scaling-out (using more boxes).  They use Nutch search as their workload, and conclude &#8220;&#8230; that scale-out solutions have an indisputable performance and price/performance advantage over scale-up for search workloads.&#8221;  Not exactly a big surprise, but it&#8217;s good to have [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>I just ran across a <a href="http://www.cecs.uci.edu/~papers/ipdps07/pdfs/SMTPS-201-paper-1.pdf">paper from IBM</a> comparing scaling-up (using bigger boxes) to scaling-out (using more boxes).  They use Nutch search as their workload, and conclude &#8220;&#8230; that scale-out solutions have an indisputable performance and price/performance advantage over scale-up for search workloads.&#8221;  Not exactly a big surprise, but it&#8217;s good to have objective data.  They also conclude that &#8220;Scale-out systems are still in a significant disadvantage with respect to scale-up when it comes to systems management.&#8221;  Hmm.  With frameworks like <a href="http://lucene.apache.org/hadoop/">Hadoop</a>, folks shouldn&#8217;t be bothered as much by the more frequent host failures that a scale-out system is prone to.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://blog.lucene.com/2007/07/30/scale-up-versus-scale-out/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/47a355bf58b0f6f57136bf90802bf333?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">cutting</media:title>
		</media:content>
	</item>
	</channel>
</rss>