<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>tomkleinpeter.com &#187; Toolbox</title>
	<atom:link href="http://www.tomkleinpeter.com/category/toolbox/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.tomkleinpeter.com</link>
	<description></description>
	<lastBuildDate>Sat, 23 Jan 2010 18:32:48 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Programmer&#8217;s Toolbox Part 3: Consistent Hashing</title>
		<link>http://www.tomkleinpeter.com/2008/03/17/programmers-toolbox-part-3-consistent-hashing/</link>
		<comments>http://www.tomkleinpeter.com/2008/03/17/programmers-toolbox-part-3-consistent-hashing/#comments</comments>
		<pubDate>Mon, 17 Mar 2008 20:32:11 +0000</pubDate>
		<dc:creator>Tom</dc:creator>
				<category><![CDATA[Toolbox]]></category>
		<category><![CDATA[consistent hashing]]></category>
		<category><![CDATA[dynamo]]></category>
		<category><![CDATA[memcached]]></category>
		<category><![CDATA[partitioning]]></category>

		<guid isPermaLink="false">http://www.spiteful.com/2008/03/17/programmers-toolbox-part-3-consistent-hashing/</guid>
		<description><![CDATA[Next up in the toolbox series is an idea so good it deserves an entire article all to itself: consistent hashing.  
Let’s say you&#8217;re a hot startup and your database is starting to slow down.  You decide to cache some results so that you can render web pages more quickly.  If you [...]]]></description>
			<content:encoded><![CDATA[<p>Next up in the toolbox series is an idea so good it deserves an entire article all to itself: consistent hashing.  </p>
<p>Let’s say you&#8217;re a hot startup and your database is starting to slow down.  You decide to cache some results so that you can render web pages more quickly.  If you want your cache to use multiple servers (scale horizontally, in the biz), you’ll need some way of picking the right server for a particular key.  If you only have 5 to 10 minutes allocated for this problem on your development schedule, you’ll end up using what is known as the naïve solution: put your N server IPs in an array and pick one using key % N. </p>
<p>I kid, I kid &#8212; I know you don&#8217;t have a development schedule.  That&#8217;s OK.  You&#8217;re a startup.</p>
<p>Anyway, this ultra simple solution has some nice characteristics and may be the right thing to do.  But your first major problem with it is that as soon as you add a server and change N, most of your cache will become invalid.  Your databases will wail and gnash their teeth as practically everything has to be pulled out of the DB and stuck back into the cache.  If you’ve got a popular site, what this really means is that someone is going to have to wait until 3am to add servers because that is the only time you can handle having a busted cache.  Poor Asia and Europe &#8212; always getting screwed by late night server administration.</p>
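<p>To put a rough number on "most of your cache", here is a minimal sketch that counts how many keys change servers when N grows. The key names and the helper functions are made up for illustration. With a uniform hash, going from 4 to 5 servers should move about 80% of the keys, because a key stays put only when h % 4 == h % 5, which holds for 4 out of every 20 hash values.</p>

```python
import hashlib

def key_to_int(key: str) -> int:
    """Hash a string key to a non-negative integer (MD5 is used only to spread keys)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")

def modulo_server(key: str, n_servers: int) -> int:
    """The naive scheme: index into an array of N servers with key % N."""
    return key_to_int(key) % n_servers

def fraction_moved(keys, old_n: int, new_n: int) -> float:
    """Fraction of keys whose server changes when N goes from old_n to new_n."""
    moved = sum(1 for k in keys if modulo_server(k, old_n) != modulo_server(k, new_n))
    return moved / len(keys)

# Hypothetical cache keys, purely for the experiment.
keys = [f"page:{i}" for i in range(10_000)]
print(f"{fraction_moved(keys, 4, 5):.0%} of keys moved")
```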
<p>You’ll have a second problem if your cache is read-through or you have some sort of processing occurring alongside your cached data.  What happens if one of your cache servers fails?  Do you just fail the requests that should have used that server?  Do you dynamically change N?  In either case, I recommend you save the angriest twitters about your site being down.  One day you&#8217;ll look back and laugh.  One day.</p>
<p>As I said, though, that might be OK.  You may be trying to crank this whole project out over the weekend and simply not have time for a better solution.  That is how I wrote the caching layer for Audiogalaxy searches, and that turned out OK.  The caching part, at least.  But if I had known about it at the time, I would have started with a simple version of consistent hashing.  It isn’t that much more complicated to implement and it gives you a lot of flexibility down the road. </p>
<p>The technical aspects of consistent hashing have been well explained in other places, and you’re crazy and negligent if you use this as your only reference.  But, I’ll try to do my best.  Consistent hashing is a technique that lets you smoothly handle these problems:</p>
<ul>
<li>Given a resource key and a list of servers, how do you find a primary, secondary, tertiary (and on down the line) server for the resource?</li>
<li>If you have different size servers, how do you assign each of them an amount of work that corresponds to their capacity?</li>
<li>How do you smoothly add capacity to the system without downtime?  Specifically, this means solving two problems:
<ul>
<li>How do you avoid dumping 1/N of the total load on a new server as soon as you turn it on?</li>
<li>How do you avoid rehashing more existing keys than necessary?</li>
</ul>
</li>
</ul>
<p>In a nutshell, here is how it works.  Imagine a 64-bit space.  For bonus points, visualize it as a ring, or a clock face.  Sure, this will make it more complicated when you try to explain it to your boss, but bear with me:</p>
<p><center><img src='http://www.spiteful.com/wp-content/uploads/2008/03/consistent_hashing_simple.png' alt='consistent_hashing_simple.png' /></center></p>
<p>That part isn’t very complicated.  </p>
<p>Now imagine hashing resources into points on the circle. They could be URLs, GUIDs, integer IDs, or any arbitrary sequence of bytes.  Just run them through MD5 or SHA and shave off everything but 8 bytes (and if anyone tells you that you shouldn’t use MD5 for this because it isn’t secure, just nod and back away slowly.  You have identified someone not worth arguing with).  Now, take those freshly minted 64-bit numbers and stick them onto the circle:</p>
<p><center><img src='http://www.spiteful.com/wp-content/uploads/2008/03/consistent_hashing_resources.png' alt='consistent_hashing_resources.png' /></center></p>
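<p>The hashing step might look something like this. It is only a sketch: keeping the first 8 bytes (rather than the last) and reading them big-endian are arbitrary choices of mine, and ring_point is a made-up name.</p>

```python
import hashlib

def ring_point(data: bytes) -> int:
    """Map arbitrary bytes to a point on the 64-bit ring.

    MD5 is used purely to spread keys evenly, not for security.
    """
    digest = hashlib.md5(data).digest()       # 16 bytes
    return int.from_bytes(digest[:8], "big")  # shave off everything but 8 bytes

# Example resources (hypothetical): a URL and an integer-ish ID.
for resource in (b"http://example.com/index.html", b"user:42"):
    print(resource, hex(ring_point(resource)))
```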
<p>Finally, imagine your servers.  Imagine that you take your first server and create a string by appending the number 1 to its IP. Let&#8217;s call that string IP1-1.  Next, imagine you have a second server that has twice as much memory as server 1.  Start with server #2’s IP, and create 2 strings from it by appending 1 for the first one and 2 for the second one.  Call those strings IP2-1 and IP2-2.  Finally, imagine you have a third server that is exactly the same as your first server, and create the string IP3-1.  Now, take all those strings, hash them into 64-bit numbers, and stick them on the circle with your resources:</p>
<p><center><img src='http://www.spiteful.com/wp-content/uploads/2008/03/consistent_hashing_full.png' alt='consistent_hashing_full.png' /></center></p>
<p>Can you see where this is headed?  You have just solved the problem of which server to use for resource A.  You start where resource A is and head clockwise on the ring until you hit a server.  If that server is down, you go to the next one, and so on and so forth.  In practice, you’ll want to use more than 1 or 2 points for each server, but I’ll leave those details as an exercise for you, dear reader.</p>
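<p>Here is one way the ring and the clockwise walk could be sketched. The HashRing class, the is_up callback, and the example IPs are all assumptions for illustration. A sorted list plus binary search stands in for the circle, and the modulo on the index is what makes it wrap around.</p>

```python
import bisect
import hashlib

def ring_point(text: str) -> int:
    """64-bit ring position: first 8 bytes of the MD5 of the text."""
    return int.from_bytes(hashlib.md5(text.encode("utf-8")).digest()[:8], "big")

class HashRing:
    def __init__(self, servers):
        """servers maps a server IP to its number of points on the ring."""
        self._points = []  # (ring position, server IP) pairs
        for ip, count in servers.items():
            for i in range(1, count + 1):
                # "IP-1", "IP-2", ... like the IP1-1 / IP2-2 strings above
                self._points.append((ring_point(f"{ip}-{i}"), ip))
        self._points.sort()
        self._positions = [pos for pos, _ in self._points]

    def lookup(self, key: str, is_up=lambda ip: True):
        """Walk clockwise from the key's position to the first live server."""
        start = bisect.bisect_left(self._positions, ring_point(key))
        for step in range(len(self._points)):
            _, ip = self._points[(start + step) % len(self._points)]  # wrap around
            if is_up(ip):
                return ip
        return None  # every server is down

# Server 2 has twice the memory, so it gets two points.
ring = HashRing({"10.0.0.1": 1, "10.0.0.2": 2, "10.0.0.3": 1})
primary = ring.lookup("resource-A")
fallback = ring.lookup("resource-A", is_up=lambda ip: ip != primary)
print(primary, fallback)
```

<p>The second lookup pretends the primary is down, so the walk simply continues to the next server's point.</p>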
<p>Now, allow me to use bullet points to explain how cool this is:</p>
<ul>
<li>Assuming you’ve used a lot more than 1 point per server, when one server goes down, every other server will get a share of the new load.  In the case above, imagine what happens when server #2 goes down.  Resource A shifts to server #1, and resource B shifts to server #3 (Note that this won’t help if all of your servers are already at 100% capacity.  Call your VC and ask for more funding).</li>
<li>You can tune the amount of load you send to each server based on that server’s capacity.  Imagine this spatially – more points for a server means it covers more of the ring and is more likely to get more resources assigned to it.
<p>You could have a process try to tune this load dynamically, but be aware that you&#8217;ll be stepping close to problems that control theory was built to solve.  Control theory is more complicated than consistent hashing.</li>
<li>If you store your server list in a database (2 columns: IP address and number of points), you can bring servers online slowly by gradually increasing the number of points they use.  This is particularly important for services that are disk bound and need time for the kernel to fill up its caches.  This is one way to deal with the datacenter variant of the <a href="http://en.wikipedia.org/wiki/Thundering_herd_problem">Thundering Herd Problem</a>.
<p>Here I go again with the control theory &#8212; you <em>could</em> do this automatically.  But adding capacity usually happens so rarely that just having somebody sitting there watching top and running SQL updates is probably fine.  Of course, EC2 changes everything, so maybe you’ll be hitting the books after all.</li>
<li>If you are really clever, when everything is running smoothly you can go ahead and pay the cost of storing items on both their primary and secondary cache servers.  That way, when one server goes down, you’ve probably got a backup cache ready to go.</li>
</ul>
<p>Pretty cool, eh?</p>
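<p>If you want to see the first couple of bullets in action, a small experiment along these lines should do it. Everything here (server names, point counts, key names) is made up. The sketch gives one server twice as many points as the others, then adds a fourth server and checks which keys moved.</p>

```python
import bisect
import hashlib
from collections import Counter

def ring_point(text: str) -> int:
    """64-bit ring position: first 8 bytes of the MD5 of the text."""
    return int.from_bytes(hashlib.md5(text.encode("utf-8")).digest()[:8], "big")

def build_ring(servers):
    """Sorted (position, server) pairs; servers maps name -> number of points."""
    return sorted((ring_point(f"{name}-{i}"), name)
                  for name, count in servers.items()
                  for i in range(1, count + 1))

def owner(ring, key: str) -> str:
    """Server owning the first point at or clockwise after the key's position."""
    i = bisect.bisect_left(ring, (ring_point(key), ""))
    return ring[i % len(ring)][1]

keys = [f"page:{i}" for i in range(20_000)]
before = build_ring({"s1": 100, "s2": 200, "s3": 100})
after = build_ring({"s1": 100, "s2": 200, "s3": 100, "s4": 100})

load = Counter(owner(before, k) for k in keys)
moved = [k for k in keys if owner(before, k) != owner(after, k)]

print(load)                              # s2 has twice the points of s1 and s3
print(len(moved) / len(keys))            # share of keys that changed servers
print({owner(after, k) for k in moved})  # which servers gained keys
```

<p>The share of moved keys should land near 100/500, the new server's share of the points, and every moved key ends up on the new server. Compare that with the key % N scheme, where most keys move.</p>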
<p>I want to hammer on point #4 for a minute.  If you are building a big system, you really need to consider what happens when machines fail.  If the answer is “we crush the databases,” congratulations: you will get to observe a cascading failure.  I love this stuff, so hearing about cascading failures makes me smile.  But it won’t have the same effect on your users.  </p>
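<p>For the primary-plus-secondary idea, a rough sketch might look like the following. The in-memory dicts stand in for real cache servers, and cache_set, cache_get and the server names are invented for the example. The only real trick is skipping repeated points so that the secondary is a different machine from the primary.</p>

```python
import bisect
import hashlib

def ring_point(text: str) -> int:
    """64-bit ring position: first 8 bytes of the MD5 of the text."""
    return int.from_bytes(hashlib.md5(text.encode("utf-8")).digest()[:8], "big")

def build_ring(servers):
    """Sorted (position, server) pairs; servers maps name -> number of points."""
    return sorted((ring_point(f"{name}-{i}"), name)
                  for name, count in servers.items()
                  for i in range(1, count + 1))

def preference_list(ring, key: str, replicas: int = 2):
    """Distinct servers met walking clockwise from the key: primary, secondary, ..."""
    start = bisect.bisect_left(ring, (ring_point(key), ""))
    chosen = []
    for step in range(len(ring)):
        server = ring[(start + step) % len(ring)][1]
        if server not in chosen:          # skip further points of the same machine
            chosen.append(server)
            if len(chosen) == replicas:
                break
    return chosen

ring = build_ring({"cache-a": 50, "cache-b": 50, "cache-c": 50})
caches = {name: {} for name in ("cache-a", "cache-b", "cache-c")}  # stand-in servers

def cache_set(key, value):
    for server in preference_list(ring, key):  # write to primary and secondary
        caches[server][key] = value

def cache_get(key, down=()):
    for server in preference_list(ring, key):
        if server not in down and key in caches[server]:
            return caches[server][key]
    return None  # miss: the caller falls back to the database

cache_set("user:1", "profile data")
primary = preference_list(ring, "user:1")[0]
print(cache_get("user:1", down={primary}))  # served by the secondary
```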
<p>Finally, you may not know this, but you use consistent hashing every time you put something in your cart at Amazon.com.  Their <a href="http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf">massively scalable data store, Dynamo</a>, uses this technique.  Or if you use Last.fm, you’ve used a great combination: consistent hashing + memcached.  They were kind enough to <a href="http://www.last.fm/user/RJ/journal/2007/04/10/392555/">release their changes</a>, so if you are using memcached, you can just use their code without dealing with these messy details.  But keep in mind that there are more applications to this idea than just simple caching.  Consistent hashing is a powerful idea for anyone building services that have to scale across a group of computers.</p>
<p>A few more links:</p>
<ul>
<li><a href="http://www.lexemetech.com/2007/11/consistent-hashing.html">Another blog post about consistent hashing</a></li>
<li><a href="http://citeseer.ist.psu.edu/karger97consistent.html">The original paper</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.tomkleinpeter.com/2008/03/17/programmers-toolbox-part-3-consistent-hashing/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>5 more essentials for your programming toolbox</title>
		<link>http://www.tomkleinpeter.com/2008/02/25/5-more-essentials-for-your-programming-toolbox/</link>
		<comments>http://www.tomkleinpeter.com/2008/02/25/5-more-essentials-for-your-programming-toolbox/#comments</comments>
		<pubDate>Mon, 25 Feb 2008 22:56:49 +0000</pubDate>
		<dc:creator>Tom</dc:creator>
				<category><![CDATA[Toolbox]]></category>
		<category><![CDATA[Berkeley DB]]></category>
		<category><![CDATA[Skip Lists]]></category>
		<category><![CDATA[Soft Deletes]]></category>
		<category><![CDATA[Unrolled Linked Lists]]></category>
		<category><![CDATA[VEB Trees]]></category>

		<guid isPermaLink="false">http://www.spiteful.com/2008/02/25/5-more-essentials-for-your-programming-toolbox/</guid>
		<description><![CDATA[Following up on my first post, What should be in your programming toolbox, here are a few more ideas from my list:
Unrolled Linked Lists
Wikipedia has more details, but essentially, an unrolled linked list is great because it will give you better performance than a regular linked list, is more cache friendly, and will probably have [...]]]></description>
			<content:encoded><![CDATA[<p>Following up on my first post, <a href="http://www.spiteful.com/2008/02/23/what-should-be-in-your-programming-toolbox/">What should be in your programming toolbox</a>, here are a few more ideas from my list:</p>
<p><a name="Unrolled_Linked_Lists"><strong>Unrolled Linked Lists</strong></a></p>
<p>Wikipedia has <a href="http://en.wikipedia.org/wiki/Unrolled_linked_list">more details</a>, but essentially, an unrolled linked list is great because it will give you better performance than a regular linked list, is more cache friendly, and will probably have much less overhead.  The basic idea is to store an array of elements at each node rather than a single one.  This keeps your pointers closer together, which will make your cache happy when you are iterating through your items.</p>
<p>To compare the overhead for an unrolled list, think for a second about a regular linked list node.  It probably looks something like this:</p>
<p><img src="http://www.spiteful.com/wp-content/uploads/2008/02/simple.png" alt="simple.png" /></p>
<p>Assuming you&#8217;ve got 4 byte pointers, each node is going to take 8 bytes.  But the allocation overhead for the node could be anywhere between 8 and 16 bytes.  Let&#8217;s go with the best case and assume it will be 8 bytes.  So, if you want to store 1K items in this list, you are going to have 16KB of overhead.</p>
<p>Now, let&#8217;s think about an unrolled linked list node.  It will look something like this:</p>
<p><img src="http://www.spiteful.com/wp-content/uploads/2008/02/unrolled.png" alt="unrolled.png" /></p>
<p>Therefore, allocating a single node (12 bytes + 8 bytes of overhead) with an array of 100 elements (400 bytes + 8 bytes of overhead) will now cost 428 bytes, or 4.28 bytes per element.  Thinking about our 1K items from above, it would take about 4.2KB of overhead, which is close to 4x better than our original list.  Even if the list becomes severely fragmented and the item arrays are only 1/2 full on average, this is still an improvement.  Also, note that you can tune the array size to whatever gets you the best overhead for your application.</p>
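<p>Here is a minimal sketch of the structure, plus the arithmetic above as a function. Python lists won't give you the memory layout of a real C array, so treat this as a picture of the shape rather than a benchmark. The class names and the append-only interface are my own simplifications.</p>

```python
class Node:
    """One unrolled node: a fixed-capacity block of items plus a next pointer."""
    def __init__(self, capacity):
        self.items = []          # holds at most `capacity` items
        self.capacity = capacity
        self.next = None

class UnrolledLinkedList:
    def __init__(self, node_capacity=100):
        self.node_capacity = node_capacity
        self.head = self.tail = Node(node_capacity)
        self.node_count = 1

    def append(self, item):
        if len(self.tail.items) == self.tail.capacity:
            # Tail block is full: allocate one new node for the next batch of items.
            new_node = Node(self.node_capacity)
            self.tail.next = new_node
            self.tail = new_node
            self.node_count += 1
        self.tail.items.append(item)

    def __iter__(self):
        node = self.head
        while node is not None:
            yield from node.items   # walk a whole block before following a pointer
            node = node.next

def bytes_per_item(node_bytes=12, array_slots=100, pointer_bytes=4, alloc_overhead=8):
    """The arithmetic from the paragraph above, using the same assumed sizes."""
    total = (node_bytes + alloc_overhead) + (array_slots * pointer_bytes + alloc_overhead)
    return total / array_slots

items = UnrolledLinkedList(node_capacity=100)
for i in range(1000):
    items.append(i)
print(items.node_count, bytes_per_item())
```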
<p><a name="Van_Emde_Boas_Trees"><strong>Van Emde Boas Trees</strong></a></p>
<p>I&#8217;ve used unrolled linked lists before, but I&#8217;ve never used this slightly more exotic structure.  I&#8217;m just very fond of low overhead data structures that are built for storing large amounts of information, and this is a clever one. VEB trees have good overhead characteristics, but they are also particularly fast &#8212; they implement all operations in O(log m), where m is the key length.  So, if you use an optimal key length for your number of items, you can do everything in O(log log n), which is obviously an improvement over your standard O(log n) binary tree.   The Wikipedia page <a href="http://en.wikipedia.org/wiki/Van_Emde_Boas_tree">has more details</a>, but sadly, I haven&#8217;t been able to find any sort of library implementation.</p>
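<p>For the curious, here is a compact sketch of the idea supporting insert, membership and successor. It follows the usual textbook layout (split the key into high and low halves, keep the minimum out of the clusters), but the lazily created dict of clusters and the lack of delete are my own shortcuts, so take it as a teaching toy and not a tuned implementation.</p>

```python
class VEB:
    """Van Emde Boas tree over integers in [0, 2**bits). Insert, lookup, successor only."""

    def __init__(self, bits):
        self.bits = bits
        self.min = self.max = None
        if bits > 1:
            self.low_bits = bits // 2
            self.high_bits = bits - self.low_bits
            self.summary = None   # VEB over the high halves, created on demand
            self.clusters = {}    # high half -> VEB over the low halves

    def _split(self, x):
        return x >> self.low_bits, x & ((1 << self.low_bits) - 1)

    def insert(self, x):
        if self.min is None:
            self.min = self.max = x          # empty tree: constant time
            return
        if x == self.min:
            return
        if x < self.min:
            x, self.min = self.min, x        # new minimum; push the old one down
        if x > self.max:
            self.max = x
        if self.bits > 1:
            high, low = self._split(x)
            cluster = self.clusters.get(high)
            if cluster is None:
                cluster = self.clusters[high] = VEB(self.low_bits)
                if self.summary is None:
                    self.summary = VEB(self.high_bits)
                self.summary.insert(high)    # the only non-trivial recursive call here
            cluster.insert(low)              # constant time when the cluster was empty

    def __contains__(self, x):
        if x == self.min or x == self.max:
            return True
        if self.bits == 1:
            return False
        high, low = self._split(x)
        cluster = self.clusters.get(high)
        return cluster is not None and low in cluster

    def successor(self, x):
        """Smallest stored value strictly greater than x, or None."""
        if self.min is not None and x < self.min:
            return self.min
        if self.bits == 1:
            return 1 if (x == 0 and self.max == 1) else None
        if self.max is None or x >= self.max:
            return None
        high, low = self._split(x)
        cluster = self.clusters.get(high)
        if cluster is not None and low < cluster.max:
            return (high << self.low_bits) | cluster.successor(low)
        next_high = self.summary.successor(high)   # jump to the next non-empty cluster
        return (next_high << self.low_bits) | self.clusters[next_high].min

tree = VEB(16)
for value in (5, 1000, 3, 65535, 42):
    tree.insert(value)
print(42 in tree, 43 in tree, tree.successor(42))
```

<p>Each call recurses on a key half the length of the one before, which is where the O(log m) bound for an m-bit key comes from.</p>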
<p><a name="Soft_Deletes"><strong>Soft Deletes</strong></a></p>
<p>If you want to be extra cautious about removing items from a database or you just want to support undelete, consider using Soft Deletes for your databases.  Just add a &#8220;TimeDeleted&#8221; column and consider the row deleted when the time is non-zero.  You can use a clean up script to actually remove the item after some amount of time has passed.</p>
<p>This technique isn&#8217;t without some hassles, though &#8212; you may have to add some extra logic or include the TimeDeleted column as part of a unique index if you want to allow the user to insert a duplicate item back into the table.</p>
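<p>As a sketch of how this could look, here is a small example using SQLite from Python. The bookmarks table, its columns other than TimeDeleted, and the helper functions are all invented for illustration. Putting TimeDeleted in the unique index is what lets a user re-add something they deleted earlier.</p>

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE bookmarks (
        id          INTEGER PRIMARY KEY,
        user_id     INTEGER NOT NULL,
        url         TEXT    NOT NULL,
        TimeDeleted INTEGER NOT NULL DEFAULT 0   -- 0 means "not deleted"
    )
""")
# TimeDeleted is part of the unique index, so a deleted row does not block a re-insert.
conn.execute("CREATE UNIQUE INDEX live_bookmark ON bookmarks (user_id, url, TimeDeleted)")

def add(user_id, url):
    conn.execute("INSERT INTO bookmarks (user_id, url) VALUES (?, ?)", (user_id, url))

def soft_delete(user_id, url, now=None):
    now = int(time.time()) if now is None else now
    conn.execute(
        "UPDATE bookmarks SET TimeDeleted = ? "
        "WHERE user_id = ? AND url = ? AND TimeDeleted = 0",
        (now, user_id, url))

def live_urls(user_id):
    rows = conn.execute(
        "SELECT url FROM bookmarks WHERE user_id = ? AND TimeDeleted = 0", (user_id,))
    return [url for (url,) in rows]

def purge(older_than):
    """The clean-up script: really remove rows soft-deleted before `older_than`."""
    conn.execute(
        "DELETE FROM bookmarks WHERE TimeDeleted != 0 AND TimeDeleted < ?", (older_than,))

add(1, "http://example.com/")
soft_delete(1, "http://example.com/", now=100)
add(1, "http://example.com/")          # allowed: the old row has TimeDeleted = 100
print(live_urls(1))
```

<p>One hassle this sketch does not solve: deleting the same item twice within the same second would collide on the index, so a real system may want a finer-grained timestamp.</p>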
<p><a name="Skip_Lists"><strong>Skip Lists</strong></a></p>
<p>I&#8217;m not sure if I would ever actually use skip lists because it seems like a cache unfriendly data structure.  However,  I still like <a href="http://en.wikipedia.org/wiki/Skip_list">the idea</a>. In a nutshell, you start with a normal, sorted linked list.  Then, on some nodes you add pointers that let you skip forward more than one node at a time.  With a proper distribution, this allows searching, inserting, and deleting in expected O(log N) time, much better than the O(N) of a plain linked list.</p>
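<p>A bare-bones version might look like this. It is a sketch with search and insert only. The cap of 16 levels and the coin-flip probability of 1/2 are conventional choices I picked, not anything special.</p>

```python
import random

class SkipNode:
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * level   # forward[i] is the next node at level i

class SkipList:
    MAX_LEVEL = 16

    def __init__(self, seed=None):
        self.head = SkipNode(None, self.MAX_LEVEL)  # sentinel, holds no key
        self.level = 1                               # levels currently in use
        self.rng = random.Random(seed)

    def _random_level(self):
        # Each extra level is kept with probability 1/2.
        level = 1
        while level < self.MAX_LEVEL and self.rng.random() < 0.5:
            level += 1
        return level

    def _find_predecessors(self, key):
        """For every level, the last node whose key is less than `key`."""
        update = [self.head] * self.MAX_LEVEL
        node = self.head
        for i in reversed(range(self.level)):
            while node.forward[i] is not None and node.forward[i].key < key:
                node = node.forward[i]   # skip ahead on the higher levels first
            update[i] = node
        return update

    def __contains__(self, key):
        candidate = self._find_predecessors(key)[0].forward[0]
        return candidate is not None and candidate.key == key

    def insert(self, key):
        update = self._find_predecessors(key)
        candidate = update[0].forward[0]
        if candidate is not None and candidate.key == key:
            return                        # already present
        level = self._random_level()
        self.level = max(self.level, level)
        node = SkipNode(key, level)
        for i in range(level):            # splice the node into each of its levels
            node.forward[i] = update[i].forward[i]
            update[i].forward[i] = node

    def __iter__(self):
        node = self.head.forward[0]
        while node is not None:
            yield node.key
            node = node.forward[0]

numbers = SkipList(seed=1)
for n in (5, 1, 9, 3, 7, 3):
    numbers.insert(n)
print(list(numbers), 7 in numbers, 4 in numbers)
```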
<p><a name="Berkeley_DB"><strong>Berkeley DB </strong></a></p>
<p>If you need to store some structured information on a disk, you need to have a really good reason to implement your own storage layer.  Using <a href="http://en.wikipedia.org/wiki/Berkeley_DB">Berkeley DB</a> gets you a simple, well-known and well-debugged disk-backed store without a lot of overhead.  Plus, you get things like hot backups and locking.</p>
<p>Note that Amazon is using Berkeley DB as one of the underlying stores for <a href="http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf">Dynamo</a>.  If you haven&#8217;t read the Dynamo paper (or for that matter, the <a href="http://labs.google.com/papers/bigtable.html">BigTable</a> and <a href="http://research.google.com/archive/gfs-sosp2003.pdf">GFS</a> papers), you really should.  I didn&#8217;t mention it before, but BigTable uses one of the ideas from my <a href="http://www.spiteful.com/2008/02/23/what-should-be-in-your-programming-toolbox/">last post</a>, Bloom Filters.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.tomkleinpeter.com/2008/02/25/5-more-essentials-for-your-programming-toolbox/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
		<item>
		<title>What should be in your programming toolbox?</title>
		<link>http://www.tomkleinpeter.com/2008/02/23/what-should-be-in-your-programming-toolbox/</link>
		<comments>http://www.tomkleinpeter.com/2008/02/23/what-should-be-in-your-programming-toolbox/#comments</comments>
		<pubDate>Sun, 24 Feb 2008 05:51:24 +0000</pubDate>
		<dc:creator>Tom</dc:creator>
				<category><![CDATA[Toolbox]]></category>
		<category><![CDATA[Bloom Filters]]></category>
		<category><![CDATA[Message Passing]]></category>
		<category><![CDATA[MS Manners]]></category>

		<guid isPermaLink="false">http://www.spiteful.com/2008/02/23/what-should-be-in-your-programming-toolbox/</guid>
		<description><![CDATA[I really enjoy writing the code that makes systems like Audiogalaxy and FolderShare run.  Getting into the zone and really getting some good work done is a great experience, but remains my second favorite aspect of the job.  For me, the best part is the design phase before the real coding starts.  [...]]]></description>
			<content:encoded><![CDATA[<p>I really enjoy writing the code that makes systems like Audiogalaxy and FolderShare run.  Getting into the zone and really getting some good work done is a great experience, but remains my second favorite aspect of the job.  For me, the best part is the design phase before the real coding starts.  At that point, everything is totally fluid and malleable.  I&#8217;m making the decisions that I&#8217;m going to live with for the next few years, and putting some extra cleverness or flexibility into the system can have huge payoffs.</p>
<p>Something I&#8217;ve found very helpful at this phase is a &#8220;programming toolbox&#8221; &#8212; a simple list of good ideas and approaches to different problems.  When I&#8217;m stuck on a problem, or trying to generate a new approach to something, it can be helpful to flip through the list.  Most of the ideas won&#8217;t apply, but sometimes it will spark something novel.  To keep this list from getting too unwieldy, I&#8217;ll post a few at a time as I write them up.</p>
<p><a name="Bloom_Filters"><strong>Bloom Filters</strong></a><br />
Let&#8217;s start with one of my favorite data structures.  <a href="http://en.wikipedia.org/wiki/Bloom_filter" target="_blank">Wikipedia</a> has all the details, but here are the key points you need to know.  Bloom filters are a &#8220;space-efficient probabilistic data structure&#8221; that:</p>
<ul>
<li>Will answer either &#8220;maybe&#8221; or &#8220;no&#8221; to the question, &#8220;Have you seen item X?&#8221;</li>
<li>Don&#8217;t store the actual item</li>
<li>Are incredibly space efficient &#8212; it takes about 10 bits per stored item to have a 1% error rate</li>
<li>Are tunable &#8212; the error rate drops 10x for every extra 5 bits you use</li>
</ul>
<p>Just as an example, let&#8217;s imagine you want to cache copies of web pages on your disk.  Before taking the hit to access the disk to check if something is cached, you could use a Bloom Filter to keep track of what you have.  Storing 100K URLs in a Bloom Filter would use just 125KB of memory.  By contrast, if you stored UTF-8 versions of them in a hash table, assuming URLs are about 50 characters long, you would need about 4.7MB of memory, not even counting hash table overhead.  What is the downside?  About 1% of the time, you will get an incorrect &#8220;Maybe&#8221; from the Bloom Filter and would go to disk looking for something that does not exist.  If 1% is too much, you can double the memory usage to 250KB and the error rate drops to 0.01%.  Note that if the filter tells you &#8220;No&#8221;, there is a 0% chance that the URL is in the system.</p>
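<p>To make the numbers above concrete, here is a small sketch of a Bloom filter sized at 10 bits per item. The class and method names are mine, and deriving all the bit positions from a single MD5 digest (double hashing) is just one convenient assumption. A production filter would likely use faster hashes.</p>

```python
import hashlib
import math

class BloomFilter:
    def __init__(self, expected_items, bits_per_item=10):
        self.num_bits = expected_items * bits_per_item
        # The optimal number of hash functions is (bits per item) * ln 2: about 7 for 10 bits.
        self.num_hashes = max(1, round(bits_per_item * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive k bit positions from one MD5 digest via double hashing (an assumed scheme).
        digest = hashlib.md5(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        """False means the item was definitely never added; True means "maybe"."""
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

urls = BloomFilter(expected_items=100_000)    # 1,000,000 bits = 125,000 bytes
urls.add("http://example.com/")
print(len(urls.bits))
print(urls.might_contain("http://example.com/"))
print(urls.might_contain("http://example.com/missing"))
```

<p>Note that the filter only ever stores bits, never the URLs themselves, which is where the space savings come from.</p>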
<p><a name="Generic_Headers"><strong>Generic Headers</strong></a><br />
Are you defining any sort of message passing layer in your system?  One thing you might want to consider is allowing unspecified header/value pairs to be added to the messages at any point later.  Anything reading a message should simply ignore any headers it doesn&#8217;t recognize.  This flexibility is baked into HTTP and has allowed all sorts of wonderful extensions beyond the basic HTTP spec.  Adding this flexibility to your message layer from the beginning is trivial and it might let you solve a difficult problem later without completely upgrading all of your services.</p>
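<p>A sketch of what that might look like, assuming a simple HTTP-like text format. The header names and the encode/decode helpers are made up for the example. The important line is the one where the reader drops headers it doesn't recognize instead of raising an error.</p>

```python
def encode_message(headers: dict, body: str) -> str:
    """Serialize as 'Name: value' lines, a blank line, then the body (HTTP-like)."""
    lines = [f"{name}: {value}" for name, value in headers.items()]
    return "\n".join(lines) + "\n\n" + body

def decode_message(raw: str):
    head, _, body = raw.partition("\n\n")
    headers = {}
    for line in head.splitlines():
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return headers, body

KNOWN_HEADERS = {"message-id", "reply-to"}   # what this (older) reader understands

def handle(raw: str):
    headers, body = decode_message(raw)
    # Unrecognized headers are silently ignored rather than treated as errors.
    understood = {k: v for k, v in headers.items() if k in KNOWN_HEADERS}
    return understood, body

# A newer sender adds a tracing header the old reader has never heard of.
raw = encode_message({"Message-Id": "17", "X-Trace-Id": "abc123"}, "hello")
print(handle(raw))
```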
<p><a name="MS_Manners"><strong>MS Manners</strong></a><br />
Officially &#8220;Progress-based regulation of low-importance processes&#8221;, but known as MS Manners in <a href="http://research.microsoft.com/research/pubs/view.aspx?id=737&amp;type=Publication" target="_blank">this paper</a> out of MS Research.  The basic idea is that a background process can easily detect its impact on the user (or a more important process) if a few conditions are met.  For processes with some degree of contention, the background process can establish a baseline of how long work should take.  If it detects that recent work is taking longer, it can infer that it is impacting the more important process and wait until later to proceed.  It sounds pretty simple, but the researchers have done the hard work in the paper and shown how to squeeze out the most gains.  I think this is a powerful idea worth incorporating into any sort of maintenance job that runs against a production database.</p>
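<p>To give a flavor of the idea, here is a deliberately crude sketch. It is not the algorithm from the paper, which does careful statistics. The 1.5x slowdown threshold, the smoothing factor and the doubling pause are all numbers I made up. The shape is the same, though: learn how long a unit of work normally takes, and back off when recent work runs slower than that.</p>

```python
class ProgressRegulator:
    """Decides whether a background job should back off, based on its own progress."""

    def __init__(self, slowdown_threshold=1.5, smoothing=0.1, max_pause=60.0):
        self.baseline = None            # smoothed duration of one unit of work
        self.threshold = slowdown_threshold
        self.smoothing = smoothing
        self.pause = 0.0
        self.max_pause = max_pause

    def record(self, duration: float) -> float:
        """Feed in how long the last unit of work took; returns seconds to sleep next."""
        if self.baseline is None:
            self.baseline = duration    # first sample seeds the baseline
            return 0.0
        if duration > self.threshold * self.baseline:
            # Slower than usual: assume we are contending with something important.
            self.pause = min(self.max_pause, max(1.0, self.pause * 2))
        else:
            # Normal progress: fold the sample into the baseline and stop pausing.
            self.baseline += self.smoothing * (duration - self.baseline)
            self.pause = 0.0
        return self.pause

regulator = ProgressRegulator()
for seconds in (1.0, 1.1, 3.0, 3.0, 1.0):   # simulated batch durations
    print(seconds, "->", regulator.record(seconds))
```

<p>In a real maintenance job you would time each batch of work against the database, call record with the elapsed time, and sleep for whatever it returns before starting the next batch.</p>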
]]></content:encoded>
			<wfw:commentRss>http://www.tomkleinpeter.com/2008/02/23/what-should-be-in-your-programming-toolbox/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
	</channel>
</rss>
