<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>tomkleinpeter.com &#187; Uncategorized</title>
	<atom:link href="http://www.tomkleinpeter.com/category/uncategorized/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.tomkleinpeter.com</link>
	<description></description>
	<lastBuildDate>Sat, 23 Jan 2010 18:32:48 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Where Are the AB Testing Frameworks?</title>
		<link>http://www.tomkleinpeter.com/2009/01/21/where-are-the-ab-testing-frameworks/</link>
		<comments>http://www.tomkleinpeter.com/2009/01/21/where-are-the-ab-testing-frameworks/#comments</comments>
		<pubDate>Wed, 21 Jan 2009 23:18:49 +0000</pubDate>
		<dc:creator>Tom</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.spiteful.com/?p=69</guid>
		<description><![CDATA[I read news.yc and reddit/programming pretty regularly to keep up with what is going on in the biz.  Based on that reading, I can probably name a dozen different systems for building high scale applications (distributed storage, message queues, caching layers, search engines, etc), but I can&#8217;t name a single AB testing framework other [...]]]></description>
			<content:encoded><![CDATA[<p>I read <a href="http://news.ycombinator.com">news.yc</a> and <a href="http://www.reddit.com/r/programming">reddit/programming</a> pretty regularly to keep up with what is going on in the biz.  Based on that reading, I can probably name a dozen different systems for building high scale applications (distributed storage, message queues, caching layers, search engines, etc), but I can&#8217;t name a single AB testing framework other than <a href="https://www.google.com/analytics/siteopt">Google Website Optimizer</a>.  That seems like a serious inversion of priorities for most startups.  Everyone with a sign up page should use AB testing.  Not everyone needs a message queue.</p>
<p>Is this because:</p>
<ul>
<li>Nobody needs anything other than Google Website Optimizer?</li>
<li>Startups don&#8217;t actually do AB testing, possibly because they don&#8217;t get enough traffic to get meaningful results, or maybe because they don&#8217;t have time?</li>
<li>AB testing (including the statistical analysis to determine if results are valid) is so simple that everyone just bangs out their own?</li>
<li>As a largely theoretical issue for most startups, scalability is more fun to talk about on the Internet?</li>
<li>Everyone that is using AB testing is so happy that they are trying to suppress information about it so their competitors don&#8217;t start doing it too?</li>
</ul>
<p>If everyone is secretly using some great framework please shoot me an email and let me know.</p>
<p>If you haven&#8217;t thought much about it before, here is <a href="http://exp-platform.com/Documents/GuideControlledExperiments.pdf">a short paper on AB testing</a> from some folks that made Amazon a ton of money.</p>
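<p>To give a sense of how small the statistical part can be, here is a minimal sketch of a two-proportion z-test for comparing the conversion rates of two page variants.  The function name and parameters are illustrative assumptions, not taken from any particular framework, and a real system would also need to handle sample-size planning and repeated peeking at results.</p>

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two_sided_p) for the difference in conversion rates.

    conv_a / conv_b are conversion counts, n_a / n_b are visitor counts.
    These names are hypothetical and exist only for this sketch.
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("both variants need at least one visitor")
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        # Nobody (or everybody) converted, so there is nothing to compare.
        return 0.0, 1.0
    z = (p_b - p_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

<p>A small p-value suggests the difference between the variants is unlikely to be noise; with low traffic the test will rarely get there, which is one of the possible explanations listed above.</p>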
]]></content:encoded>
			<wfw:commentRss>http://www.tomkleinpeter.com/2009/01/21/where-are-the-ab-testing-frameworks/feed/</wfw:commentRss>
		<slash:comments>14</slash:comments>
		</item>
		<item>
		<title>Two and a Half Months of Twitter</title>
		<link>http://www.tomkleinpeter.com/2008/09/20/two-and-a-half-months-of-twitter/</link>
		<comments>http://www.tomkleinpeter.com/2008/09/20/two-and-a-half-months-of-twitter/#comments</comments>
		<pubDate>Sat, 20 Sep 2008 20:10:44 +0000</pubDate>
		<dc:creator>Tom</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.spiteful.com/?p=68</guid>
		<description><![CDATA[After a few months of playing around with Twitter, the service is really growing on me.  The ability to have casual IM-ish conversations without any immediacy is nice.  Also, having a place to record short thoughts and interesting links that other people might like scratches some sort of itch for me.  I [...]]]></description>
			<content:encoded><![CDATA[<p>After a few months of playing around with <a href="http://www.twitter.com/tklein">Twitter</a>, the service is really growing on me.  The ability to have casual IM-ish conversations without any immediacy is nice.  Also, having a place to record short thoughts and interesting links that other people might like scratches some sort of itch for me.  I wouldn&#8217;t want to write up a whole blog post for any of these, but they were all interesting enough to post on twitter:</p>
<ul>
<li>A clever proposal from Google: <a href="http://groups.google.com/group/SDCH">Shared Dictionary Compression over HTTP</a></li>
<li><a href="http://technet.microsoft.com/en-us/sysinternals/bb897561.aspx">Cacheset</a> &#8211; a tool for clearing the Windows disk cache (useful for testing cold starts).</li>
<li>Fun fact: the Tesla Roadster carries <a href="http://www.teslamotors.com/blog4/?p=68">3 milligrams of electrons</a> when fully charged.</li>
<li>The ultimate Airplane on a Treadmill debate resource: <a href="http://www.airplaneonatreadmill.com/">www.airplaneonatreadmill.com</a></li>
<li>A 728-ton <a href="http://blog.longnow.org/2008/06/25/728-ton-pendulum/">tuned mass damper</a> in a skyscraper</li>
</ul>
<p>But, I don&#8217;t think I&#8217;ve reached the critical mass of followers necessary to really unlock the Q&#038;A potential of the site.  Having a few hundred technical folks all following each other would be a tremendously useful resource for everyone involved.  For example, I&#8217;m considering upgrading my desktop to 8 or 16GB of RAM.  I&#8217;m going to need a new motherboard, processor, and RAM.  My normal approach for this would be to spend a few hours on Newegg and the hardware review sites trying to figure out where the price/performance curve is and making sure I&#8217;m not getting ripped off.  If someone else has done this same research it would be nice to use their information as a starting point, and twitter provides the kind of free-form conversation necessary for that kind of sharing.  </p>
<p>To really make this work, you need to run one of the desktop apps so you don&#8217;t have to constantly reload the website (I use <a href="http://www.twhirl.org/">Twhirl</a>). </p>
]]></content:encoded>
			<wfw:commentRss>http://www.tomkleinpeter.com/2008/09/20/two-and-a-half-months-of-twitter/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Handling Human Error In the Datacenter</title>
		<link>http://www.tomkleinpeter.com/2008/08/11/handling-human-error-in-the-datacenter/</link>
		<comments>http://www.tomkleinpeter.com/2008/08/11/handling-human-error-in-the-datacenter/#comments</comments>
		<pubDate>Mon, 11 Aug 2008 19:21:19 +0000</pubDate>
		<dc:creator>Tom</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[startups]]></category>
		<category><![CDATA[uptime]]></category>

		<guid isPermaLink="false">http://www.spiteful.com/?p=64</guid>
		<description><![CDATA[When I was working on Live Mesh at Microsoft, I had the good fortune to meet James Hamilton.  James is full of good ideas, many of which are captured in his paper “On Designing and Deploying Internet-Scale Services.”  There is a lot of wisdom in those pages (Greg Linden had some thoughts on [...]]]></description>
			<content:encoded><![CDATA[<p>When I was working on Live Mesh at Microsoft, I had the good fortune to meet <a href="http://perspectives.mvdirona.com/">James Hamilton</a>.  James is full of good ideas, many of which are captured in his paper <a href="http://research.microsoft.com/~jamesrh/TalksAndPapers/JamesRH_Lisa.pdf">“On Designing and Deploying Internet-Scale Services.”</a>  There is a lot of wisdom in those pages (Greg Linden had <a href="http://glinden.blogspot.com/2008/03/designing-for-internet-scale.html">some thoughts on it</a>), but I’d like to focus in on this snippet in particular:</p>
<blockquote><p>Design the system to never need human interaction, but understand that rare events will occur where combined failures or unanticipated failures require human interaction. </p></blockquote>
<p>Yes, designing the system to never need human interaction is a <a href="http://www.25hoursaday.com/weblog/2008/08/11/ManagingLargeWebServerFarmsMicrosoftsAutoPilot.aspx">great ideal to shoot for</a>, but when you are working for a startup with three guys and a dozen servers, you don’t have the resources or the justification to do it from Day 1.  It is entirely likely that your business model will <a href="http://teddziuba.com/2008/04/im-going-to-scale-my-foot-up-y.html">fail</a> before you lose a single disk.  And since backend refinements don’t pay the bills at a small scale, something with a pair of hands is going to be interacting with your system until you get enough people and servers to justify more automation.  </p>
<blockquote><p>These events will happen and operator error under these circumstances is a common source of catastrophic data loss.</p></blockquote>
<p>That is a wonderfully simple and accurate summary of How Bad Things Happen in your datacenter.  It starts when you lose a hard drive or MySQL crashes, and you have to promote your slave until you can check the master tables, or anything painful but routine happens.  But then, as you are trying to fix things, you notice, for example, that you are almost out of disk space.  When you start trying to fix more than one problem under pressure, you are entering a world of pain.  </p>
<p>The big issue here, as James points out, is that you are going to do something wrong.  You’ll probably use a much stronger word than “wrong” once it is all over, but let’s settle on “stupid” for right now.  </p>
<p>It won&#8217;t feel stupid until after you hit “enter,” but when you are making unfamiliar decisions quickly under pressure, you are extremely likely to overlook something.  Maybe you won&#8217;t shut down mysql before you start a myisamchk from the shell, or maybe you&#8217;ll reverse the arguments to &#8220;tar -cvzf&#8221; and wipe out something important.  Or perhaps you&#8217;ll screw up a firewall rule and block ssh access to the machine you are frantically trying to fix.  Accidentally killing the ssh daemon is another favorite.  The point is that during a stressful situation in the datacenter, the human operator is the biggest potential source of more downtime or “catastrophic data loss.”  </p>
<p>Assuming you can’t automate everything, what can you do?  Well, the absolute best thing you can do is practice.  Corrupt some data on your dev master db, and see how long it takes you to get it restored from backups or a slave copy.  Practice what would happen if you lost a database slave and had to activate a spare machine to take its place (I hope you have at least one spare machine).  But of course, no one at a small startup has time to practice.  Maybe once you hire a full-time ops guy, it would be good to make sure he is practicing this sort of thing occasionally.  But when practicing is going to take away from writing code, you aren&#8217;t going to practice.  </p>
<p>Since you aren’t going to practice, what else can you do?  The next best thing is to cultivate the attitude that you are the most likely source of problems.  Don’t worry about hard drives, worry about bad decisions.  Develop some humility about how you expect to behave when you get woken up at 4am to fix a database the morning of your launch or when a switch fails an hour before your big demo.  From that mindset, here are a few things to do:</p>
<p><strong>Script what you can</strong><br />
Off the top of my head, a good place to start would be writing scripts for some of the steps in setting up master/slave replication and manipulating firewall traffic (allowing or blocking external traffic, for instance).  </p>
<p><strong>Use the buddy system</strong><br />
It is not a bad idea to have somebody else there looking at what you are typing, or at least on the phone confirming things verbally.</p>
<p><strong>Take your fingers off the keyboard before you hit enter</strong><br />
Are you in the right directory?  Are you on the right machine?  Are those arguments in the right order?  Can you just rename this old stuff instead of deleting it?  All of these are excellent questions to ask yourself or your coworker while you have your hands in your lap. This is also a good idea when you are doing something scary with SQL, like running any query that doesn&#8217;t have a where clause.  </p>
<p><strong>Slow things down</strong><br />
As soon as you make one mistake, no matter how minor, it is time to slow things down.  Beyond the fact that making a mistake will fluster you, making one mistake demonstrates that right now, you are likely to make mistakes.  That is a huge red flag.  At this point, the safest thing may be to accept a slightly longer downtime just so you can slow things down, get some water, and relax.  Trying to compensate for a little mistake by doing things faster can result in a much, much worse mistake.  Unless you’ve just rolled a server cage down the stairs, there is always a worse mistake you can make.</p>
<p><strong>Make it hard for people you work with to make mistakes</strong><br />
A quality server naming scheme is the easiest thing you can do here.  No colors, deities, countries, snack foods, snakes, etc.  I like $machineType-$number myself, but with distinct number ranges, even between different machine types.  So, don&#8217;t have SQL-001 and Web-001.  One day, some very sleepy datacenter employee may get things mixed up when you call and ask him to reboot Web-001.  I’m sure you’ll get an apology, but you won’t get your uptime back.  So make it harder for him to screw up: if your web machines start at Web-201, he&#8217;ll have to make two mistakes before he accidentally reboots your primary database.  </p>
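<p>A naming rule like this is easy to check mechanically.  The sketch below assumes hostnames of the form machineType-number (the regular expression and function name are illustrative, not from any existing tool) and reports any number that is shared by more than one machine.</p>

```python
import re

# Assumed hostname shape for this sketch: letters, a dash, then digits.
HOSTNAME = re.compile(r"^([A-Za-z]+)-(\d+)$")


def find_number_collisions(hostnames):
    """Return {number: [hostnames]} for numbers used by more than one host."""
    by_number = {}
    for name in hostnames:
        match = HOSTNAME.match(name)
        if match is None:
            raise ValueError(f"{name!r} does not follow machineType-number")
        # Group hosts by their numeric suffix, ignoring the machine type.
        by_number.setdefault(int(match.group(2)), []).append(name)
    return {num: names for num, names in by_number.items() if len(names) > 1}
```

<p>Running something like this over the inventory before a new machine is racked would flag SQL-001 and Web-001 as a collision, while SQL-001 and Web-201 would pass.</p>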
<p><strong>Talk about this stuff ahead of time</strong><br />
You probably have plenty of stuff to talk about at lunch with your coworkers, but here are a few conversation starters if you want to sharpen your disaster recovery skills:</p>
<ul>
<li>&#8220;What happens if we lose power to one of our racks?&#8221;</li>
<li>&#8220;How many of our switches could we lose and get the site back up?&#8221;</li>
<li>&#8220;What is the smallest amount of hardware we could lose that would knock us 100% offline?&#8221;</li>
</ul>
<p>This stuff isn’t theoretical.  I woke up at 2am one weekend during FolderShare with a ton of text messages from our cluster.  The kindly folks at the datacenter had been doing power supply maintenance.  At some point, they powered down 2 of our racks.  Then, they powered them right back up.  It wasn’t tough to fix, but it was so unexpected that it took me a few minutes to even realize what had happened.  </p>
<p><strong>Use tricks to deal with the general class of &#8220;running a command on the wrong machine&#8221; problems</strong><br />
Typing the right command on the wrong machine is obviously something to avoid.  But when you have a sea of ssh windows open, what can you do?  </p>
<ul>
<li>Use a different terminal background color for machines hosting master databases than for machines hosting slaves</li>
<li>Make sure the machine name shows up in the command prompt </li>
</ul>
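<p>In the same spirit, a maintenance script can refuse to run anywhere but the machine it was written for.  This is only a sketch of that idea: the function name and the short-name comparison are assumptions made for illustration.</p>

```python
import socket


def require_host(expected, actual=None):
    """Raise unless this script is running on the machine we meant.

    `actual` is only a parameter so the check can be exercised without
    depending on the real hostname; by default it comes from the OS.
    """
    if actual is None:
        actual = socket.gethostname()
    # Compare the short name so "sql-001.example.com" matches "SQL-001".
    short = actual.split(".")[0].lower()
    if short != expected.lower():
        raise RuntimeError(f"refusing to run: on {actual!r}, expected {expected!r}")
    return short
```

<p>Calling require_host with the intended machine name at the top of a destructive script turns &#8220;right command, wrong machine&#8221; into an immediate error instead of an outage.</p>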
<p>Does anyone else have any good ideas or horror stories to tell?  Post a comment and share your wisdom and/or pain.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.tomkleinpeter.com/2008/08/11/handling-human-error-in-the-datacenter/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>tklein on twitter</title>
		<link>http://www.tomkleinpeter.com/2008/06/30/tklein-on-twitter/</link>
		<comments>http://www.tomkleinpeter.com/2008/06/30/tklein-on-twitter/#comments</comments>
		<pubDate>Mon, 30 Jun 2008 21:03:54 +0000</pubDate>
		<dc:creator>Tom</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.spiteful.com/?p=62</guid>
		<description><![CDATA[I&#8217;m on twitter now.  Follow me at http://twitter.com/tklein
]]></description>
			<content:encoded><![CDATA[<p>I&#8217;m on twitter now.  Follow me at <a href="http://twitter.com/tklein">http://twitter.com/tklein</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.tomkleinpeter.com/2008/06/30/tklein-on-twitter/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Crashing When Something Feels Wrong</title>
		<link>http://www.tomkleinpeter.com/2008/04/14/crashing-when-something-feels-wrong/</link>
		<comments>http://www.tomkleinpeter.com/2008/04/14/crashing-when-something-feels-wrong/#comments</comments>
		<pubDate>Mon, 14 Apr 2008 19:47:37 +0000</pubDate>
		<dc:creator>Tom</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Assertions]]></category>
		<category><![CDATA[FolderShare]]></category>

		<guid isPermaLink="false">http://www.spiteful.com/?p=56</guid>
		<description><![CDATA[I’m sort of lazy, so I really like the idea of code that continually checks itself by using assertions.  I even like running production services with assertions turned on.  To be clear, I’m talking about assertions that check for actual bugs in your code – not assertions that socket() didn’t fail.  Still, [...]]]></description>
			<content:encoded><![CDATA[<p>I’m sort of lazy, so I really like the idea of code that continually checks itself by using assertions.  I even like running production services with assertions turned on.  To be clear, I’m talking about assertions that check for actual bugs in your code – not assertions that socket() didn’t fail.  Still, crashing production servers is a contentious issue, but sometimes (hopefully rarely) it is the best thing to do.  For something like FolderShare, crashing a server as soon as there is any hint of an error is vastly safer than possibly deleting someone’s files due to a bug.  Of course, this introduces the risk that you could have multiple servers fail in a short amount of time, but you need to design for that case anyway.  </p>
<p>I originally fell in love with assertions after reading Steve Maguire’s <em>Writing Solid Code</em> many years ago.  After I saw how helpful they could be, I started to structure my code to make it more “assertable.”  For example, I like state machines that use a table of valid transitions combined with assertions.  This prevents anything I don’t explicitly anticipate from happening and is really helpful in a networked, asynchronous world.  </p>
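<p>As a rough illustration of the table-driven approach, here is a minimal sketch.  The connection states and class name are hypothetical; the point is only that every legal transition is listed in one place and everything else trips an assertion.</p>

```python
# Hypothetical connection states; the table lists every transition we expect.
VALID_TRANSITIONS = {
    "disconnected": {"connecting"},
    "connecting": {"connected", "disconnected"},
    "connected": {"closing", "disconnected"},
    "closing": {"disconnected"},
}


class Connection:
    def __init__(self):
        self.state = "disconnected"

    def transition(self, new_state):
        # Anything not in the table is a bug in our code, so fail loudly.
        assert new_state in VALID_TRANSITIONS[self.state], (
            f"illegal transition {self.state} -> {new_state}"
        )
        self.state = new_state
```

<p>Note that Python strips assert statements when run with the -O flag, so a service that wants these checks in production has to run without it or raise an exception explicitly instead.</p>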
<p>But after developing a few long running services, I’ve started to perceive the need for a new type of assertion that is a bit higher level than a single conditional.  I want something that will let me crash a service (and save a memory dump) when something “feels” wrong.  For the moment, I call these “probabilistic assertions,” and I would have slept better while running FolderShare if I’d implemented them then.</p>
<p>Like all synchronization software, FolderShare had a few nightmare scenarios that I worried about all the time.  If, due to a bug, the service told every client that its files were all deleted, I’d probably have to read a lot of nasty blog posts, and I would feel like crap for a few years.  I would much rather crash all my servers and debug the problem with the system offline than take a chance on pissing off so many people.</p>
<p>Anyway, a normal assertion that looks at a single conditional wouldn’t help in this case.  Asserting that no files have been deleted when a client connects doesn’t make any sense.  But for a large enough sample size, I could assert that 80% of clients that connect shouldn’t have any files deleted.  And I could probably assert that 95% of clients that connect shouldn’t have all their files deleted.  </p>
<p>Functions like these cover most of what I’m thinking about:</p>
<p><img src="http://www.spiteful.com/wp-content/uploads/2008/04/prob_assertions.png" alt="" title="prob_assertions" width="497" height="132" class="aligncenter size-full wp-image-58" /></p>
<p>Depending on your application, these functions could be implemented either as assertions or as triggers to alert an administrator.  For alerts, you could really take this to another level by checking, for example, that incoming events are following a certain distribution.  Granted, it’s a bit over the top, but I can’t help but imagine checking if a stream of incoming events obeys a Poisson distribution or if the sizes of certain data are following a Gaussian distribution.  </p>
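<p>The functions pictured above are not reproduced here, but one way the idea could be sketched is shown below.  The class name, the sliding window, and the minimum sample size are all assumptions made for illustration, not the original signatures.</p>

```python
from collections import deque


class ProbabilisticAssertion:
    """Assert that at least `min_ok_fraction` of recent events look healthy."""

    def __init__(self, name, min_ok_fraction, window=1000, min_samples=100):
        self.name = name
        self.min_ok_fraction = min_ok_fraction
        self.min_samples = min_samples
        # Sliding window of recent outcomes; True means a healthy event.
        self.recent = deque(maxlen=window)

    def record(self, ok):
        self.recent.append(bool(ok))
        # Stay quiet until the sample is large enough to mean something.
        if len(self.recent) < self.min_samples:
            return
        ok_fraction = sum(self.recent) / len(self.recent)
        assert ok_fraction >= self.min_ok_fraction, (
            f"{self.name}: only {ok_fraction:.0%} of the last "
            f"{len(self.recent)} events were healthy"
        )
```

<p>For the deleted-files case, a server could keep one instance with a threshold of 0.80 and call record with whether the connecting client had no deleted files, and a second instance with a threshold of 0.95 for clients that did not have all of their files deleted.  Swapping the assert for a call that pages an administrator gives the alerting variant.</p>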
<p>Has anyone seen anything like this in the wild?  I’d love to hear about it.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.tomkleinpeter.com/2008/04/14/crashing-when-something-feels-wrong/feed/</wfw:commentRss>
		<slash:comments>13</slash:comments>
		</item>
		<item>
		<title>Housekeeping</title>
		<link>http://www.tomkleinpeter.com/2008/04/14/housekeeping/</link>
		<comments>http://www.tomkleinpeter.com/2008/04/14/housekeeping/#comments</comments>
		<pubDate>Mon, 14 Apr 2008 19:43:33 +0000</pubDate>
		<dc:creator>Tom</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.spiteful.com/?p=55</guid>
		<description><![CDATA[I don&#8217;t plan on having many non-technical posts here, but I&#8217;m breaking my rule today for a good reason.  I&#8217;ve got a kid now!  My first child, Margot Lee Kleinpeter, was born about 10 days ago.  Between a long, drawn out labor, a few nights on a hospital couch, and fatherhood in [...]]]></description>
			<content:encoded><![CDATA[<p>I don&#8217;t plan on having many non-technical posts here, but I&#8217;m breaking my rule today for a good reason.  I&#8217;ve got a kid now!  My first child, Margot Lee Kleinpeter, was born about 10 days ago.  Between a long, drawn out labor, a few nights on a hospital couch, and fatherhood in general, I&#8217;ve fallen a bit behind on publishing.  Much to my surprise, Margot prefers clean diapers and songs to essays on startups and programming.  But, I&#8217;ve got a new post for today and I&#8217;ll hopefully be back on a more normal schedule soon.  In the meantime, enjoy this picture of her sleeping:</p>
<p><img src="http://www.spiteful.com/wp-content/uploads/2008/04/margot.jpg" alt="" title="margot" width="500" height="333" class="aligncenter size-full wp-image-60" /></p>
]]></content:encoded>
			<wfw:commentRss>http://www.tomkleinpeter.com/2008/04/14/housekeeping/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Introduction</title>
		<link>http://www.tomkleinpeter.com/2008/02/21/introduction/</link>
		<comments>http://www.tomkleinpeter.com/2008/02/21/introduction/#comments</comments>
		<pubDate>Fri, 22 Feb 2008 04:35:16 +0000</pubDate>
		<dc:creator>Tom</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Audiogalaxy]]></category>
		<category><![CDATA[FolderShare]]></category>
		<category><![CDATA[STS]]></category>

		<guid isPermaLink="false">http://www.spiteful.com/2008/02/21/introduction/</guid>
		<description><![CDATA[I went to a meeting run by the Seattle Tech Startup folks a few weeks ago.   Even though I&#8217;m not thinking about doing another startup right now, I was glad to see the enthusiasm of all the other people who are.   Because I love seeing the new ideas that come out [...]]]></description>
			<content:encoded><![CDATA[<p>I went to a meeting run by the Seattle Tech Startup folks a few weeks ago.   Even though I&#8217;m not thinking about doing another startup right now, I was glad to see the enthusiasm of all the other people who are.   Because I love seeing the new ideas that come out of startups, I really hate seeing them fail as a result of them making the same silly mistakes.  So, the collaboration that the STS meetings and the associated mailing list promote really put a smile on my face.</p>
<p>I&#8217;ve played a big part in two successful startups, and the two had very different flavors.   Audiogalaxy was a rocket-ship ride that didn&#8217;t let up for three years until it all came crashing down due to a lawsuit.   We had traffic from the minute we turned on the Satellite, and all we had to do was scale it up as quickly as we possibly could.  FolderShare, on the other hand, sometimes felt like a continuous series of failures until we had success all at once&#8211;a glowing review from Walt Mossberg while we were negotiating our acquisition by Microsoft.  That was nice.</p>
<p>Now that I&#8217;ve got some free time, one of the goals for this blog is to reflect a little bit on my experiences and what I might change next time. Stay tuned.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.tomkleinpeter.com/2008/02/21/introduction/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
