When I was working on Live Mesh at Microsoft, I had the good fortune to meet James Hamilton. James is full of good ideas, many of which are captured in his paper “On Designing and Deploying Internet-Scale Services.” There is a lot of wisdom in those pages (Greg Linden had some thoughts on it), but I’d like to focus on this snippet in particular:
Design the system to never need human interaction, but understand that rare events will occur where combined failures or unanticipated failures require human interaction.
When you are running a distributed service in a datacenter, you encounter a lot of interesting problems. At Audiogalaxy, I ran into all the standard application-level bugs, crashes, and race conditions. Once we had a certain number of machines, we even had to deal with flaky memory, disks, and network cards. But all of that was pretty typical compared to the weirdest bug I ever had to deal with – the one that was caused by Quake III Arena.