I’m sort of lazy, so I really like the idea of code that continually checks itself by using assertions. I even like running production services with assertions turned on. To be clear, I’m talking about assertions that check for actual bugs in your code – not assertions that socket() didn’t fail. Still, crashing production servers is a contentious issue, but sometimes (hopefully rarely) it is the best thing to do. For something like FolderShare, crashing a server as soon as there is any hint of an error is vastly safer than possibly deleting someone’s files due to a bug. Of course, this introduces the risk that you could have multiple servers fail in a short amount of time, but you need to design for that case anyway.

When you are running a distributed service in a datacenter, you encounter a lot of interesting problems. At Audiogalaxy, I ran into all the standard application level bugs, crashes, and race conditions. Once we had a certain number of machines, we even had to deal with flaky memory, disks, and networking cards. But all of that was pretty typical compared to the weirdest bug I ever had to deal with – the one that was caused by Quake III Arena.