When I was working on Live Mesh at Microsoft, I had the good fortune to meet James Hamilton. James is full of good ideas, many of which are captured in his paper “On Designing and Deploying Internet-Scale Services.” There is a lot of wisdom in those pages (Greg Linden had some thoughts on it), but I’d like to focus on this snippet in particular:
Design the system to never need human interaction, but understand that rare events will occur where combined failures or unanticipated failures require human interaction.
Yes, designing the system to never need human interaction is a great ideal to shoot for, but when you are working for a startup with three guys and a dozen servers, you don’t have the resources or the justification to do it from Day 1. It is entirely likely that your business model will fail before you lose a single disk. And since backend refinements don’t pay the bills at a small scale, something with a pair of hands is going to be interacting with your system until you get enough people and servers to justify more automation.
These events will happen and operator error under these circumstances is a common source of catastrophic data loss.
That is a wonderfully simple and accurate summary of How Bad Things Happen in your datacenter. It starts when something painful but routine happens: you lose a hard drive, or MySQL crashes and you have to promote your slave until you can check the master tables. But then, as you are trying to fix things, you notice, for example, that you are almost out of disk space. When you start trying to fix more than one problem under pressure, you are entering a world of pain.
The big issue here, as James points out, is that you are going to do something wrong. You’ll probably use a much stronger word than “wrong” once it is all over, but let’s settle on “stupid” for right now.
It won’t feel stupid until after you hit “enter,” but when you are making unfamiliar decisions quickly under pressure, you are extremely likely to overlook something. Maybe you won’t shut down mysql before you start a myisamchk from the shell, or maybe you’ll reverse the arguments to “tar -cvzf” and wipe out something important. Or perhaps you’ll screw up a firewall rule and block ssh access to the machine you are frantically trying to fix. Accidentally killing the ssh daemon is another favorite. The point is that during a stressful situation in the datacenter, the human operator is the biggest potential source of more downtime or “catastrophic data loss.”
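To make the tar example concrete, here is a sketch you can run safely in a throwaway directory (all file and directory names are invented for the demonstration):

```shell
# The reversed-tar disaster, reproduced harmlessly in a temp directory.
tmp=$(mktemp -d)
cd "$tmp"
echo "precious config" > important.conf
mkdir logs
echo "log line" > logs/app.log

# Intended command: tar -czf logs-backup.tar.gz logs
# With the arguments shuffled, -f grabs important.conf as the archive
# name and silently truncates it before writing the tarball:
tar -czf important.conf logs
# important.conf is now a gzipped tar archive; the config is gone.
```

Nothing warns you; tar happily overwrites whatever file lands in the archive-name slot.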
Assuming you can’t automate everything, what can you do? Well, the absolute best thing you can do is practice. Corrupt some data on your dev master db, and see how long it takes you to get it restored from backups or a slave copy. Practice what would happen if you lost a database slave and had to activate a spare machine to take its place (I hope you have at least one spare machine). But of course, no one at a small startup has time to practice. Maybe once you hire a full-time ops guy, it would be good to make sure he is practicing this sort of thing occasionally. But when practicing is going to take away from writing code, you aren’t going to practice.
Since you aren’t going to practice, what else can you do? The next best thing is to cultivate the attitude that you are the most likely source of problems. Don’t worry about hard drives, worry about bad decisions. Develop some humility about how you expect to behave when you get woken up at 4am to fix a database the morning of your launch or when a switch fails an hour before your big demo. From that mindset, here are a few things to do:
Script what you can
Off the top of my head, a good place to start would be writing scripts for some of the steps in setting up master/slave replication and manipulating firewall traffic (allowing or blocking external traffic, for instance).
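One way to make such a script safe to write ahead of time is to have it print what it would do by default and only act when explicitly told. Here is a minimal sketch of a “block external traffic” script along those lines, assuming iptables and a public interface named eth0 (both assumptions; adjust for your setup):

```shell
#!/bin/sh
# Sketch: block external traffic, but keep ssh reachable.
# Prints commands by default; pass --apply to actually run them.
set -eu

[ "${1:-}" = "--apply" ] && APPLY=1

run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "DRY RUN: $*"
    fi
}

# Keep established connections and ssh working first, so the script
# can't lock us out of the box it is running on.
run iptables -A INPUT -i eth0 -m state --state ESTABLISHED,RELATED -j ACCEPT
run iptables -A INPUT -i eth0 -p tcp --dport 22 -j ACCEPT
# Then drop everything else arriving on the public interface.
run iptables -A INPUT -i eth0 -j DROP
```

The dry-run default means you can review the exact plan at 4am before committing to it, which is precisely when you need that review most.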
Use the buddy system
It is not a bad idea to have somebody else there looking at what you are typing, or at least on the phone confirming things verbally.
Take your fingers off the keyboard before you hit enter
Are you in the right directory? Are you on the right machine? Are those arguments in the right order? Can you just rename this old stuff instead of deleting it? All of these are excellent questions to ask yourself or your coworker while you have your hands in your lap. This is also a good idea when you are doing something scary with SQL, like running any query that doesn’t have a where clause.
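The “rename instead of delete” habit is easy enough to wrap in a helper. This is just a sketch, and the name softrm is invented for it: instead of rm -rf, it moves the target aside with a timestamp, and you delete the renamed copy later, once you are sure nothing needed it.

```shell
# softrm (hypothetical name): rename targets out of the way instead of
# deleting them. Each target gets a timestamped ".trashed" suffix.
softrm() {
    for target in "$@"; do
        mv -- "$target" "$target.trashed.$(date +%Y%m%d%H%M%S)"
    done
}
```

Running `softrm old-logs` leaves a timestamped `old-logs.trashed.…` copy sitting where the directory was, instead of destroying anything you might want back in an hour.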
Slow things down
As soon as you make one mistake, no matter how minor, it is time to slow things down. Beyond the fact that making a mistake will fluster you, making one mistake demonstrates that right now, you are likely to make mistakes. That is a huge red flag. At this point, the safest thing may be to accept a slightly longer downtime just so you can slow things down, get some water, and relax. Trying to compensate for a little mistake by doing things faster can result in a much, much worse mistake. Unless you’ve just rolled a server cage down the stairs, there is always a worse mistake you can make.
Make it hard for people you work with to make mistakes
A quality server naming scheme is the easiest thing you can do here. No colors, deities, countries, snack foods, snakes, etc. I like $machineType-$number myself, but with distinct number ranges, even between different machine types. So, don’t have SQL-001 and Web-001. One day, some very sleepy datacenter employee may get things mixed up when you call and ask him to reboot Web-001. I’m sure you’ll get an apology, but you won’t get your uptime back. So make it harder for him to screw up: if your web machines start at Web-201, he’ll have to make 2 mistakes before he accidentally reboots your primary database.
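The rule is simple enough to enforce mechanically. Here is a sketch of a validator; the types and ranges (sql = 1xx, web = 2xx) are invented for the example, and the point is only that no number is valid for two different machine types:

```shell
# valid_name (hypothetical): accept a hostname only if its number falls
# in the range reserved for its machine type.
valid_name() {
    case "$1" in
        sql-1[0-9][0-9]) return 0 ;;   # database machines: 100-199
        web-2[0-9][0-9]) return 0 ;;   # web machines: 200-299
        *)               return 1 ;;   # everything else is not a real name
    esac
}
```

With disjoint ranges, web-101 is simply not a real hostname, so a sleepy operator’s transposition has to clear two hurdles before it reaches your database.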
Talk about this stuff ahead of time
You probably have plenty of stuff to talk about at lunch with your coworkers, but here are a few conversation starters if you want to sharpen your disaster recovery skills:
- “What happens if we lose power to one of our racks?”
- “How many of our switches could we lose and get the site back up?”
- “What is the smallest amount of hardware we could lose that would knock us 100% offline?”
This stuff isn’t theoretical. I woke up at 2am one weekend during FolderShare with a ton of text messages from our cluster. The kindly folks at the datacenter had been doing power supply maintenance. At some point, they powered down 2 of our racks. Then, they powered them right back up. It wasn’t tough to fix, but it was so unexpected that it took me a few minutes to even realize what had happened.
Use tricks to deal with the general class of “running a command on the wrong machine” problem
Typing the right command on the wrong machine is obviously something to avoid. But when you have a sea of ssh windows open, what can you do?
- Use a different color background for your terminal to machines hosting master databases versus slaves
- Make sure the machine name shows up in the command prompt
Does anyone else have any good ideas or horror stories to tell? Post a comment and share your wisdom and/or pain.