Handling Human Error In the Datacenter

When I was working on Live Mesh at Microsoft, I had the good fortune to meet James Hamilton. James is full of good ideas, many of which are captured in his paper “On Designing and Deploying Internet-Scale Services.” There is a lot of wisdom in those pages (Greg Linden had some thoughts on it), but I’d like to focus on this snippet in particular:

Design the system to never need human interaction, but understand that rare events will occur where combined failures or unanticipated failures require human interaction.

Yes, designing the system to never need human interaction is a great ideal to shoot for, but when you are working for a startup with three guys and a dozen servers, you don’t have the resources or the justification to do it from Day 1. It is entirely likely that your business model will fail before you lose a single disk. And since backend refinements don’t pay the bills at a small scale, something with a pair of hands is going to be interacting with your system until you get enough people and servers to justify more automation.

These events will happen and operator error under these circumstances is a common source of catastrophic data loss.

That is a wonderfully simple and accurate summary of How Bad Things Happen in your datacenter. It starts when you lose a hard drive or MySQL crashes, and you have to promote your slave until you can check the master tables, or anything painful but routine happens. But then, as you are trying to fix things, you notice, for example, that you are almost out of disk space. When you start trying to fix more than one problem under pressure, you are entering a world of pain.

The big issue here as James points out is that you are going to do something wrong. You’ll probably use a much stronger word than “wrong” once it is all over, but let’s settle on “stupid” for right now.

It won’t feel stupid until after you hit “enter,” but when you are making unfamiliar decisions quickly under pressure, you are extremely likely to overlook something. Maybe you won’t shut down MySQL before you start a myisamchk from the shell, or maybe you’ll reverse the arguments to “tar -cvzf” and wipe out something important. Or perhaps you’ll screw up a firewall rule and block ssh access to the machine you are frantically trying to fix. Accidentally killing the ssh daemon is another favorite. The point is that during a stressful situation in the datacenter, the human operator is the biggest potential source of more downtime or “catastrophic data loss.”
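One cheap habit that guards against the reversed-tar scenario: create the archive, then list its contents and verify them before you touch the originals. A minimal sketch with made-up filenames:

```shell
# Stand-in for real data you're about to archive.
mkdir -p /tmp/demo_data
echo "important" > /tmp/demo_data/db.cnf

# Create the archive. With -c first, tar knows it is creating;
# get the flags backwards and it may happily clobber your files.
tar -czf /tmp/demo_backup.tar.gz -C /tmp demo_data

# Verify the archive actually contains what you expect
# BEFORE deleting or moving the originals.
if tar -tzf /tmp/demo_backup.tar.gz | grep -q 'demo_data/db.cnf'; then
    echo "archive verified"
else
    echo "verification failed -- do not delete originals"
fi
```

It adds ten seconds to the operation, which is a good trade against restoring from last night's backup.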

Assuming you can’t automate everything, what can you do? Well, the absolute best thing you can do is practice. Corrupt some data on your dev master db, and see how long it takes you to get it restored from backups or a slave copy. Practice what would happen if you lost a database slave and had to activate a spare machine to take its place (I hope you have at least one spare machine). But of course, no one at a small startup has time to practice. Maybe once you hire a full-time ops guy, it would be good to make sure he is practicing this sort of thing occasionally. But when practicing is going to take away from writing code, you aren’t going to practice.

Since you aren’t going to practice, what else can you do? The next best thing is to cultivate the attitude that you are the most likely source of problems. Don’t worry about hard drives, worry about bad decisions. Develop some humility about how you expect to behave when you get woken up at 4am to fix a database the morning of your launch or when a switch fails an hour before your big demo. From that mindset, here are a few things to do:

Script what you can
Off the top of my head, a good place to start would be writing scripts for some of the steps in setting up master/slave replication and manipulating firewall traffic (allowing or blocking external traffic, for instance).
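A sketch of what a scripted firewall change might look like (iptables syntax; the interface name is a placeholder). The useful part is the dry-run mode: it prints exactly what would happen so you can review the commands before running them for real.

```shell
#!/bin/sh
# block_external.sh -- sketch of a scripted firewall change.
# DRY_RUN=1 (the default here) prints each command instead of
# executing it, so you can sanity-check the change under pressure.
: "${DRY_RUN:=1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Accept ssh FIRST, so you don't lock yourself out of the very
# machine you are trying to fix, then drop everything else.
run iptables -A INPUT -i eth0 -p tcp --dport 22 -j ACCEPT
run iptables -A INPUT -i eth0 -j DROP
```

Run it once with the default dry-run, read the output, then rerun with `DRY_RUN=0` as root.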

Use the buddy system
It is not a bad idea to have somebody else there looking at what you are typing, or at least on the phone confirming things verbally.

Take your fingers off the keyboard before you hit enter
Are you in the right directory? Are you on the right machine? Are those arguments in the right order? Can you just rename this old stuff instead of deleting it? All of these are excellent questions to ask yourself or your coworker while you have your hands in your lap. This is also a good idea when you are doing something scary with SQL, like running any query that doesn’t have a where clause.
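The “rename instead of delete” question deserves a concrete habit. A sketch with throwaway paths: move the directory aside with a timestamp instead of removing it, and only delete it days later once you are sure nothing needed it.

```shell
# Stand-in for a config directory you'd otherwise `rm -rf`.
mkdir -p /tmp/demo_conf
echo "old settings" > /tmp/demo_conf/app.cnf

# Rename with a timestamp instead of deleting. The old version is
# out of the way, but still there when you discover you needed it.
stamp=$(date +%Y%m%d-%H%M%S)
mv /tmp/demo_conf "/tmp/demo_conf.old.$stamp"

# Recoverable, not gone:
ls -d /tmp/demo_conf.old.*
```

Disk is cheap; the conversation where you explain why the config is gone is not.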

Slow things down
As soon as you make one mistake, no matter how minor, it is time to slow things down. Beyond the fact that making a mistake will fluster you, making one mistake demonstrates that right now, you are likely to make mistakes. That is a huge red flag. At this point, the safest thing may be to accept a slightly longer downtime just so you can slow things down, get some water, and relax. Trying to compensate for a little mistake by doing things faster can result in a much, much worse mistake. Unless you’ve just rolled a server cage down the stairs, there is always a worse mistake you can make.

Make it hard for people you work with to make mistakes
A quality server naming scheme is the easiest thing you can do here. No colors, deities, countries, snack foods, snakes, etc. I like $machineType-$number myself, but with distinct number ranges, even between different machine types. So, don’t have SQL-001 and Web-001. One day, some very sleepy datacenter employee may get things mixed up when you call and ask him to reboot Web-001. I’m sure you’ll get an apology, but you won’t get your uptime back. So make it harder for him to screw up: if your web machines start at Web-201, he’ll have to make 2 mistakes before he accidentally reboots your primary database.

Talk about this stuff ahead of time
You probably have plenty of stuff to talk about at lunch with your coworkers, but here are a few conversation starters if you want to sharpen your disaster recovery skills:

  • “What happens if we lose power to one of our racks?”
  • “How many of our switches could we lose and get the site back up?”
  • “What is the smallest amount of hardware we could lose that would knock us 100% offline?”

This stuff isn’t theoretical. I woke up at 2am one weekend during FolderShare with a ton of text messages from our cluster. The kindly folks at the datacenter had been doing power supply maintenance. At some point, they powered down 2 of our racks. Then, they powered them right back up. It wasn’t tough to fix, but it was so unexpected that it took me a few minutes to even realize what had happened.

Use tricks to deal with the general class of “running a command on the wrong machine” problems
Typing the right command on the wrong machine is obviously something to avoid. But when you have a sea of ssh windows open, what can you do?

  • Use a different color background for your terminal to machines hosting master databases versus slaves
  • Make sure the machine name shows up in the command prompt
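Both tricks can live in each server's shell config. A sketch for a .bashrc (the `sql-` hostname pattern is an assumption; match it to your own naming scheme):

```shell
# In ~/.bashrc on each server: always show user@host in the prompt,
# and make the prompt red on database masters so those terminal
# windows stand out in a sea of ssh sessions.
case "$(hostname)" in
    sql-*) PS1='\[\e[1;31m\]\u@\h:\w\$ \[\e[0m\]' ;;  # red = be careful
    *)     PS1='\u@\h:\w\$ ' ;;
esac
```

The `\h` escape expands to the hostname, so every prompt answers “which machine am I on?” before you type anything.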

Does anyone else have any good ideas or horror stories to tell? Post a comment and share your wisdom and/or pain.

8 Responses to “Handling Human Error In the Datacenter”

  • I rock 3 monitors at work and always have screens divided mentally by importance. Screen one is live / check it all twice before hitting enter type of data. Screen two is usually secondary, test or backup servers to pay attention to what was done so it can be duplicated. Screen three is email and web research… Or depending on the situation each screen is a different geographic location. Biggest Trip Ups I have to watch out for are the “Are you in the right directory? Are you on the right machine? Right Data? Right Dates?” type stuff

    Side Note: The instructor to a CCNA class I was taking walked around during our practice tests with a cheap cell phone refrigerator magnet and simulated the “real world” by making it ring while we took the test. Damn thing stressed us out, but didn’t even come close to the owner calling you wanting to know when the database will be back up while you are sitting in the data center with a server in your lap replacing a motherboard because your contracted repairman cannot get there for a few hours.

    I love this job…

  • I don’t think you can really stress the importance of renaming files (rather than deleting them) enough.

    Whenever I’m making a huge configuration change to a live system, I copy the entire directory about to be affected… and leave the copy around. Most of the time, that copy remains untouched until I’m making another big change (at which point I’ll generally finally delete it in favor of a backup of the current live version). But maybe 1 in 3 or 1 in 4 times… that backup will save you hours of work, or more.

    One general rule of thumb of server maintenance that you didn’t mention is to always operate as a normal user until you’re absolutely sure you need to escalate privileges. And then make sure to relinquish root when you’re done with that individual task. It requires more typing, but if you can resist the urge to leave root prompts open on 5 different terminals at once… ;)

  • That’s why I use Slicehost. RAID10, nightly backups and someone else getting paged when stuff breaks.

  • This is really more about automation, but tools like Capistrano (http://capify.org) can be really helpful for managing a number of servers in different roles. I don’t have a lot of experience with this tool (it’s written in Ruby and used heavily by the Rails community), but it seems useful.

    To the extent that this reduces the number of steps required to do something (even if those steps are simple things like opening an ssh session), you can reduce mental fatigue and stay focused on getting right the things that do require some mental effort.

    Are there any other tools like this that people recommend?

  • A minor tool, but molly-guard has saved me more than once.
    Basically if you call shutdown/reboot from an ssh session, it prompts you for the hostname that you wish to shutdown, hopefully preventing a shutdown of a server that isn’t within physical reach in the middle of the night.


    Capistrano looks interesting though, I’ll have to figure out what it gets me if I’m not using RoR.

  • I’m not so sure about his `$machineType-$number` scheme. When dealing with the physical boxen, I find it a lot easier to just say “power down lucy” or “attach that FireWire drive to charlie”. CNAMEs exist so that I can map these physical names to their roles — my home directories will always live on “users”, even after I migrate them to a SAN. “www” will always be my webserver, even after I migrate it to a virtual machine.

  • I just saw him give this talk at Amazon last week. It was really really good. The best thing that I got out of it was the concept of a ‘canary host’. That’s a host that you run hotter than the rest (2x or more) so that you can see it brown out and fail before the rest of the hosts. It lets you see what the failure behavior is and gives you warning before your entire fleet of boxes are failing at once.

  • @Mathew Duafala — agreed, that is a really good idea.
