When you are running a distributed service in a datacenter, you encounter a lot of interesting problems. At Audiogalaxy, I ran into all the standard application level bugs, crashes, and race conditions. Once we had a certain number of machines, we even had to deal with flaky memory, disks, and networking cards. But all of that was pretty typical compared to the weirdest bug I ever had to deal with – the one that was caused by Quake III Arena.
Audiogalaxy had a small client that simply handled the P2P transfers and a complicated website for everything else, including account settings. One of the adjustable account settings on the website was the “max number of transfers.” To encourage users to send as much as they received, we only gave them a single number for this setting. With a value of 1, a Satellite would only send a single file at a time, but it could only download one file at a time as well.
Things were not so simple on the back-end. For better or for worse, I had designed some flexibility into the system. The max transfers value was actually stored in two columns in the Users table – MaxSend and MaxRecv. The back-end – the part that actually looked at these values when it was setting up transfers – had no idea these columns were linked. The front-end enforced what went into the database, and the back-end obeyed it. Whenever the Satellite reconnected to the cloud, our server would read the value out of the database and store it in memory for the duration of that connection.
Of course, somewhere between the front-end and the back-end sits the MySQL client, but I’ll get to that in a moment.
Quake III Arena was my game of choice at the time I worked for AG. We had a few developers that also enjoyed the game, and it was common to find people staying late on the weekend to take advantage of our nice internet connection. Unfortunately, our nice internet connection had a dozen people running our P2P music sharing client on one side, so it would periodically slow down when someone’s computer started blasting a file out at high speed. These slowdowns drove us crazy, particularly when they prevented us from using the game’s railgun effectively.
Good developers like to fix problems, and developers at startups also tend to have access to the database. So, you can probably imagine what a developer might do. And if you know a little bit about SQL, you can also imagine what might go horribly wrong. I never found out who issued the bad query, but I can just imagine how it played out:
Hey, I’ve got an idea about how we can keep the games from lagging tonight. I’ll just block everyone in the office from sending files. One simple ‘Update Users set MaxSend = 0’ and we should be good to go for the evening… Why is that query taking so long? Uh oh…
SQL is good for a lot of things, but I’ve always marveled at how easy it is to destroy an entire table simply by forgetting a where clause. And thus, in a few short minutes, every one of our 30 million users had a subtle change applied to their accounts. Did I mention that the single value we displayed on the website for this setting came from the MaxRecv column? Whoops…
Monitoring the health of the system was one of my jobs, so I kept a close eye on my graph of the “current transfer rate.” Ultimately, most problems in the system resulted in fewer files being transferred, so the global transfer rate was a good proxy for the health of the system.
Every day of the week plotted a unique and predictable curve that I knew by heart, and so it didn’t take me long to realize that something was wrong. Transfer rates were dropping. But why? I called our ISP and asked if they knew of any problems with the Internet. Nope. We had exactly the right number of clients connected. No one had trenched over a fiber optic cable in the middle of nowhere. Requests were coming into the system at the normal rate; they just weren’t getting fulfilled. Microsoft hadn’t pushed any patches out that might have firewalled off half the world.
Clients generally stayed connected for days or weeks at a time. As they gradually reconnected, more and more of the network got their new MaxSend setting and dutifully started not sending anything. Users weren’t complaining – it was perfectly normal for rare songs to be inaccessible, and nobody noticed if his client just wasn’t sending anything.
After tearing my hair out for a day or so about this, I finally realized I was seeing a lot more “client busy – no free slots” type messages than I usually did while tail -f’ing the log files. Digging into that, I noticed some other funny messages, and eventually I was staring in shock at the results of a “select MaxSend, MaxRecv from Users limit 1000.”
Fixing the problem was easy enough: “Update Users set MaxSend = MaxRecv,” but you can imagine I spent quite some time staring at that query before issuing it.
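The whole failure mode is easy to reproduce in miniature. Here is a sketch using Python’s sqlite3 (the table and column names come from the story; the user data is made up) showing how the missing where clause clobbers every row, and why the repair query works – the website only ever displayed MaxRecv, so that column still held the value each user chose:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE Users (UserID INTEGER PRIMARY KEY, "
    "MaxSend INTEGER, MaxRecv INTEGER)"
)
# The front-end kept these two columns in lockstep: MaxSend == MaxRecv.
cur.executemany(
    "INSERT INTO Users (MaxSend, MaxRecv) VALUES (?, ?)",
    [(1, 1), (3, 3), (5, 5)],
)

# The fateful query -- note the missing "WHERE UserID IN (...)" clause.
cur.execute("UPDATE Users SET MaxSend = 0")
print(cur.execute("SELECT MaxSend, MaxRecv FROM Users").fetchall())
# -> [(0, 1), (0, 3), (0, 5)]: every user can receive but never send.

# The repair: MaxRecv was untouched, so MaxSend can be rebuilt from it.
cur.execute("UPDATE Users SET MaxSend = MaxRecv")
print(cur.execute("SELECT MaxSend, MaxRecv FROM Users").fetchall())
# -> [(1, 1), (3, 3), (5, 5)]
```

The repair only works because the bad query happened to hit the column the website didn’t display; had it been “set MaxRecv = 0” instead, there would have been nothing left to rebuild from.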
So what’s the moral of the story? Don’t let your developers have access to the production database? Maybe, but that isn’t practical for a small startup. Better logging? That certainly could help. Force everyone to access the database using the --i-am-a-dummy flag for MySQL? That is not a bad idea and will get you some of the way there, but a shoddily written script can do exactly the same kind of damage. Backups? Sure, we had backups, but we were adding customers so quickly that restoring data more than a few hours old would have pissed off many thousands of people. An Admin class of users, with configurable policy that prevented them from sending files between 7pm and 3am on weekends? Yeah, right.
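For reference, the MySQL client’s --i-am-a-dummy flag is an alias for --safe-updates, which turns on the session mode sketched below; with it enabled, the server rejects UPDATE and DELETE statements that have no key-based where clause or limit (a sketch of a MySQL session, not a standalone script):

```sql
-- Equivalent to starting the client with --i-am-a-dummy / --safe-updates.
SET sql_safe_updates = 1;

-- This now fails instead of rewriting every row in the table:
UPDATE Users SET MaxSend = 0;
-- ERROR 1175 (HY000): You are using safe update mode and you tried to
-- update a table without a WHERE that uses a KEY column ...
```

As the paragraph above notes, this only protects interactive sessions that remember to use the flag – a script connecting on its own is free to do the same damage.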
If you run a big and complicated system, problems you will never predict are going to happen and cause your system to do impossibly weird things. You must invest in tools to give you visibility into your system. My transfer rate graph was the only reason I was even able to go looking for a problem. I knew something was wrong, and it was just a matter of digging until I found it. Let your admins see into the system (specifically – how the system is behaving right now) so that they can develop intuition about what it should look like. Finding a bug in production is never fun. But it is going to happen, and it is always better if you find it before your users do.