Things That Are Important: Where Clauses

quake-3-bones.jpgWhen you are running a distributed service in a datacenter, you encounter a lot of interesting problems. At Audiogalaxy, I ran into all the standard application level bugs, crashes, and race conditions. Once we had a certain number of machines, we even had to deal with flaky memory, disks, and networking cards. But all of that was pretty typical compared to the weirdest bug I ever had to deal with – the one that was caused by Quake III Arena.

Audiogalaxy had a small client that simply handled the P2P transfers and a complicated website for everything else, including account settings. One of the adjustable account settings on the website was the “max number of transfers.” To encourage users to send as much as they received, we only gave them a single number for this setting. With a value of 1, a Satellite would only send a single file at a time, but it could only download one file a time as well.

Things were not so simple on the back-end. For better or for worse, I had designed some flexibility into the system. The max transfers value was actually stored in two columns in the Users table – MaxSend and MaxRecv. The back-end – the part that actually looked at these values when it was setting up transfers–had no idea these columns were linked. The front-end enforced what went into the database, and the back-end obeyed it. Whenever the Satellite reconnected to the cloud, our server would read the value out of the database and store it in memory for the duration of that connection.

Of course, somewhere between the frontend and the backend is mysqlclient, but I’ll get to that in a moment.

Quake III Arena was my game of choice at the time I worked for AG. We had a few developers that also enjoyed the game, and it was common to find people staying late on the weekend to take advantage of our nice internet connection. Unfortunately, our nice internet connection had a dozen people running our p2p music sharing client on one side, so it would periodically slow down when someone’s computer started blasting a file out at high speed. These slowdowns drove us crazy, particularly when they prevented us from using the game’s rail gun effectively.

Good developers like to fix problems, and developers at startups also tend to have access to the database. So, you can probably imagine what a developer might do. And if you know a little bit about SQL, you can also imagine what might go horribly wrong. I never found out who issued the bad query, but I can just imagine how it played out:

Hey, I’ve got an idea about how we can keep the games from lagging tonight. I’ll just block everyone in the office from sending files. One simple ‘Update Users set MaxSend = 0’ and we should be good to go for the evening… Why is that query taking so long? Uh oh…

SQL is good for a lot of things, but I’ve always marveled at how easy it is to destroy an entire table simply by forgetting a where clause. And thus, in a few short minutes, every one of our 30 million users had a subtle change applied to their accounts. Did I mention that the single value we displayed on the website for this setting came from the MaxRecv column? Whoops…

Monitoring the health of the system was one of my jobs, so I kept a close eye on my graph of the “current transfer rate.” Ultimately, most problems in the system resulted in less files getting transferred, so the global transfer rate was a good proxy for the health of the system.

Every day of the week plotted a unique and predictable curve that I knew by heart, and so it didn’t take me long to realize that something was wrong. Transfer rates were dropping. But why? I called our ISP and asked if they knew of any problems with the Internet. Nope. We had exactly the right number of clients connected. No one had trenched over a fiber optic cable in the middle of nowhere. Requests were coming into the system at the normal rate; they just weren’t getting fulfilled. Microsoft hadn’t pushed any patches out that might have firewalled off half the world.

Clients generally stayed connected for days or weeks at a time. As they gradually reconnected, more and more of the network got their new MaxSend setting and dutifully started not sending anything. Users weren’t complaining – it was perfectly normal for rare songs to be inaccessible, and nobody noticed if his client just wasn’t sending anything.

After tearing my hair out for a day or so about this, I finally realized I was seeing a lot more “client busy – no free slots” type messages than I usually did while tail –f’ing the log files. Digging into that, I noticed some other funny messages, and eventually I was staring in shock at the results of a “select MaxSend, MaxRecv from Users limit 1000.”

Fixing the problem was easy enough: “Update Users set MaxSend = MaxRecv,” but you can imagine I spent quite some time staring at that query before issuing it.

mysql-i-am-a-dummySo what’s the moral of the story? Don’t let your developers have access to the production database? Maybe, but that isn’t practical for a small startup. Better logging? That certainly could help. Force everyone to access the database using the –i-am-a-dummy flag for MySQL? That is not a bad idea and will get you some of the way there, but a shoddily written script can do exactly the same kind of damage. Backups? Sure, we had backups, but we were adding customers so quickly that restoring data more than a few hours old would have pissed off many thousands of people. An Admin class of users, with configurable policy that prevented them from sending files between 7pm and 3am on weekends? Yeah, right.

If you run a big and complicated system, problems you will never predict are going to happen and cause your system to do impossibly weird things you don’t expect. You must invest in tools to give you visibility into your system. My transfer rate graph was the only reason I was even able to go looking for a problem. I knew something was wrong, and it was just a matter of digging until I found it. Let your admins see into the system (specifically – how the system is behaving right now) so that they can develop intuition about what it should look like. Finding a bug in production is never fun. But it is going to happen, and it is always better if you find it before your users do.

14 Responses to “Things That Are Important: Where Clauses”

  • That is one irresponsible developer. I would have tracked him down and have him fired for that. Not for making the mess in the first place, mind you, but for not cleaning it up afterwards.

  • I’ve really been enjoying reading about your experiences at AudioGalaxy – keep it up!

  • I miss Audiogalaxy! It had a volume of a cappella music I haven’t been able to find since :(
    Cool story by the way.

  • great article! thanks for taking the time!

  • Jolly good story!

    I quite agree that it was careless (irresponsible?) of the unknown developer for making that screwup in the first place, but to not ‘fess up and fix it afterwards? (you: “The system’s going to shit. What changed today?” them: “sound of crickets chirping and wind blowing”) Totally unacceptable.

    Like most devs, I don’t mind so much when somebody screws up (everybody does) but let me tell you what, not admitting to it (or worse yet, covering it up) makes the situation 1000% worse because it delays the solution and damages trust which is the foundation of any working team.

    Thanks again for sharing an Audiogalaxy “war story.”

  • Thinking about it some more, I’m guessing that they ctrl-c’d the query pretty quickly and thought that would prevent the update. We had a good group of folks working there, and I’m pretty sure that if they knew what had happened, they would have let me know. Oh well — it makes for a good story. :)

  • This is also another case where forcing people to use transactions acts as a sort of insurance policy.

    > BEGIN

    > UPDATE [something dumb goes here]
    “Why’s this taking… oh damn!”

  • “Don’t let your developers have access to the production database? Maybe, but that isn’t practical for a small startup.”

    Why not? That’s like saying that sysadmins should be dabbling in the accounting. Developers should NOT touch the production database. It’s called production for a reason. If they want to look at the data, you can / should have a nightly snapshot back onto a performance database or some such they can look at. Yes, even for a small startup.

  • @wac: Yes, transactions would have been nice. But of course, we were on MyISAM. More of a disk backed hashtable than a database, really.

    @matt: I guess I should say that for us, it wouldn’t have been practical. Mirroring our production DB would have required about 15 machines, which we simply couldn’t afford. We did have a relatively empty dev database that we used as much as we could, and we certainly didn’t encourage folks to hit the production machines, but sometimes it made life easier.

  • What about the obvious moral? You don’t make changes to production databases unless you are fulfilling a business requirement. Temporary hacks are bad enough, but temporary hacks so you can get your game on? Impressively unprofessional.

    I like a good game of Q3 as much as the next guy, but there are some things you just don’t do!

  • @Chuck: That is certainly a lesson I hoped the responsible developer learned! My point is that from a higher level you need good tools to detect weird problems with your cluster. It may be a careless developer, or it may be a hacker, or a bad script, or a faulty switch, or something entirely unexpected.

  • I remember forgetting the where clause and accidentally blowing away the Audiogalaxy user data table. Luckily we had a backup. That was when there were only like 3000 rows in that table.

    Matt, don’t forget we were partying like it was 1999. Since it was, you know, 1999.

Leave a Reply