This is the second article in my three-part history of building Audiogalaxy.com. You should probably read the first one first.
I came back from my Christmas break feeling less burnt out. I focused on designing a backend that could handle 100,000 simultaneous Satellites and then started building it. To free me up from working on the client, Michael bought a copy of the Stevens networking book and started working on a C version of the Satellite core. And David hired Kennon Ballou to help build the next-generation web interface.
The new backend and client went live in April, using my humble website from V1. Traffic started growing steadily, and by the end of May, we had about 3,000 clients connected at peak times. Sometime around the end of July, there was a Napster injunction scare, which pushed us over the 8,000 mark. We released version 0.6 of the client and David’s beautiful new website in September. At that point, our peak load started increasing by thousands of users every week.
In the way that only 22-year-old unattached guys can do, I revolved my life around work. The kitchen was full of free food, and if I didn’t eat there, I ate lunch and dinner with Michael or anyone else who happened to be around. I went to the gym in the evening with Michael and Geoff and then back to the office. When things got too intense, the distractions of 6th Street were a short walk away. I remember going back to the office to work on low-priority stuff while I sobered up before driving home. At my apartment, my furniture was limited to a twin mattress, a chair, and a folding table for my computer. Every week we had more users, and every time something in the cloud melted down and we had to scramble to fix it, our crew became a little bit more tightly knit.
This was the year that we figured out how to manage a huge website with a tiny team. Geoff handled the hardware, I handled any services I had written and did a little bit of MySQL work, and Michael did everything else: he built Linux kernels, handled Apache and the load balancers, and dealt with any tricky aspects of MySQL. We had a set of internal pages for monitoring the site, and we checked them as soon as we got out of bed, reflexively throughout the day, and right before going to sleep. We all had weird schedules so we managed to cover a good chunk of every day, and it was always fair game to wake someone else up at 4am to diagnose a problem. This wasn’t really designed or planned – it just happened because it needed to be done.
The excitement about how quickly we were scaling up was tempered in December. We had an amazing advertising contract for a flat $5 CPM. Eventually, they realized how much money they owed us and decided to just not pay us. Michael had to let a few reviewers go and borrow $75,000 to make payroll. We all got a lesson about how low the startup roller coaster can go.
Thanks to the lessons we learned from V1, the hardware, rather than the software, became the bottleneck for V2 of the Satellite service. We quickly surpassed our goal of 100,000 simultaneous users and realized that target was missing at least one zero. We watched graphs of how many users were online, but instead of seeing beautiful diurnal curves, we saw harsh, flat caps as we bumped into capacity limits. Due to limitations of Java at the time, each server could only handle 2,000 simultaneous users, and our users let us know this was a problem. Michael kept a fax hanging in his office that some user had sent in with “Buy A New Server!” scrawled across it.
We definitely tried. And we definitely bought a lot of servers. Every few weeks we would get a batch of parts delivered to the datacenter. Geoff would drive out and spend 24 highly caffeinated hours assembling and racking 20 new servers with a manual screwdriver that often left his hands bloody. But it seemed like we could never keep up. We filled out one row at the datacenter and started working on a second.
In the fall of 2001, I read about /dev/epoll and decided it was the key to a vastly more scalable backend written in C. We hired Steven Hazel, a talented and practical developer from the FreeNet project, to help. I knew we had the right guy when, during our first meeting, he asked how we were going to avoid Second System Syndrome during the port.
Despite the pain of converting a lot of the string processing we did from Java to C, testing the new version was pretty easy. We loaded the service down with assertions and rigid state machines, and ran a single instance in the cluster. The flood of traffic would initially drive the process into an assertion within seconds.
As we worked through the bugs, the uptime turned into minutes, then hours, and eventually the service virtually never crashed. With hundreds of instances deployed, we got so much traffic that we were able to remove all the bugs we were likely to run into. We had one or two machines that would crash every month or two with inscrutable core files. Because it was always the same machine, I eventually attributed this to faulty memory. The idea that you could write software that was more reliable than hardware was fascinating to me.
In fact, almost everything about the scale of the software fascinated me. I found that a system with hundreds of thousands of clients and thousands of events per second behaved like a physical machine built out of springs and weights. If one server process stalled for a moment, effects could ripple throughout the cluster. Sometimes, it seemed like there were physical oscillations – huge bursts of traffic would cause everything to slow down and back off, and then things would recover enough to trigger another burst of load. I had never even imagined that these sorts of problems existed in the software world, and I found myself wishing I had taken control theory more seriously in college.
Keeping up with the traffic at this time was difficult, but in retrospect, it was really a lot of fun. I had graduated from UT in December of 2000 and moved downtown within walking distance of both 6th Street and the office. I spent the summer on a completely nocturnal cycle, partially because of the Texas heat, but mainly because restarting services was easier at 3 in the morning. I was tired of staying up late to deploy new code, so I just changed my schedule. Audiogalaxy users had led me to a set of live trance mixes from clubs in Europe which turned me into a diehard electronica fan, and driving around Texas to catch DJs on the weekend was much easier if staying up until 8am was normal. I bought some turntables and a lot of vinyl. And a couch. The light in my apartment when I got home in the morning was very lovely.
Audiogalaxy had gotten big enough at this point that it was not uncommon to overhear strangers talking about it, which was one of the best perks of working at the company. The community-based features of the site had also really taken off, and eighty thousand messages were posted per day at our peak. A couple met on our message boards and got married shortly after that. A community of several hundred spinning instructors formed to share ideas about music selection during class. Several years later, a die-hard message board fan got the AG logo tattooed onto his leg.
Life was hectic but good. We were having so much success that it almost felt like something bad had to happen.
Continued in Part 3