I really enjoy writing the code that makes systems like Audiogalaxy and FolderShare run. Getting into the zone and doing some really good work is a great experience, but it remains my second favorite aspect of the job. For me, the best part is the design phase before the real coding starts. At that point, everything is totally fluid and malleable. I’m making the decisions that I’m going to live with for the next few years, and putting some extra cleverness or flexibility into the system can have huge payoffs.
Something I’ve found very helpful at this phase is a “programming toolbox” — a simple list of good ideas and approaches to different problems. When I’m stuck on a problem, or trying to generate a new approach to something, it can be helpful to flip through the list. Most of the ideas won’t apply, but sometimes it will spark something novel. To keep this list from getting too unwieldy, I’ll post a few at a time as I write them up.
Let’s start with one of my favorite data structures. Wikipedia has all the details, but here are the key points you need to know. Bloom filters are a “space-efficient probabilistic data structure” that:
- Will answer either “maybe” or “no” to the question, “Have you seen item X?”
- Don’t store the actual item
- Are incredibly space efficient — it takes about 10 bits per stored item to have a 1% error rate
- Are tunable — the error rate drops 10x for every extra 5 bits per item you use
Just as an example, let’s imagine you want to cache copies of web pages on your disk. Before taking the hit to access the disk to check if something is cached, you could use a Bloom filter to keep track of what you have. Storing 100K URLs in a Bloom filter would use just 125KB of memory. By contrast, if you stored UTF-8 versions of them in a hash table, assuming URLs are about 50 characters long, you would need about 4.7MB of memory, not even counting hash table overhead. What is the downside? About 1% of the time, you will get an incorrect “maybe” from the Bloom filter and will go to disk looking for something that does not exist. If 1% is too much, you can double the memory usage to 250KB and the error rate drops to 0.01%. Note that if the filter tells you “no”, there is a 0% chance that the URL is in the system.
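To make the idea concrete, here is a minimal Python sketch of a Bloom filter. It uses the standard formulas for sizing the bit array and picking the number of hashes, and derives the hash functions from a single SHA-256 digest via double hashing; the class and method names are my own, not from any particular library.

```python
import hashlib
import math

class BloomFilter:
    """A minimal Bloom filter sketch (illustrative, not production code)."""

    def __init__(self, capacity, error_rate=0.01):
        # Optimal bit count: m = -n * ln(p) / (ln 2)^2 -- about 9.6 bits
        # per item for a 1% error rate, matching the numbers above.
        self.num_bits = max(1, int(-capacity * math.log(error_rate) / (math.log(2) ** 2)))
        # Optimal number of hash functions: k = (m / n) * ln 2
        self.num_hashes = max(1, round((self.num_bits / capacity) * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _indexes(self, item):
        # Derive k hash values from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item):
        for idx in self._indexes(item):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def might_contain(self, item):
        # True means "maybe"; False is a definite "no".
        return all(self.bits[idx // 8] & (1 << (idx % 8))
                   for idx in self._indexes(item))
```

Note that the filter never stores the URLs themselves, only bits — which is exactly why it is so compact, and why it can only answer “maybe” or “no”.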
Are you defining any sort of message passing layer in your system? One thing you might want to consider is allowing unspecified header/value pairs to be added to the messages at any point later. Anything reading a message should simply ignore any headers it doesn’t recognize. This flexibility is baked into HTTP and has allowed all sorts of wonderful extensions beyond the basic HTTP spec. Adding this flexibility to your message layer from the beginning is trivial and it might let you solve a difficult problem later without completely upgrading all of your services.
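Here is a rough sketch of what that looks like in Python, using a made-up HTTP-style text format (the function names and header names are hypothetical). The key line is the one where the decoder silently skips headers it does not recognize, rather than treating them as errors:

```python
def encode_message(msg_type, headers):
    """Serialize a message as a type line followed by 'Key: Value' lines."""
    lines = [msg_type]
    lines += ["%s: %s" % (key, value) for key, value in headers.items()]
    return "\n".join(lines) + "\n\n"

def decode_message(raw, known_headers):
    """Parse a message, keeping known headers and ignoring the rest."""
    type_line, *header_lines = raw.strip().split("\n")
    headers = {}
    for line in header_lines:
        key, _, value = line.partition(": ")
        if key in known_headers:
            headers[key] = value  # unknown headers are skipped, not rejected
    return type_line, headers
```

An old service parsing with `known_headers={"Content-Length"}` will happily interoperate with a newer sender that also attaches an experimental header — which is exactly the forward compatibility the paragraph above describes.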
Officially “Progress-based regulation of low-importance processes”, but known as MS Manners in this paper out of MS Research. The basic idea is that, if a few conditions are met, a background process can easily detect its impact on the user (or a more important process). When there is some contention for a resource, the background process can establish a baseline for how long its work should take. If it detects that recent work is taking longer, it can infer that it is impacting the more important process and wait until later to proceed. It sounds pretty simple, but the researchers have done the hard work in the paper and shown how to squeeze out the most gains. I think this is a powerful idea worth incorporating into any sort of maintenance job that runs against a production database.
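The core feedback loop can be sketched in a few lines of Python. This is my own simplification, not the algorithm from the paper: it treats the fastest run observed so far as the uncontended baseline, and backs off whenever a recent work item takes much longer than that.

```python
import time

def run_background_work(work_items, do_work,
                        backoff_seconds=1.0, slowdown_factor=2.0):
    """Sketch of progress-based self-regulation for a background job.

    The fastest observed run approximates the uncontended cost of one
    work item; if an item takes much longer, assume a more important
    process needs the resource and pause. Returns the backoff count.
    """
    baseline = None
    backoffs = 0
    for item in work_items:
        start = time.monotonic()
        do_work(item)
        elapsed = time.monotonic() - start
        if baseline is None or elapsed < baseline:
            baseline = elapsed  # new best time becomes the baseline
        elif elapsed > baseline * slowdown_factor:
            backoffs += 1
            time.sleep(backoff_seconds)  # yield the resource for a while
    return backoffs
```

For a maintenance job hammering a production database, `do_work` might process one batch of rows; when batches start taking much longer than the baseline, the job gets out of the way instead of piling on.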