Following up on my first post, What should be in your programming toolbox, here are a few more ideas from my list:
Wikipedia has more details, but essentially, an unrolled linked list is great because it gives you better performance than a regular linked list: it’s more cache friendly and carries much less overhead per element. The basic idea is to store an array of elements at each node rather than a single one. This keeps your elements closer together in memory, which will make your cache happy when you are iterating through your items.
To compare the overhead for an unrolled list, think for a second about a regular linked list node. It probably looks something like this:
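In C, such a node might be sketched like this (a hypothetical layout, assuming the payload is itself a pointer; the names are my own):

```c
#include <stdlib.h>

/* A classic singly linked list node: one payload pointer plus one next
 * pointer.  With 4-byte pointers that's 8 bytes per node, before the
 * allocator adds its own bookkeeping on top. */
struct node {
    void        *data;   /* the item itself        */
    struct node *next;   /* link to the next node  */
};

/* Push an item onto the front of the list; returns the new head. */
static struct node *node_push(struct node *head, void *data)
{
    struct node *n = malloc(sizeof *n);
    n->data = data;
    n->next = head;
    return n;
}
```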
Assuming you’ve got 4-byte pointers, each node is going to take 8 bytes. But the allocation overhead for the node could be anywhere between 8 and 16 bytes. Let’s go with the best case and assume it will be 8 bytes. That’s 16 bytes per node, so if you want to store 1K items in this list, you are going to burn 16KB.
Now, let’s think about an unrolled linked list node. It will look something like this:
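A sketch in C, again assuming 4-byte pointers and 4-byte ints (the constructor is my own addition):

```c
#include <stdlib.h>

/* An unrolled list node: the elements live in a separately allocated
 * array rather than one per node.  With 4-byte pointers and ints this
 * struct is 12 bytes, plus whatever the array itself costs. */
struct unode {
    struct unode  *next;    /* link to the next node      */
    int            count;   /* how many slots are in use  */
    void         **items;   /* the element array          */
};

/* Allocate an empty node with room for `capacity` elements. */
static struct unode *unode_new(int capacity)
{
    struct unode *n = malloc(sizeof *n);
    n->next  = NULL;
    n->count = 0;
    n->items = malloc(capacity * sizeof *n->items);
    return n;
}
```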
Therefore, allocating a single node (12 bytes + 8 bytes of overhead) with an array of 100 elements (400 bytes + 8 bytes of overhead) will now cost 428 bytes, or 4.28 bytes per element. Thinking about our 1K items from above, it would take about 4.28KB, which is close to 4x better than our original list. Even if the list becomes severely fragmented and the item arrays are only half full on average, this is still an improvement. Also, note that you can tune the array size to whatever gets you the best overhead for your application.
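A tiny helper makes it easy to play with that tuning knob (same assumptions as above: 4-byte slots, a 12-byte node, and 8 bytes of allocator overhead per allocation):

```c
/* Per-element cost of an unrolled list for a given array capacity:
 * one 12-byte node + 8 bytes of allocator overhead, plus a
 * capacity-sized array of 4-byte slots + its own 8 bytes of overhead. */
static double bytes_per_element(int capacity)
{
    double node  = 12.0 + 8.0;            /* node struct + alloc overhead */
    double array = 4.0 * capacity + 8.0;  /* item array + alloc overhead  */
    return (node + array) / capacity;
}
```

With a capacity of 100 this reproduces the 4.28 bytes-per-element figure above, and you can see that pushing the capacity higher keeps shrinking the cost toward the raw 4 bytes per item.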
I’ve used unrolled linked lists before, but I’ve never used this slightly more exotic structure, the van Emde Boas tree. I’m just very fond of low overhead data structures that are built for storing large amounts of information, and this is a clever one. VEB trees have good overhead characteristics, but they are also particularly fast: they implement all operations in O(log m), where m is the number of bits in the key. So, if you use a key length that just fits your number of items, you can do everything in O(log log n), which is obviously an improvement over your standard O(log n) binary tree. The Wikipedia page has more details, but sadly, I haven’t been able to find any sort of library implementation.
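The trick behind that bound is that a VEB tree recurses on half-length keys: the high bits of a key pick a cluster, the low bits index within it, so the key length halves at every level. A minimal sketch of just that decomposition (not a full tree implementation):

```c
#include <stdint.h>

/* Split an m-bit key into its two halves.  The high half selects one
 * of 2^(m/2) clusters; the low half is the position inside it.
 * Halving m at each step is what turns the usual O(log n) into
 * O(log m) = O(log log n). */
static uint32_t veb_high(uint32_t key, int m) { return key >> (m / 2); }
static uint32_t veb_low(uint32_t key, int m)  { return key & ((1u << (m / 2)) - 1); }
```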
If you want to be extra cautious about removing items from a database or you just want to support undelete, consider using Soft Deletes for your databases. Just add a “TimeDeleted” column and consider the row deleted when the time is non-zero. You can use a cleanup script to actually remove the row after some amount of time has passed.
This technique isn’t without some hassles, though — you may have to add some extra logic or include the TimeDeleted column as part of a unique index if you want to allow the user to insert a duplicate item back into the table.
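Concretely, the whole scheme might look like this in SQL (SQLite flavor; the table and column names are illustrative):

```sql
-- Soft delete: rows are never removed right away, just stamped.
ALTER TABLE items ADD COLUMN TimeDeleted INTEGER NOT NULL DEFAULT 0;

-- "Deleting" an item stamps it with the current time.
UPDATE items SET TimeDeleted = strftime('%s', 'now') WHERE id = 42;

-- Live rows are the ones that were never stamped.
SELECT * FROM items WHERE TimeDeleted = 0;

-- Making TimeDeleted part of the unique key lets a deleted name
-- be reinserted without colliding with its soft-deleted ghost.
CREATE UNIQUE INDEX items_by_name ON items (name, TimeDeleted);

-- Cleanup script: actually remove rows deleted more than 30 days ago.
DELETE FROM items
 WHERE TimeDeleted <> 0
   AND TimeDeleted < strftime('%s', 'now') - 30 * 24 * 60 * 60;
```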
I’m not sure if I would ever actually use skip lists because they seem like a cache unfriendly data structure. However, I still like the idea. In a nutshell, you start with a normal, sorted linked list. Then, on some nodes, you add pointers that let you skip forward more than one node at a time. With a proper distribution of those extra pointers, searching, inserting, and deleting all run in expected O(log n) time, much better than the O(n) of a plain list.
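A minimal sketch in C of how that works, with a fixed maximum level and the usual coin-flip promotion (all names are my own; a real implementation would also handle deletion and memory cleanup):

```c
#include <stdlib.h>

#define MAX_LEVEL 8

/* Each node carries one forward pointer per level; higher levels
 * skip over more nodes. */
struct snode {
    int key;
    struct snode *forward[MAX_LEVEL];
};

struct slist {
    struct snode head;   /* sentinel node; its key is never compared */
    int level;           /* highest level currently in use           */
};

static void slist_init(struct slist *l)
{
    for (int i = 0; i < MAX_LEVEL; i++)
        l->head.forward[i] = NULL;
    l->level = 1;
}

/* Promote to each extra level with probability 1/2. */
static int random_level(void)
{
    int lvl = 1;
    while (lvl < MAX_LEVEL && (rand() & 1))
        lvl++;
    return lvl;
}

static void slist_insert(struct slist *l, int key)
{
    struct snode *update[MAX_LEVEL];
    struct snode *x = &l->head;

    /* Walk right on each level, remembering where we drop down. */
    for (int i = l->level - 1; i >= 0; i--) {
        while (x->forward[i] && x->forward[i]->key < key)
            x = x->forward[i];
        update[i] = x;
    }

    int lvl = random_level();
    if (lvl > l->level) {
        for (int i = l->level; i < lvl; i++)
            update[i] = &l->head;
        l->level = lvl;
    }

    struct snode *n = malloc(sizeof *n);
    n->key = key;
    for (int i = 0; i < lvl; i++) {
        n->forward[i] = update[i]->forward[i];
        update[i]->forward[i] = n;
    }
    for (int i = lvl; i < MAX_LEVEL; i++)
        n->forward[i] = NULL;
}

static int slist_contains(const struct slist *l, int key)
{
    const struct snode *x = &l->head;
    for (int i = l->level - 1; i >= 0; i--)
        while (x->forward[i] && x->forward[i]->key < key)
            x = x->forward[i];
    x = x->forward[0];
    return x && x->key == key;
}
```

The search loop is the whole point: starting at the top level, it moves right as far as it can before dropping down a level, so most of the list is never touched.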
If you need to store structured information on disk, you need to have a really good reason to implement your own storage engine. Using Berkeley DB gets you a simple, well-known, and well-debugged disk-backed store without a lot of overhead. Plus, you get things like hot backups and locking.
Note that Amazon is using Berkeley DB as one of the underlying stores for Dynamo. If you haven’t read the Dynamo paper (or for that matter, the BigTable and GFS papers), you really should. I didn’t mention it before, but BigTable uses one of the ideas from my last post, Bloom Filters.