Crashing When Something Feels Wrong

I’m sort of lazy, so I really like the idea of code that continually checks itself by using assertions. I even like running production services with assertions turned on. To be clear, I’m talking about assertions that check for actual bugs in your code – not assertions that socket() didn’t fail. Still, crashing production servers is a contentious issue, but sometimes (hopefully rarely) it is the best thing to do. For something like FolderShare, crashing a server as soon as there is any hint of an error is vastly safer than possibly deleting someone’s files due to a bug. Of course, this introduces the risk that you could have multiple servers fail in a short amount of time, but you need to design for that case anyway.

I originally fell in love with assertions after reading Steve Maguire’s Writing Solid Code many years ago. After I saw how helpful they could be, I started to structure my code to make it more “assertable.” For example, I like state machines that use a table of valid transitions combined with assertions. This prevents anything I don’t explicitly anticipate from happening and is really helpful in a networked, asynchronous world.
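
To make that concrete, here is a rough sketch of what I mean by a table-driven, assertable state machine. The states, events, and transitions below are made up for illustration; they are not FolderShare’s actual code.

    // Sketch of an "assertable" state machine: a table of valid
    // transitions plus an assert that rejects anything unanticipated.
    #include <cassert>

    enum State { IDLE, CONNECTING, SYNCING, CLOSED, NUM_STATES };
    enum Event { CONNECT, CONNECTED, SYNC_DONE, DISCONNECT, NUM_EVENTS };

    // -1 marks transitions that should never happen.
    static const int kNextState[NUM_STATES][NUM_EVENTS] = {
        /* IDLE       */ { CONNECTING, -1,      -1,   CLOSED },
        /* CONNECTING */ { -1,         SYNCING, -1,   CLOSED },
        /* SYNCING    */ { -1,         -1,      IDLE, CLOSED },
        /* CLOSED     */ { -1,         -1,      -1,   -1     },
    };

    State Transition(State s, Event e) {
        int next = kNextState[s][e];
        // Anything not explicitly anticipated in the table is a bug.
        assert(next != -1 && "unexpected event for this state");
        return static_cast<State>(next);
    }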

But after developing a few long-running services, I’ve started to perceive the need for a new type of assertion that is a bit higher level than a single conditional. I want something that will let me crash a service (and save a memory dump) when something “feels” wrong. For the moment, I call these “probabilistic assertions,” and I would have slept better while running FolderShare if I’d implemented them then.

Like all synchronization software, FolderShare had a few nightmare scenarios that I worried about all the time. If, due to a bug, the service told every client that its files were all deleted, I’d probably have to read a lot of nasty blog posts, and I would feel like crap for a few years. I would much rather crash all my servers and debug the problem with the system offline than take a chance on pissing off so many people.

Anyway, a normal assertion that looks at a single conditional wouldn’t help in this case. Asserting that no files have been deleted when a client connects doesn’t make any sense. But for a large enough sample size, I could assert that at least 80% of clients that connect shouldn’t have any files deleted. And I could probably assert that at least 95% of clients that connect shouldn’t have all their files deleted.

Functions like these cover most of what I’m thinking about:
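
(What follows is only an illustrative sketch; the class name, the thresholds, and the counting scheme are placeholders, not real FolderShare code.)

    // A "probabilistic assertion": track a boolean condition across many
    // events and abort (via assert) if the observed success rate falls
    // below a minimum, but only once the sample is big enough to mean
    // something.
    #include <cassert>
    #include <cstddef>

    class ProbabilisticAssert {
    public:
        ProbabilisticAssert(double min_ok_fraction, std::size_t min_samples)
            : min_ok_fraction_(min_ok_fraction), min_samples_(min_samples) {}

        // Record one observation, e.g. "this client connected and had
        // no deleted files."
        void Observe(bool ok) {
            ++total_;
            if (ok) ++ok_;
            if (total_ >= min_samples_) {
                double fraction = static_cast<double>(ok_) / total_;
                assert(fraction >= min_ok_fraction_ &&
                       "probabilistic assertion failed: too many bad events");
            }
        }

    private:
        double min_ok_fraction_;
        std::size_t min_samples_;
        std::size_t ok_ = 0;
        std::size_t total_ = 0;
    };

    // Usage: at least 80% of connecting clients should have no deletions,
    // and at least 95% should not have *all* their files deleted.
    static ProbabilisticAssert g_no_deletions(0.80, 1000);
    static ProbabilisticAssert g_not_all_deleted(0.95, 1000);

    void OnClientConnect(bool any_files_deleted, bool all_files_deleted) {
        g_no_deletions.Observe(!any_files_deleted);
        g_not_all_deleted.Observe(!all_files_deleted);
    }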

Depending on your application, these functions could be implemented either as assertions or as triggers to alert an administrator. For alerts, you could really take this to another level by checking, for example, that incoming events follow a certain distribution. Granted, it’s a bit over the top, but I can’t help but imagine checking whether a stream of incoming events obeys a Poisson distribution or whether the sizes of certain data follow a Gaussian distribution.
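
As a crude illustration of the distribution idea, one could keep a running mean and variance of something like incoming message sizes and flag values that land far outside the historical range (the metric, the 6-sigma threshold, and the minimum sample size below are all invented):

    // Track the running mean and variance of a metric (Welford's online
    // algorithm) and flag values that are wildly outside the norm.
    #include <cmath>
    #include <cstdio>

    class GaussianSanityCheck {
    public:
        void Observe(double x) {
            ++n_;
            double delta = x - mean_;
            mean_ += delta / n_;
            m2_ += delta * (x - mean_);

            if (n_ > 1000) {  // wait for a decent sample first
                double stddev = std::sqrt(m2_ / (n_ - 1));
                if (stddev > 0 && std::fabs(x - mean_) > 6 * stddev) {
                    // In production this might page an administrator
                    // rather than (or in addition to) crashing.
                    std::fprintf(stderr, "suspicious value: %f\n", x);
                }
            }
        }

    private:
        long long n_ = 0;
        double mean_ = 0.0;
        double m2_ = 0.0;
    };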

Has anyone seen anything like this in the wild? I’d love to hear about it.

13 Responses to “Crashing When Something Feels Wrong”


  • Apple’s iPhone sync software has something in the same spirit – it prompts the user for confirmation (and shows details) if a sync operation will change more than a certain number or percentage of contacts on the phone or computer.

  • Cool. It would be great if somehow you could bind your “fuzzy” assertions to an NN.

  • Cannot think of any programs. But environmental monitoring systems use similar techniques. A system will have several redundant probes/detectors for a given factor and monitor all of them. For example, one temperature probe may read 300 degrees while two others read only ~200 degrees. The system assumes the probe reading 300 could be malfunctioning and continues to operate as usual until a second probe spikes. Then the system shuts down to protect whatever resource it is monitoring.
    I am sure you have already considered similar examples. But I think a traditional engineering view is often left out when thinking in terms of highly reliable software. I think the closest things to what you describe in software are intrusion detection or flood protection. Your post has got me thinking about our current system and how to implement something similar to detect anomalies that could lead to potential problems. Thanks.

  • > I even like running production services with assertions turned on.

    Then you might like this little Javac plugin, which turns all assertions into explicit exception throws so that you cannot turn them off anymore.

    http://smallwiki.unibe.ch/adriankuhn/javacompiler/forceassertions/

  • Yay for automated monitoring software. Nagios (http://www.nagios.org/) does this for networks (and is extendable for some other things). At my old job we used Hobbit (http://hobbitmon.sourceforge.net/) to watch our Java server instances (memory usage, etc.). There’s no reason why these monitoring programs couldn’t be used to monitor internal program statistics, as long as those stats were made available.

    Generally you monitor from your internal network, and then provide some hook for the monitor to get information that’s only accessible from there. (SSH or a limited-access URL, etc.)

    Monitoring programs are super-powerful and generally complex. Check them out — it’s a good skill to have when working with production software.

  • Could another type of assertion use spam-filter-like algorithms to detect abnormal (‘spam’) situations? Train a filter to recognise odd situations.

  • Hello Tom,

    My bank calls me whenever they see ‘suspicious’ activity on my accounts, such as a flurry of purchases or other transactions that don’t fit the normal usage patterns. Airport security in some places is moving toward behavioral profiling, where people whose actions in the airport or other details don’t fit the expected pattern are identified for additional inspection. What you’re proposing happens a lot in the real world, so it makes sense that it would work for software management as well! Great idea; good luck with its implementation.

  • You might enjoy reading this paper, “Pip: Detecting the Unexpected in Distributed Systems”. Among other things, the Pip system supports the kind of aggregate assertions/checks that you mention.

    There’s source available at http://issg.cs.duke.edu/pip/

    http://issg.cs.duke.edu/pip/nsdi06preprint.pdf
    (from the abstract)
    We present Pip, an infrastructure for comparing actual behavior and expected behavior to expose structural errors and performance problems in distributed systems. Pip allows programmers to express, in a declarative language, expectations about the system’s communications structure, timing, and resource consumption. Pip includes system instrumentation and annotation tools to log actual system behavior, and visualization and query tools for exploring expected and unexpected behavior.

  • To generate a core dump without crashing with a SIGSEGV, you could use Google’s Core Dumper
    (https://sourceforge.net/projects/goog-coredumper/)

    “A neat tool for creating GDB readable coredumps from multithreaded applications — while the program is running.”

  • Your approach to probabilistic assertions is very interesting. I am using a similar approach in some of my trading algorithms. Basically, if some statistical conditions show something that “shouldn’t be there,” the system gets ready to dump the trade. However, to make it perfect, I think the system should keep track of every time the “feeling” was correct (i.e., the system did crash or the stock, in my case, moved in the wrong direction). After that, we would need a Bayesian calculation of probabilities before crashing the system.
    I haven’t included this in my systems because it is too much work, but I still think it is something I will do in the future.
    Maybe you could move your system one step forward: record the “feeling” hits and create a simpler algorithm for actual crashing based on that.
    At the beginning you would never crash the system, but as you get more data you would crash the system more selectively.
    My 2 cents

  • I once ran a messaging system that received about a million records a day. I also ran a document generation system that FTP’d anywhere from 10,000 to 60,000 files a night. Both systems had multiple, custom-built monitors that could page developers if problems arose.

  • My suggestion is logging combined with log-analysis tools. You *could* monitor processes (like Erlang-style concurrency) as well.

    With logs, you just output your info into a file (or db table) and then periodically analyze the data for patterns. These patterns could be trained (e.g. Bayesian filtering) or matched against known problem sequences (e.g. a state machine that acts like a code/regexp parser).

  • All this logging and analysis, in a production service, sounds like it would cost a lot of CPU time and could easily be overkill.
