Life, Death, and the Critical Transition: Finding Liveness Bugs in Systems Code

Charles Killian, James W. Anderson, Ranjit Jhala, and Amin Vahdat
University of California San Diego
Abstract

Modern software model checkers find safety violations: breaches where the system has entered some bad state. For many environments, however, particularly complex concurrent and distributed systems, we argue that liveness properties are both more natural to specify and more important to check. Liveness conditions specify desirable system conditions in the limit, with the expectation that they will be temporarily violated, perhaps as a result of failure or during system initialization. Existing software model checkers cannot verify liveness because doing so requires finding an infinite execution that never satisfies one or more liveness properties. We present algorithms that find, with high probability, liveness violations and the critical transition that moves the system from an indeterminate state, where liveness can still be achieved despite a temporary violation, to a dead state, where it becomes impossible to ever achieve liveness. We present MACEMC, a software model checker that implements our algorithms and finds complex liveness errors in implementations of PASTRY, a reliable transport protocol, and an overlay tree.

1 Introduction

There has been substantial recent progress in employing model checking to find bugs in unmodified systems implementations [15, 27]. Model checking can improve code quality by systematically analyzing carefully constructed executions of the system to find bugs that violate system-level correctness properties. These runs correspond to the exhaustive exploration of all possible interleavings of events and other sources of non-determinism, such as thread scheduling, message transmission, and I/O latency, up to a certain maximum number of events.

Existing software model checkers find violations of safety conditions.
For example, a developer may specify that a pointer should never be dereferenced after it is freed, or that a node's routing table should never accept advertised routes that contain the node itself, to avoid loops. Thus, finding bugs with model checking currently requires the programmer to have intimate knowledge of the low-level actions or conditions that could result in system failure.

We contend that for complex systems, specifying the desirable behaviors of the system is easier than identifying everything that could go wrong. Of course, specifying both desirable conditions and safety assertions is valuable; however, current model checkers do not have any mechanism for verifying whether desirable system properties can be achieved. Examples of such properties include: i) a reliable transport eventually delivers all messages even in the face of network losses and delays, ii) all nodes eventually join an overlay, and iii) a distributed tree partitioned into two halves can eventually merge. These global liveness requirements specify that, in the limit, the system should achieve some particular condition.

Unfortunately, identifying liveness violations poses a much greater challenge than finding safety violations. For safety properties, simply finding a state that violates the given condition proves the violation, whereas temporary violations of liveness conditions are expected and quite common. For instance, flagging a liveness violation in a temporarily partitioned network is premature. Rather, the model checker must verify that the system has entered a dead state where it becomes impossible to ever recover to a live state. Model checkers that operate over finite-state specifications locate liveness violations by finding a cyclic path without any live states.
This approach is infeasible when checking real systems (which have infinite state spaces) because such repeating sequences typically occur beyond the horizon that can be exhaustively searched (a few dozen system events) and because complex system interactions (e.g., a timer unrelated to an error firing periodically) mean that a repeating sequence may not be found even if a search to sufficient depth were possible.

The primary contribution of this work is a model checker and set of algorithms capable of finding both safety and liveness violations in systems implementations. To find liveness violations, we first perform an exhaustive search to some maximum depth (limited by CPU time, memory, or both). From this periphery we then execute strategically chosen random walks to sample the space beyond that which can be exhaustively searched, in order to locate potential liveness violations. From this initial state we perform additional sampling to ensure a low rate of false positives in the set of violations reported to the developer.

Next, we address the difficult task of finding the actual system error that causes a particular liveness violation. Unfortunately, the initial state where the system became non-live very likely has little to do with the error; rather, a much later operation moves the system into a dead state from which recovery becomes impossible. To limit the tedium of wading through hundreds of states to find the error, we developed: an algorithm to isolate the critical transition (the point after which the system can never recover), which is likely to be the source of the error; an interactive debugging tool, MDB, that allows forward and backward stepping through global events and per-node state inspection; and automatically generated event graphs to visualize system behavior.

We have built a fully functional model checker, MACEMC, for analyzing safety and liveness properties of unmodified, complex systems implementations.
We have run MACEMC on a number of systems, including a reliable transport, an implementation of PASTRY [30], and an overlay tree. All of these systems are mature pieces of code with the easy bugs previously eliminated through live deployment. We have used MACEMC to find 33 bugs across these systems. After developing MDB and event graph visualization, we reduced the human time required to find and fix liveness violations from a few hours to minutes.

2 Background

Model checkers find bugs by isolating executions that violate some desired property. Previous efforts to model check systems implementations have found bugs corresponding to violations of safety properties such as null-pointer dereferences and deadlocks. Safety properties specify undesirable conditions that must never occur, rather than specifying what the system ought to be doing. Thus, while acknowledging that safety violations provide valuable checks for known error conditions, we argue that finding liveness violations is more useful and more fundamental to building robust systems. That is, liveness properties often more easily and succinctly capture the global, system-spanning correctness requirements for systems in steady state; systems may diverge temporarily, but they should achieve the desirable conditions after a transient condition subsides.

To illustrate the difficulty of locating liveness violations, consider the example in Figure 1, which corresponds to model checking the state space of a system. This example uses a liveness predicate P as well as some safety conditions. The live states are consistent, i.e., they satisfy P. Indeterminate states do not satisfy P but may do so after some further execution. Dead states do not satisfy P and can never in the future recover to a live state. Finally, unsafe regions violate some safety property; a system should never enter an unsafe region, and it is usually easy to verify whether it has done so.
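The three-way classification above can be made concrete on a small finite graph. The following is a minimal sketch, not MACEMC's algorithm: it assumes the whole state graph is known and finite, which is exactly what does not hold for real implementations. The function names and the five-state toy system are hypothetical. A state is dead if no live state is reachable from it, and a critical transition is an edge from a non-dead state into a dead one.

```python
from collections import deque


def classify_states(transitions, live):
    """Label every state of a finite graph as live, indeterminate, or dead.

    transitions: dict mapping state -> iterable of successor states
    live: set of states that satisfy the liveness predicate P
    """
    # Build reverse edges so we can walk backwards from the live states.
    predecessors = {s: set() for s in transitions}
    for src, dsts in transitions.items():
        for dst in dsts:
            predecessors.setdefault(dst, set()).add(src)

    # Any state with some path to a live state can still recover.
    can_recover = set(live)
    frontier = deque(live)
    while frontier:
        state = frontier.popleft()
        for pred in predecessors.get(state, ()):
            if pred not in can_recover:
                can_recover.add(pred)
                frontier.append(pred)

    labels = {}
    for state in predecessors:
        if state in live:
            labels[state] = "live"
        elif state in can_recover:
            labels[state] = "indeterminate"
        else:
            labels[state] = "dead"
    return labels


def critical_transitions(transitions, labels):
    """Edges that move the system from a recoverable state into a dead one."""
    return [(src, dst)
            for src, dsts in transitions.items()
            for dst in dsts
            if labels[src] != "dead" and labels[dst] == "dead"]


# Hypothetical toy system: "joined" satisfies P; "split" and "stuck" form a trap.
TOY = {
    "init": ["joining", "split"],
    "joining": ["joined", "init"],
    "joined": ["joined"],
    "split": ["stuck"],
    "stuck": ["split"],
}
LABELS = classify_states(TOY, live={"joined"})
```

In this toy graph, `init` and `joining` are indeterminate, `split` and `stuck` are dead, and the only critical transition is the edge from `init` to `split`.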
In this figure, the model checker performs a bounded depth-first search (BDFS) starting from some initial state at the center of the figure. The checker could flag the indeterminate states at the periphery of its exhaustive search as potential violations. Unfortunately, most of the states at the periphery of this search are indeterminate, as it often takes the system many more steps than the maximum search depth to satisfy the liveness condition.

Instead of checking that the system reaches a live state eventually, we could restrict ourselves to checking time-bounded liveness properties, wherein we require that a live state be reached within some bounded number of transitions. For our target systems, the number of transitions required to reach the live state is often quite large. For example, many messages may need to be exchanged before all nodes join an overlay. Thus, for a bounded liveness violation to correspond to a real liveness violation, the time bound must be large enough (often thousands of transitions, even for distributed systems consisting of two nodes) to give the system sufficient opportunity to reach a live state and to prevent a large rate of false positives corresponding to paths not given sufficient opportunity to achieve liveness. Of course, it is infeasible to exhaustively search to these required depths.

To further highlight the relationship between safety and liveness violations, consider the following real example. A liveness property for an overlay tree may state that eventually all nodes form a spanning tree. This sufficiently general property applies to a variety of tree construction protocols. However, the liveness condition that the nodes form a tree may be violated for long durations, e.g., at system startup and as a result of failure, and thus cannot be expressed as a safety property.
Any safety checker [15, 27] would have to flag states not satisfying the liveness condition as violations, which would result in a flood of false positives, since the majority of those states would be able to recover to a live state.

We ran our model checker on an implementation of an overlay tree to find violations of this liveness property. We found a path to a dead state characterized by two disjoint trees that have no hope of eventually forming a spanning tree. This error resulted from a recovery timer firing during the unanticipated condition that the node was busy joining a potential parent, after which the timer remained unscheduled. The unscheduled recovery timer left no mechanism to merge the disjoint trees. After understanding this error, we realized a new per-node safety property: the recovery timer must always be scheduled when the node is in a non-initial state. Using this safety condition, the model checker located two additional errors.

In our experience, fixing a global liveness violation often suggests additional per-node safety conditions. These safety conditions help isolate yet more errors, but typically would not be obvious a priori. We believe that an attractive model for identifying errors starts by writing high-level liveness properties for desirable system states in the limit. Errors resulting from these violations often provide the insight necessary to add safety properties. While liveness properties are more convenient to specify, finding and fixing safety violations is typically easier because safety properties often do not involve complex global state and because the error is usually close to the operation that triggers a safety violation.

While previous work shows how to model check liveness properties of mathematical system descriptions (typically compiled into finite-state models), we develop new techniques to find and fix liveness errors in the implementations.
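To make the contrast between the two kinds of property concrete, the overlay tree example can be sketched as two predicates. This is an illustration only: the state representation (a `parent` map and a per-node dictionary with `state` and `timer_scheduled` keys) and the state name `"init"` are assumptions made for the sketch, not the structures used by MACE or MACEMC. The first predicate is global and is expected to be false for long stretches; the second is per-node and must hold in every reachable state.

```python
def forms_spanning_tree(parent):
    """Global liveness predicate: all nodes hang off one common root.

    parent maps each node to its parent; a root maps to None.
    """
    roots = [node for node, p in parent.items() if p is None]
    if len(roots) != 1:
        return False              # no root yet, or several disjoint trees
    for node in parent:
        seen = set()
        while node is not None:
            if node in seen or node not in parent:
                return False      # cycle or dangling parent pointer
            seen.add(node)
            node = parent[node]
    return True


def recovery_timer_ok(node_state):
    """Per-node safety predicate: timer scheduled in every non-initial state."""
    return node_state["state"] == "init" or node_state["timer_scheduled"]
```

Two disjoint trees, such as `{"a": None, "b": "a", "c": None, "d": "c"}`, fail the liveness predicate, but that alone does not show the system is dead. A node in a non-initial state with `timer_scheduled` set to `False` fails the safety predicate immediately.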
System implementations contain not only flaws in high-level algorithms or protocols, but also subtle errors due to specific implementation choices. A software model checker capable of finding both safety and liveness violations requires addressing several key challenges. We outline these challenges here before discussing our solutions in subsequent sections.

How to systematically explore all executions of implementations? The model checker must read a system implementation and systematically explore all executions. The system to be analyzed spans multiple nodes executing concurrently and communicating via messages sent over a network. Each node executes an entire computation stack that interacts with a networking layer below and an application layer above. We must isolate all points where the code makes a choice based on a non-deterministic or random input, and systematically explore what might have been had the system made different choices. For distributed systems, sources of non-determinism include different ways in which: i) scheduling affects the order nodes execute, ii) applications interact with the system, iii) the physical network delays or drops messages, iv) scheduled timers fire, v) nodes fail, vi) the system makes random number requests, and vii) all of the above may be interleaved. We require a way to expose to the checker the sources of non-determinism, and a technique to systematically explore the multiple potential resolutions to individual sources of non-determinism, and thus, the ways the system may execute.

Figure 1: Exploring a system's state space for safety and liveness violations.

How to mitigate state explosion? By searching all resolutions to sources of non-determinism, a model checker systematically explores different executions corresponding to the temporal interleaving of events. The space of all possible event orderings grows exponentially with the depth of the execution and the number of nodes in the system.
This state-explosion problem can severely limit the applicability of model checking to short executions of small distributed systems. However, many subtle problems only manifest after the system has run sufficiently long to enter steady state (e.g., after all nodes have initially joined). Finding such bugs deep in the state space requires techniques to mitigate the state-explosion problem and to prune either uninteresting sequences or sequences that have already been explored.

How to tell the time? Distributed systems make extensive use of time, for example, to determine the latency between two nodes (timestamps), to tell if another node is still alive (timeouts), to delay performing certain tasks (timers), or as an easy source of monotone values (ordering). A model checker cannot use the actual implementation of the time system calls when analyzing the code because doing so would make replaying an error path impossible, as subsequent runs would return different values. Furthermore, by using actual time the model checker would be restricted to exploring only states corresponding to the particular values of time in a given run.

3 Model Checking with MACEMC

MACEMC runs code written in the MACE programming environment [2]. We briefly outline the aspects of MACE relevant to model checking. MACE introduces syntax to structure each node as a state machine with transitions corresponding to events such as message reception, timers firing, etc. Most of a MACE implementation consists of C++ code in appropriately identified code blocks describing system state variables and event handlers. The MACE compiler outputs C++ code ready to run across the Internet. MACEMC model checks this resulting C++ code. Our approach should thus be general to any system implementation and could be incorporated into other software model checkers.
In this paper, we focus on applying model checking to distributed systems where complex protocols and asynchronous environments make model checking, especially for liveness properties, particularly valuable. Relative to state-of-the-art model checkers [15, 27], the principal advantage of leveraging MACE code is that we need not modify source code to isolate the execution of the system, e.g., to control network communication events, time, and other sources of potential input to the system. We designed MACE with model checking in mind, including programming constructs that allow MACE-implemented systems to be model checked without modification, dramatically improving the ability of the typical programmer to leverage the benefits of model checking. Note, however, that our techniques and algorithms are general to a variety of software model checkers.

3.1 Systematically Exploring Executions

We take an approach similar to Verisoft [15] and CMC [27], where we use a driver application suitable for simulation and load different simulator-specific libraries, specifically for random number generation, timer scheduling, and message transport, so that our model checker can explore a variety of event interleavings for a particular system configuration and input condition. Possible event interleavings include the relative ordering of: i) executions of individual nodes in the system, ii) application events, iii) message delivery, and iv) timers and other scheduling events relative to message delivery.

To exhaustively explore different event interleavings of the system, we must ensure that the model checker controls all sources of non-determinism in the system. Hence, we ensure that the system calls Choose, the standard MACE interface to randomness, when making a non-deterministic choice. The result from Choose (described below) determines the next branch in the execution.
The model checker executes the actual system code to the next non-deterministic event decision point, and by iteratively feeding the system different sequences of values at the choice points, it can exhaustively explore executions corresponding to different event interleavings in a depth-first manner.

When running across a real network, Choose simply makes a call to drand48(). While the model checker performs a search, Choose deterministically returns the next number in the search sequence by picking a number from the input range. The Choose implementations used by the model checker internally record the sequence of values returned during the search and random walk. By saving this sequence, MACEMC can track which sequences to subsequently explore and can replay a given execution path by pre-loading Choose to return a specified number sequence.

One benefit of MACE is that the programmer need not painstakingly isolate the sources of non-determinism and replace them with these special calls. MACE provides language support and common libraries for network communication and scheduling timers, two of the principal sources of non-deterministic asynchrony in distributed systems. For example, when checking a UDP message queue to determine the next event, there may be a variety of message orderings possible that could affect the behavior of the system. Our simulated network layer calls Choose with the number of available messages as input and uses the returned value to decide which message to process next. Our modular decomposition moreover allows MACEMC to investigate how the system would behave on networks where, for example, many packets are lost or reordered, without having to find a real network with those properties. To explore the robustness of the target code, MACEMC may choose to drop certain messages or reset socket connections with some given probability.
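The record-and-replay behavior described above can be illustrated with a small sketch. It is not the MACE `Choose` interface: the class below, its `preload` and `rng` parameters, and the `deliver_all` toy network layer are all hypothetical names chosen for illustration. The idea shown is that every choice point draws from one controllable source, the values returned are recorded, and pre-loading a recorded sequence reproduces the same execution.

```python
import random


class Choose:
    """Hypothetical stand-in for a controllable source of non-determinism."""

    def __init__(self, preload=None, rng=None):
        self.preload = list(preload or [])   # values to return first, in order
        self.rng = rng if rng is not None else random.Random()
        self.trace = []                      # every value returned so far

    def __call__(self, n):
        """Return an integer in [0, n) and record it in the trace."""
        position = len(self.trace)
        if position < len(self.preload):
            value = self.preload[position]
            if not 0 <= value < n:
                raise ValueError("preloaded value out of range at this choice point")
        else:
            # Past the preloaded prefix: fall back to a random walk.
            value = self.rng.randrange(n)
        self.trace.append(value)
        return value


def deliver_all(choose, messages):
    """Toy network layer: ask choose() which pending message to process next."""
    pending = list(messages)
    order = []
    while pending:
        order.append(pending.pop(choose(len(pending))))
    return order


# Record one randomly chosen delivery order, then replay it exactly.
first = Choose(rng=random.Random(7))
original_order = deliver_all(first, ["a", "b", "c"])
replayed_order = deliver_all(Choose(preload=first.trace), ["a", "b", "c"])
```

Here `replayed_order` equals `original_order` because the second run is driven entirely by the trace saved from the first. A systematic search would enumerate preloaded prefixes instead of drawing them at random.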
A simple driver application that initializes the system and carries out some task comprises the final component necessary to model check the system. In our experience, simple tasks, such as joining an overlay tree and sending a message from the root to all participants, are sufficient to uncover a significant number of bugs because of the variety of corner cases and failure conditions exercised by the model checker.

Algorithm 1 MaceMC Simulator
  Reset event simulators
  Reset Nodes
  Initialize Choose with next sequence
  for step = 0 to maxstep do
    Signal divergence monitor thread
    readyevents = set of pending (node, event) pairs
    if readyevents is empty then break
    (node, event) = readyevents[Choose(|readyevents|)]
    Simulate event on node
    if a safety property is not true then
      return SAFETY VIOLATION FOUND
    else if all liveness conditions satisfied then break
    else if a duplicate state-hash reached then break
  if step > maxstep then return INDETERMINATE STATE FOUND

MACEMC searches execution paths