
Xception: A Technique for the Experimental Evaluation of Dependability in Modern Computers

IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 24, NO. 2, FEBRUARY 1998

João Carreira, Student Member, IEEE, Henrique Madeira, and João Gabriel Silva, Member, IEEE

Abstract: An important step in the development of dependable systems is the validation of their fault tolerance properties. Fault injection has been widely used for this purpose; however, with the rapid increase in processor complexity, traditional techniques are also increasingly difficult to apply. This paper presents a new software implemented fault injection and monitoring environment, called Xception, which is targeted at modern and complex processors. Xception uses the advanced debugging and performance monitoring features existing in most modern processors to inject quite realistic faults by software, and to monitor the activation of the faults and their impact on the target system behavior in detail. Faults are injected with minimum interference with the target application. The target application is not modified, no software traps are inserted, and it is not necessary to execute the target application in special trace mode (the application is executed at full speed). Xception provides a comprehensive set of fault triggers, including spatial and temporal fault triggers, and triggers related to the manipulation of data in memory. Faults injected by Xception can affect any process running on the target system (including the kernel), and it is possible to inject faults in applications for which the source code is not available. Experimental results are presented to demonstrate the accuracy and potential of Xception in the evaluation of the dependability properties of the complex computer systems available nowadays.

Index Terms: Fault injection, RISC processors, dependability evaluation, real time.
1 INTRODUCTION

Computer systems are used nowadays in an increasing number of applications that require high levels of dependability. In some cases our lives depend on them, such as in traffic control, medical life support, or nuclear power station management applications. In other cases, such as banking, telecommunications and aerospace, failures can cause tremendous economic losses. Another novel area where dependability is increasingly important is high performance parallel computing. Parallel computers are used to run computation intensive applications, such as fundamental physics/chemistry and airplane/vehicle modeling, over long periods of time. Dependability is important to enable those long runs in spite of the increased probability of fault occurrence caused by the larger number of electronic components in parallel computers.

Generally, a dependable computer should be able to detect software or hardware errors, locate their origin, and recover from those errors by using some kind of fault tolerance mechanism. One important problem is how to evaluate and validate the effectiveness of the fault tolerance mechanisms embedded in these systems before production in order to correct defects and/or provide feedback for improvements.

The validation of the dependability properties of a computer system is an intrinsically complex task, and the growing complexity of both the hardware and software tends to make it even more difficult. The use of analytical modeling in actual systems is very difficult as the mechanisms involved in the fault activation and also in the error propagation process are highly complex; they are not completely understood in most cases. Furthermore, the simplifying assumptions usually made to make the analysis tractable reduce the usability of the results achieved by this method.
Experimental verification by error logging implies monitoring the system's behavior until real faults occur and is not appropriate or feasible in most cases.

Experimental evaluation by fault injection has become an attractive way of validating specific fault handling mechanisms and allowing the estimation of fault-tolerant system measures such as fault coverage and error latency [1]. Several techniques have been proposed for fault injection. Generally they can be based on specific hardware, system simulation, or software. Hardware techniques inject physical faults in the target system hardware. Simulation techniques make use of a simulation model of the target system. Finally, a third solution is to emulate hardware faults and errors through software (Software Implemented Fault Injection, SWIFI for short).

The motivation behind our work was the development of a flexible and portable tool to inject faults in the advanced computers available today. These computers are usually built around high speed RISC processors which typically have high transistor densities, high clock frequencies, large internal caches, and advanced components such as branch prediction units. Superscalar architectures also have replicated arithmetic units to execute machine instructions in parallel. All these factors together pose new challenges to fault injection. Traditional techniques such as hardware fault injection, although appropriate for simpler and older processors, are presently not easy to apply due to the difficulties in controlling and observing the fault effects inside the chips. Other techniques, such as simulation, are also difficult to apply because simulation models of these processors are very complex, and are often considered critical and confidential information by the manufacturers, thus being very difficult to obtain.

J. Carreira, H. Madeira, and J.G. Silva are with the Departamento de Engenharia Informática, Polo II da Universidade de Coimbra, 3030 Coimbra, Portugal. E-mail: {jcar, henrique, jgabriel}@dei.uc.pt. Manuscript received 6 June 1996; revised 18 Sept. 1997. Recommended for publication by F. Schneider. For information on obtaining reprints of this article, please send e-mail to: tse@computer.org, and reference IEEECS Log Number 101221. 0098-5589/98/$10.00 © 1998 IEEE

However, while the complexity of modern processors hinders the application of some fault injection techniques, it provides several potential benefits to others. Within the millions of transistors that make up these processors (~5,000,000 for the PowerPC 604), architects included advanced debugging and performance monitoring mechanisms. These new features are accessible to software (through privileged machine instructions) and thus can be directly used by SWIFI tools.

This idea was behind the development of a new software implemented fault injection and monitoring environment, called Xception. Unlike previous SWIFI tools, Xception can inject faults with minimum interference with the target application by directly programming the debugging hardware inside the target processor. The sophisticated debugging exception mechanisms available allow the definition of many fault triggers (events that cause the injection of the fault), including fault triggers related to the manipulation of data. On the other hand, by using the performance monitoring hardware inside the processor, Xception can record detailed information on the target processor behavior after the injection of a fault.
Some examples are the number of clock cycles, the number of memory read and write cycles, and instructions executed (including specific information on instructions such as branches and floating point instructions) from the injection of the fault until some other subsequent event, for instance the detection of an error (latency). Furthermore, by combining the exception triggers provided by the debugging hardware and the performance monitoring features of the processor, Xception can monitor other aspects of the target behavior after the fault. For example, it is possible to detect if some memory area was accessed after the fault or if some program function was executed.

Another important aspect is that, because Xception operates very close to the hardware (at the exception handler level), the injected faults can affect any process running on the target system including the kernel. It is also possible to inject faults in applications for which the source code is not available. In addition, the comprehensive fault triggers of Xception make it suitable for the emulation of software faults, as proposed by Christmansson and Chillarege [2]. In fact, the set of rules to emulate software faults proposed in [2] is difficult, or even impossible, to fulfill by SWIFI tools based on traps or hardware implemented fault injection.

The target system is regarded by Xception as formed by the processor, memory, and data/address buses. Injected faults can directly emulate physical faults affecting the following internal target processor units: Data Bus, Address Bus, Floating Point Unit, Integer Unit, Memory Management Unit, General Purpose Registers, Branch Processing Unit, and Main Memory. Presently, Xception has been implemented on a Parsytec parallel machine built around the PowerPC 601 processor and running the PARIX [49] operating system (a Unix-like operating system for parallel machines).

The structure of this paper is as follows. Section 2 describes related research in the fault injection field.
Section 3 discusses the advantages and the problems of the SWIFI approach. The design and implementation of Xception is described in Section 4. This section also presents the processor debugging and performance monitoring features used by Xception, the fault model, and the mechanisms used to inject faults at the low level. Section 5 demonstrates Xception's capabilities and presents results obtained in preliminary experiments. Finally, Section 6 suggests some future work; Section 7 concludes the paper.

2 RELATED RESEARCH

Fault injection has been widely used in the past to evaluate the dependability properties of systems or simply to validate specific fault handling mechanisms. This section summarizes the most relevant work in the area. For other more specific or detailed surveys in the area, the reader is referred to [41], [42], [43], [44].

2.1 Hardware Implemented Fault Injection

A popular approach consists of injecting physical faults into the target system hardware. Several methods have been used, such as pin-level fault injection [1], [3], heavy-ion radiation [4], power supply disturbances [5], and electromagnetic interferences [45]. These methods have the inherent advantage of causing actual hardware faults, which may be close to a realistic fault model. However, all these approaches require special hardware and in some cases (e.g., pin-level injection) the high complexity and high speed of the processors available today make the design of the required special hardware very difficult, or even impossible. The main problem is not in the injection of the faults itself but is related to the difficulties of controlling and observing the fault effects inside the processor. Even the detection of the activated faults is very complex. For example, the injection of faults in processor pins requires the use of complex monitoring hardware to know whether the injected faults have produced internal processor errors or not [6].
Similarly, techniques such as heavy-ion radiation and power supply disturbances require the target chip outputs to be compared pin-by-pin and cycle-by-cycle with a gold unit in order to know whether the injected faults have produced errors inside the target chip or not.

2.2 Fault Injection by Simulation

Simulation based fault injection has also been proposed for dependability evaluation. In this approach faults are injected into a simulation model of the target system, which allows control of the timing, the type of fault, and the affected component in the model with more or less accuracy depending on the level of abstraction of the simulator. One of the advantages of this technique, which makes it attractive to system manufacturers, is that it can be used early in the design process. With a simulator it is also possible to inject very precise faults and collect detailed information on their effects. However, this technique involves developing an accurate simulation model of the target system, which can be very time consuming for complex systems. Furthermore, the simulation models are not usually available from the manufacturers. Some recent examples of simulation-based fault injection tools can be found in [7], [8].

2.3 Software Implemented Fault Injection (SWIFI)

SWIFI techniques alter the hardware/software state of the system using special software in order to cause the system to behave as if a real hardware fault occurred. One of the early approaches is FIAT [9], which enabled the corruption of a task's memory image. The selection of the fault location was made by the user at the application level and the physical location within the memory image was obtained from compiler and loader information.
Although this work provided valuable results, it was not able to inject transient faults.

The concept of failure acceleration was introduced by Chillarege and Bowen in [10], where faults were injected by modifying memory contents under software control.

Another tool named DOCTOR [11] is capable of injecting processor, memory and communication faults on a distributed real-time system called HARTS. Processor faults are injected by modifying the application's executable image, specifically changing some instructions generated by the compiler and inserting extra instructions.

In the FERRARI [12] approach, the UNIX ptrace function is used to corrupt the process memory image at run time and insert software trap instructions at the specific instruction addresses where faults should be activated. This tool allows the injection of transient faults and provided valuable results from experiments conducted on a Sparc workstation.

Another tool named FINE [13] has been proposed to inject faults and monitor their effect by using a software monitor to trace the control flow. However, this tool needs the source code of the target application and causes a large overhead. DEFINE [14] is an evolution of FINE that includes distributed capabilities. It modifies the program's executable image in order to emulate memory faults, e.g., by inserting software traps at specific memory locations in the text segment. DEFINE also enhances FINE by introducing a modified hardware clock interrupt handler to inject CPU and bus faults with time triggers.

Finally, in addition to hardware faults, DEFINE is also capable of injecting some kinds of software faults, i.e., software design/implementation faults. FTAPE [47] was used to inject faults in three prototypes of a commercial fault-tolerant computer. FTAPE is part of a benchmark for characterizing the fault tolerance of a system and it also includes a synthetic program to generate CPU, memory and I/O activity.
FTAPE injects faults in the CPU, memory and I/O, and can select the time and location of the fault randomly, or based on workload activity measurements. This last technique is known as "stress-based injection" and ensures that faults are injected in components undergoing high activity.

The injection of fault types specific to parallel and distributed systems has also been a major concern in several SWIFI tools such as DOCTOR [11], DEFINE [14], EFA [15], and CSFI [16]. These tools are able to inject faults in the communication subsystems of their target systems through software and have been used for several purposes, such as evaluating distributed diagnosis algorithms, the fault tolerant capability of algorithms, or the overall effect of communication faults in parallel applications.

2.4 Hybrid Fault Injection

The hybrid approaches result from a mix of any of the previous techniques. SWIFI was used along with some extra hardware in a recent version of FERRARI [17] to help in the fault injection process, and in HYBRID [18] to trace fault activation and propagation in the target system. Another possibility consists of mixing SWIFI with simulation [19] to take advantage of both the speed of the actual target processor and the accuracy of low-level fault models. Finally, in [20] the dependability properties of the Motorola MC88100 RISC processor were evaluated using a mix of software and simulation techniques.

3 SWIFI: PROBLEMS AND SOLUTIONS

The current trend in fault injection seems to be the use of SWIFI tools. The advantages of SWIFI are manifold: SWIFI tools use real hardware and software, they are less complex and costly, and they incur less development effort than the other techniques. They are easily expanded (for new classes of faults), quite portable, and finally there is no problem with physical interferences or risk of damaging the target system as in physical fault injection.
However, existing SWIFI tools still have several major problems, which are discussed in the following subsections.

3.1 Impact in the Target System

One of the most important problems of SWIFI comes from the fact that existing software fault injection tools have considerable impact on the target system behavior, either because part or all of the code of the tool has to be executed in the target system (i.e., it becomes part of the target workload) or because the target processor may have to run in trace mode. Previous research of different natures [3], [21], [22], [23] has emphasized the impact of the workload on the performance of the fault handling mechanisms, which means that the software fault injection tools interfere with the results.

In Xception the impact on the target workload is quite small in both time and space. Concerning time, the exception that triggers the injection of the fault is programmed in the processor debugging hardware before starting the target application. Therefore, the processor executes at full speed until interrupted by the trigger exception to perform the injection itself. The time spent in exception handlers measured in the current version for the PowerPC ranges from 1 to 5 µsec, depending on the functional unit to affect. This is a value that can be accommodated in the time constraints of many real-time systems, and thus makes Xception suitable for use in these systems. Concerning space, with Xception the target application does not need to be changed. In addition, the Xception modules resident in the target system (low-level exception handlers and the fault injecting code) occupy as little as 30 kbytes.

3.2 Fault Triggers

Another problem of existing SWIFI tools is related to the restricted range of fault triggers.
Faults are injected either by corrupting the memory image of the application, by inserting traps, or by replacing one set of instructions by another set of instructions. All these methods are related to instruction execution, and no fault triggers related to data manipulation can be defined. In FERRARI [12], DEFINE [14], and FTAPE [47], faults can also be injected by defining a temporal trigger. One advantage of this method lies in the fact that the fault is (almost) always injected, since it is not related to any specific action of the target application. However, faults injected in this way cannot be reproduced, because the system clock used is not accurate and the application execution usually has time uncertainties. It is nevertheless worth noting that this might not be a problem when fault injection experiments are aimed at fault tolerance coverage estimation.

In Xception, a comprehensive set of fault triggers related to instruction execution, (some) data manipulations, and temporal features is available. The temporal triggers are implemented by using the internal timer available in most modern processors.

3.3 System Monitoring

The target system monitoring is another problem of existing SWIFI tools. Monitoring is required either for detecting the activation of faults or to collect relevant information on the fault impact. Only a few proposals handle the monitoring issue, either by using extra instrumentation [18], by using software monitors [13], or by inserting trap instructions in the adequate locations [14]. The first method needs extra (and complex) hardware, while the other methods cause great execution overhead and do not achieve detailed monitoring.

In Xception, the use of the dedicated performance monitoring and debugging hardware inside the processor greatly facilitates the monitoring of the target system in the presence of faults.
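As an illustration of the bookkeeping that such performance counters make possible, the following Python sketch computes the activity between fault injection and error detection from two counter readings. It is not Xception code: the `CounterSnapshot` type, the `elapsed` and `latency_microseconds` helpers, the 32-bit counter width, and the clock rate used in the example are all assumptions made for illustration only.

```python
from dataclasses import dataclass

COUNTER_BITS = 32                 # assumed counter width; real hardware may differ
COUNTER_MOD = 1 << COUNTER_BITS


@dataclass(frozen=True)
class CounterSnapshot:
    """Hypothetical reading of free-running hardware event counters."""
    cycles: int
    memory_reads: int
    memory_writes: int
    instructions: int


def elapsed(at_injection: CounterSnapshot, at_detection: CounterSnapshot) -> CounterSnapshot:
    """Events counted between two readings, tolerating one counter wraparound."""
    def delta(later: int, earlier: int) -> int:
        # Modular subtraction stays correct if the counter wrapped once.
        return (later - earlier) % COUNTER_MOD

    return CounterSnapshot(
        cycles=delta(at_detection.cycles, at_injection.cycles),
        memory_reads=delta(at_detection.memory_reads, at_injection.memory_reads),
        memory_writes=delta(at_detection.memory_writes, at_injection.memory_writes),
        instructions=delta(at_detection.instructions, at_injection.instructions),
    )


def latency_microseconds(cycles: int, clock_hz: float) -> float:
    """Convert a cycle count into microseconds for a given clock rate."""
    return cycles / clock_hz * 1e6


# Example readings (invented values); the cycle counter wraps between them.
before = CounterSnapshot(cycles=COUNTER_MOD - 100, memory_reads=10,
                         memory_writes=4, instructions=50)
after = CounterSnapshot(cycles=560, memory_reads=25,
                        memory_writes=9, instructions=300)
activity = elapsed(before, after)
print(activity)
print(latency_microseconds(activity.cycles, clock_hz=66e6))
```

Taking differences of two readings, rather than resetting the counters at injection time, keeps the sketch independent of whether the counters can be cleared by software, which the text above does not state.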
3.4 Accuracy

One of the most long-standing arguments against SWIFI states that its accuracy, i.e., its ability to emulate device-level faults (generally considered as the real faults), is very limited.

In any fault injection approach it is important to guarantee that the errors produced by injection are as close as possible to the errors produced by real faults. As the ultimate goal of fault injection is the validation of fault handling mechanisms, the set of errors produced by more accurate injections will validate these mechanisms more accurately. In general, lower levels of abstraction provide high accuracy at a higher cost, and higher levels of abstraction provide various levels of accuracy at a lower cost. Specifically for SWIFI, some attempts were recently made to demonstrate its accuracy, with reasonable success. An early tool developed with this goal in mind was called EMAX [24] and used a Zycad hardware simulator to inject faults into the gate/transistor level description of a circuit.

Another work [25] showed that over 80 percent of gate-level fault manifestations (errors) do not lead to errors or can be represented by SWIFI techniques. The author used a simulation model of a microprocessor in which gate-level faults were injected. An analysis procedure examined errors occurring from the injected faults and categorized them according to their ability to be emulated by SWIFI.

A recent work [26] introduced a microprocessor error behavior function (EBF) that maps faults into errors on the functional level. The study concluded that pin level fault injection was only able to emulate 9 to 12 percent of the bit-flip faults. On the other hand it showed that 98 to 99 percent of device level faults in the processor could be emulated by software implemented tools (SWIFI).

Another recent work from Charles Yount at Carnegie Mellon University clarified many issues concerning the representativity of SWIFI.
To have the best of both worlds, ASPHALT [19] mixes SWIFI and Fault Simulation to achieve a higher accuracy at a low cost and provided very encouraging results for the SWIFI techniques.

3.5 Intrinsic Limitations

Some processor resources and structures, such as the bus control lines and peripheral devices, cannot be directly reached by SWIFI tools. For instance, no SWIFI injected fault can cause errors in the low level bus control timings (although we can argue that their consequences can be emulated by data bus errors).

In the same way, the injection of faults in the target memory is partially limited by external logic implementing parity checks or error detection and correction. That is, the memory content is corrupted by the faults but the errors cannot be detected by parity, as the parity bits are set accordingly by the external logic.

These are intrinsic problems of SWIFI for which we can only envision one solution: the use of additional hardware support (hybrid solution).

3.6 Portability to Other Processors and Systems

Another important problem with existing SWIFI tools is their portability to other processors. SWIFI tools can largely benefit from modularity in design by separating the low-level and processor specific code from the higher-level fault injection modules. The changes required when porting a SWIFI tool to another system can thus be restricted to the low-level module (usually a device driver). This is the case with, e.g., FTAPE [47] and also Xception.

In Xception, the changes in the lower-level module comprise the adaptation of the exception handling code (written in C) to the specific target processor debugging and performance monitoring features. While the exception handling code itself is rather simple, the integration with the kernel can pose some problems.
In the current implementation of Xception, this code was integrated with the PARIX [49] kernel much like a device driver.

The basic requirement to implement Xception on other processors is, however, the existence of the debugging and performance monitoring features. We investigated several contemporary processor architectures and successfully confirmed the existence of these features.

The HP Precision architecture [27] provides an optional SFU (Special Function Unit) for debugging. It supports separate register sets for data and instruction breakpoints, allowing even more sophisticated fault triggers than the PowerPC. There are also dedicated instructions to manipulate the debugging unit registers.

The Pentium processor has a comprehensive set of performance counters similar to the ones existing in PowerPC and it also has four breakpoint registers for establishing breakpoints. Although some of these features are not documented, and are only available through a non-disclosure agreement with Intel, Pentium debugging and performance monitoring features have been reverse-engineered and published in [28].

The Alpha AXP architecture [29] is a 64 bit load/store RISC architecture designed with particular emphasis on clock speed, multiple instruction issue and software migration. Although its debugging facilities are reduced, it includes performance monitoring features such as several registers to count hardware events and raise an interrupt upon counter overflow. It also includes an enable/disable bit for floating point instructions.
All these features are similar to the ones existing in the MPC604.

The MIPS R4400-R10000 processor family [30], although having reduced performance monitoring facilities, contains four special debugging registers implementing comprehensive debugging features similar to the ones existing in the PowerPC.

The Power2 architecture [31] includes a monitor containing several counters for counting instruction execution and data storage events up to a maximum of 320 user-defined events. This includes counting the number of fetched, dispatched, and executed instructions, floating point instructions, number and type of storage operations, etc. It also includes an Instruction Match Register to count the occurrence of specific instructions.

It becomes clear from the short survey presented above that the concept of using the debugging and performance monitoring features for dependability evaluation by fault injection has a wide applicability.

4 DESIGN AND IMPLEMENTATION OF XCEPTION

Xception consists of three modules, shown in dark gray in Fig. 1: a kernel module, a fault setup module, and the Experiment Manager module (EMM). The kernel module is a small (20 kbytes) module statically linked with the kernel of the target system. It consists of the exception handlers (basically glue code) and the code performing fault injection. A kernel incorporating Xception is provided to the user and is used for all fault injection experiments instead of the normal kernel. The fault setup module is a library of functions whose only task is to receive the fault parameters from the host (via message passing) and pass them to the kernel. Fault setup is accomplished by invoking the library function StartXception() from any process in any processor in the target system. It can be invoked by the target application or from a dedicated process if the application source code is not available (as is the case in Fig. 1).
Finally, the EMM runs on a host system (presently a SUN Sparc) and provides the user interface for fault definition, automatic fault injection experiment control, and collection of results.

The target system of Xception is regarded as composed of the processor, system buses and memory. The processor is further divided into its functional units, as is described in more detail in Section 4.4.1. The target can be a single processor system or a multiprocessor, the only difference being that multiprocessors need an additional parameter, which is the number of the processor in which to inject. The present implementation of Xception is targeted at a parallel machine based on PowerPC processors: the Parsytec PowerXplorer.

4.1 Processor Debugging and Performance Monitoring Features Used by Xception

The recently introduced performance monitoring and debugging features consist mainly of performance counters and breakpoint registers. The former count user defined events such as load, store, or floating point instructions. The latter enable the programmer to specify breakpoints for a wide range of situations such as load, store or fetch from a specified address or even some instruction types (e.g., floating point instructions). These features are accessed through privileged instructions and are mostly used by debugging and performance analysis tools.

The present implementation of Xception is targeted at systems based on the PowerPC processor family, more specifically the MPC601 [32]. The PowerPC will, therefore, be used as a case study. Table 1 shows the list of exception types of the MPC601 and MPC604 used by Xception. It is worth recalling that, unlike other injection tools, Xception uses hardware exceptions and not software trap instructions. The Decrementer exception is used by Xception to trigger fault injection after a user specified time (clock ticks), thus providing fault trigger definition in a temporal way. Run Mode and Data Access exceptions are used to define fault triggers in a spatial way.
For example, faults can be injected when the instruction in a specific address is fetched or when the data stored in some address is accessed. Experiments performed by using the spatial method can generally be reproduced because they depend on a specific address. On the other hand, in the temporal trigger method, faults cannot be reproduced due to execution time uncertainties. Trace mode is used basically just to execute the machine instruction affected by an injected fault. This is because to inject transient faults, the Xception methodology (explained in detail in Section 4.6) requires the control of the processor to be taken immediately after executing the corrupted instruction. It is worth noting that this exception is only used at the point of injection, because otherwise it would slow down program execution in an unacceptable way.

Fig. 1. Xception structure.
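The combination of a spatial trigger with a single traced instruction can be illustrated with a toy simulation. The following Python sketch is not Xception code and does not model the PowerPC: the `ToyCpu` class, its add-immediate instruction format, and the `run` and `flip_bit` helpers are hypothetical names invented here. It only shows the idea that a fault tied to an instruction address corrupts exactly one execution of that instruction, leaves the stored program untouched (a transient fault), and yields the same result on every run.

```python
from dataclasses import dataclass, field


def flip_bit(word: int, bit: int, width: int = 32) -> int:
    """Return `word` with one bit inverted, confined to `width` bits."""
    if not 0 <= bit < width:
        raise ValueError("bit index out of range")
    return (word ^ (1 << bit)) & ((1 << width) - 1)


@dataclass
class ToyCpu:
    """Hypothetical machine: each instruction is (register, immediate) and adds."""
    program: list
    regs: list = field(default_factory=lambda: [0] * 4)
    pc: int = 0

    def step(self, operand_override=None):
        """Execute one instruction, optionally with a corrupted operand."""
        reg, imm = self.program[self.pc]
        if operand_override is not None:
            imm = operand_override
        self.regs[reg] = (self.regs[reg] + imm) & 0xFFFFFFFF
        self.pc += 1


def run(cpu: ToyCpu, trigger_pc=None, bit: int = 0) -> list:
    """Run to completion; corrupt only the instruction at `trigger_pc`, if given."""
    while cpu.pc < len(cpu.program):
        if cpu.pc == trigger_pc:
            # Spatial trigger hit: execute this single instruction with a
            # flipped operand bit, then continue normally. The stored program
            # is never modified, so the fault is transient.
            _, imm = cpu.program[cpu.pc]
            cpu.step(operand_override=flip_bit(imm, bit))
        else:
            cpu.step()
    return cpu.regs


program = [(0, 1), (0, 2), (1, 4)]
golden = run(ToyCpu(program))                       # fault-free reference run
faulty = run(ToyCpu(program), trigger_pc=1, bit=3)  # flip bit 3 of the operand at address 1
print(golden, faulty)
```

Because the trigger depends only on the program counter value, repeating the faulty run produces the same register contents each time, which mirrors the reproducibility argument made for spatial triggers above. A temporal trigger would need a model of timing uncertainty and is not covered by this sketch.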