
Experimental Evaluation of the Fail-Silent Behaviour in Computers Without Error Masking

Abstract

Previous work has shown that using only simple behavior based error detection mechanisms invisible to the programmer (e.g. memory protection) the percentage of fail-silent violations can be higher than 10%. Since the study of these errors has shown that they were mostly pure data errors, in this paper we evaluate the effectiveness of software techniques checking the semantics of the data, such as ABFT and Assertions, to detect these remaining errors. The results of injecting physical pin-level faults show that these tests can prevent about 40% of the fail-silent model violations that have escaped the simple hardware-based error detection techniques. Moreover, the analysis of the remaining errors has shown that most of them remained undetected due to short range control flow breaks. When very simple software-based control flow checking was associated with the semantic tests, the target system behaved, without any dedicated error detection hardware, according to the fail-silent model for more than 98% of all the faults injected.

Keywords: fail-silent behavior, software checks, ABFT, assertions, control flow checking, pin-level fault injection, dependability, fault-tolerance, experimental validation.

1-Introduction

A computer is fail-silent if it produces only correct results, i.e., if incorrect results are generated as a consequence of a fault then the computer will not output them [Powell88]. The fail-silent model plays an important role in the design of reliable distributed and parallel systems in which it is required that an error within one node does not propagate to the other nodes in the system. The fail-silent concept is also very useful in the design of industrial control computers and embedded systems (in this field fail-silent systems are usually known as integer systems [Kirrman87]), as it is important to prevent the output of erroneous commands to the physical process under control.

Traditionally, fail silent computers are implemented by using massive redundancy.
Errors are detected by comparison of the results produced by duplicated modules (either hardware or software) and in case of disagreement the results are not output (e.g. messages are not sent in the case of a distributed system). The main problem of this approach is the very high overhead associated with the duplication (either by hardware or by software).

Experimental Evaluation of the Fail-Silent Behavior in Programs with Consistency Checks
Mário Zenha Rela, Henrique Madeira, João G. Silva
Department of Computer Engineering
University of Coimbra - Portugal
{mzrela, henrique, jgabriel}

Recent work [Madeira94] investigated the possibility of achieving fail-silent behavior without incurring the high costs of duplication techniques. The proposed approach consists of a simplex (traditional) computer enhanced with a comprehensive set of low cost error detection techniques such as memory access error detection, control flow error detection, watchdog timer, illegal instruction detection, error capturing instructions, etc. The idea behind this scheme is that if the errors caused by a fault are detected in time it will be possible to stop the erroneous computer behavior, thus preventing the fail-silent model violation. Fault injection results have shown that the system behaved according to the fail-silent model for 97.7% to 99.4% of the injected faults. Although these results are interesting, especially taking into account the low cost of the error detection methods used, their use is rather limited as they strongly rely on signature monitoring techniques, which cannot be incorporated in the systems based on the complex processors available today.
In fact, the only chance of using signature monitoring techniques in complex processors is to include the specific hardware required by these techniques inside the processor IC, and the fact is that none of the existing commercial processors is ready for these techniques.

In this paper the same basic approach for implementing low cost fail-silent nodes is investigated in a different direction. Unlike the research mentioned above, all the error detection techniques used in the present research to achieve fail-silent behavior exist in off-the-shelf computers. On the other hand, these error detection methods have been complemented by a set of software error detection techniques especially designed for the detection of data errors. The reason for this decision was the fact that most of the fail-silent violations observed have been caused by pure data faults, i.e., faults affecting only the manipulated data.

In the next section we describe the error detection mechanisms implemented. Section 3 is devoted to the fault-injection process. In the succeeding sections the results of the experiments are presented and discussed. Finally, some concluding remarks and ongoing work are presented in section 7.

2-Error Detection

Software Error Detection

From the research mentioned above it could be concluded that most of the faults that caused fail-silent violations are characterized by the following features:
• Affect mostly data cycles (read/write access to the data segment).
• Have a short duration (1 or 2 memory access cycles).
• Have been mainly injected in the data pins of the processor.

These results suggest that to raise the fail-silent behavior even further, error detection techniques based on the semantic verification of the data manipulated by the programs are required.
Data manipulation and semantic checking need the notion of correctness, which can be provided by the use of assertions: invariant relationships between the variables of a program, written as logical statements and inserted at different points in the program [Andrews79], [Mahmood83], [Leveson83]. These assertions can be written from the program specification or using some property of the problem or algorithm.

There are several difficulties in using this approach: assertions are not transparent to the programmer, which may be a serious limitation in many situations, namely due to the programming effort required. Moreover, it is felt that their effectiveness depends largely on a number of different factors such as the nature of the application, the programmer's ability, etc., which are very difficult to quantify and require a deep understanding of each new application as it is developed.

An approach that guarantees the correctness of the data manipulated by the programs and overcomes these limitations is Algorithm Based Fault Tolerance (ABFT), first introduced in [Huang84]. In this technique checksum encodings are embedded in the calculation for fault detection purposes. The encoding is done such that the application operations preserve the checksum structure. This checksum is verified by the application. If a difference between the stored and computed checksums is detected, a fault has occurred. This method has been particularly successful for matrix operations, but a number of ABFT schemes have been proposed for different computations (e.g., Fourier transforms [Malek85], [Reddy90], and matrix equation solvers [Luk85]). Since at the moment few algorithms using this method are available, one of the main problems of using ABFT is its lack of generality.
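
To make the checksum idea concrete, the following is a minimal sketch of a checksum-encoded matrix multiplication in the style of [Huang84]. It is not the benchmark code used in the experiments: the function names are hypothetical, the matrices are plain Python lists, and integer entries are assumed so that checksums can be compared exactly (floating point data would need a tolerance).

```python
def column_checksum(a):
    """Return a copy of a with one extra row holding the sum of each column."""
    return [row[:] for row in a] + [[sum(col) for col in zip(*a)]]


def row_checksum(b):
    """Return a copy of b with one extra column holding the sum of each row."""
    return [row[:] + [sum(row)] for row in b]


def matmul(x, y):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*y)]
            for row in x]


def abft_product(a, b):
    """Multiply the encoded operands; the result carries row and column checksums."""
    return matmul(column_checksum(a), row_checksum(b))


def verify_and_strip(c_full):
    """Check every row and column checksum, then return the data part only."""
    n = len(c_full) - 1      # number of data rows
    m = len(c_full[0]) - 1   # number of data columns
    rows_ok = all(sum(row[:m]) == row[m] for row in c_full)
    cols_ok = all(sum(c_full[i][j] for i in range(n)) == c_full[n][j]
                  for j in range(m + 1))
    if not (rows_ok and cols_ok):
        # Stored and recomputed checksums differ: a fault has occurred.
        raise ArithmeticError("ABFT checksum mismatch")
    return [row[:m] for row in c_full[:n]]


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c_full = abft_product(a, b)
product = verify_and_strip(c_full)

# Simulate a pure data error in one element of the encoded result.
c_full[0][1] += 1
try:
    verify_and_strip(c_full)
    detected = False
except ArithmeticError:
    detected = True
```

Because the product of a column-checksum matrix and a row-checksum matrix is itself a full checksum matrix, a single corrupted element breaks one row sum and one column sum, which is what the verification step looks for.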
ABFT is well suited to applications using very regular structures, thus its applicability is limited to this type of problem.

For problems such as the generation of random numbers, where it is impossible to find feasible checks based on the semantic value of the data, reexecution remains the only practical error detection method based solely on software.

Since very different software checks can be used, depending on the problem under study, and in order to obtain meaningful results, we used a set of tests representative of the different categories of software checks that can be used:
• Checks based on the regularity of the structures used.
• Checks based on the external specification of the problem.
• Reasonableness checks on the results produced.
• Checks based on the internal structure of the code.
• Reexecution when no other approach is feasible.

Each of these tests has been applied to a different benchmark reflecting the specific programming situation where it would be most appropriate. The program executed is formed by these benchmarks (Figure 1). The benchmarks and each particular software check are presented in Table 1 and described below.

Figure 1-Program executed by the target system (blocks shown: Initialization; Alea; String search; Quick sort; MatMult; Sieve; CRC; Sends a 16 bit signature to the outside).

• MatMult is a matrix multiplication program. This is the typical situation where very regular structures and operations are used. For error detection purposes we used Algorithm Based Fault Tolerance as described in [Huang84].
• QuickSort is a classic sorting algorithm. The software check used was a test based on the external specification to assure that the input array was correctly sorted at the end of the routine. More effective tests could have been used, such as the order-sum assertion [Saxena94], but we favoured simpler and thus faster tests.
• Search is a simple searching algorithm. It looks for a given string in the global array where all the previous results are stored.
The position where the string is found, or a "-1" if not found, is appended to this array. The tests applied were reasonableness checks, such as checking the index found against the string limits (or "-1") and confirming that the reference string is effectively at the index location.
• Sieve is the Sieve of Eratosthenes prime number generator. In this case a set of simple low-level tests based on invariants found in the routine variables was used.
• Alea is a random number generator. Since it is impossible to find feasible checks on the semantic value of the data (the algorithm is essentially an arithmetic expression), we used reexecution for error detection.
• CRC calculates the cyclic redundancy check of an array with the output from all the benchmarks used in each experiment. Since this algorithm is also an arithmetic expression we use reexecution. This routine is a special case in the pool of benchmarks since it is used to generate the final signature of the results produced by all benchmarks, calculated at the end of each program cycle. This signature (considered as the final result) is sent to the outside by a parallel output port. Since the signature corresponding to correct results is known in advance it is easy to check whether the signature is correct or not. In this way it is possible to measure the percentage of faults that cause the computer to produce wrong results, i.e. the faults that cause the computer to violate the fail-silent model. The circuit that checks the result signature also detects the situation in which the target system does not output the signature (system crash).

Since the benchmarks and their associated software checks are running on a physical machine, they benefit from the intrinsic hardware error detection capabilities. It must be noted that since these bottom-layer error detection mechanisms cannot be deactivated, only incremental coverage figures can be obtained with any software based error detection technique.
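
The checks described for QuickSort, Search and Alea can be illustrated with a short sketch. It is only an approximation under stated assumptions, not the code run on the target system: all names are hypothetical, the global result array is modelled as a Python string, the sort order follows the a[i] >= a[i+1] condition listed in Table 1, and `alea_step` is an arbitrary linear congruential step standing in for the real random number generator.

```python
class CheckFailed(Exception):
    """Raised when a software consistency check detects an error."""


def check_sorted(values):
    """External-specification check: a[i] >= a[i+1] for every adjacent pair."""
    for left, right in zip(values, values[1:]):
        if left < right:
            raise CheckFailed("array is not sorted")


def checked_search(results, target):
    """Search followed by reasonableness checks on the index that was found."""
    index = results.find(target)
    if index == -1:
        return index  # "not found" is a legal answer
    if not 0 <= index <= len(results) - len(target):
        raise CheckFailed("index outside the array limits")
    if results[index:index + len(target)] != target:
        raise CheckFailed("target is not at the reported index")
    return index


def reexecute(func, *args):
    """Run func twice and accept the result only if both runs agree."""
    first = func(*args)
    second = func(*args)
    if first != second:
        raise CheckFailed("reexecution produced a different result")
    return first


def alea_step(seed):
    """Hypothetical generator step (assumed constants, not from the paper)."""
    return (1103515245 * seed + 12345) % 2**31


data = sorted([3, 1, 2], reverse=True)
check_sorted(data)                      # passes silently
position = checked_search("abcabc", "cab")
value = reexecute(alea_step, 1)
```

On a fault-free run none of these checks fire; they only cost time. Their value shows up when a transient fault corrupts the data between the computation and the check, or between the two executions.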
We shall now describe this bottom hardware error detection layer.

Table 1-Benchmarks and their associated software checks

Benchmark    Description                 S/W check used
Alea         Random number generation    Reexecution
MatMult      Matrix multiplication       ABFT
QuickSort    Sorting of an array         (a[i] >= a[i+1])
Search       Find string in an array     (if FOUND confirm)
Sieve        Prime number generator      (code based)
Crc          Crc generation of result    Reexecution

Hardware Error Detection

In this research the target system used is a commercial VME bus computer based on the MC68000 processor (Force® SYS68K/CPU-6). All the hardware based error detection is obtained solely from the intrinsic behavior checking that is performed by the computer as it executes. This raw error detection is present in every modern computer, thus no dedicated error detection hardware is required. For the target system under study these mechanisms are the following:
• µP - MC68000 processor built-in error detection mechanisms. This processor has several internal error detection mechanisms [Osborne83]. The most relevant are the detection of accesses to non-implemented memory, fetch of invalid instructions, unaligned instruction fetch, and unaligned word access.
• MEM - Memory Access Error Detection. This is a set of error detection mechanisms similar to the memory protection features of a typical memory management unit. The following mechanisms are considered:
  AUM - Accesses to unused memory;
  ECS - Error in the code segment access (error if it is not an instruction fetch);
  EDS - Error in the data segment access (error if it is not a data read or write);
  AIM - Accesses to unimplemented memory (i.e., not physically present).
• WDT - Watchdog timer. Traditional implementation of a WDT by means of a hardware programmable timer.

Since the memory access error detection (MEM) strongly depends on the memory usage it is important to know the memory map. It is shown in Figure 2.
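
The four MEM mechanisms can be read as a classification of each bus access against the memory map. The sketch below illustrates that reading only; it is not a description of the actual hardware. The region boundaries are the address marks of Figure 2, while the function name, the treatment of the stack as a data region, the absence of any check on the interrupt vector area, and the use of FFFFFh as the limit of implemented memory are all assumptions made for the example.

```python
TOP_OF_MEMORY = 0xFFFFF  # "Top of memory" mark in Figure 2 (assumed AIM limit)

# (name, first address, one past the last address), from the Figure 2 marks.
REGIONS = [
    ("vectors", 0x00000, 0x00400),
    ("code",    0x08000, 0x08E04),   # 3588 bytes
    ("data",    0x08E04, 0x0969A),   # 2198 bytes
    ("stack",   0x14C00, 0x15000),   # 1024 bytes
]

ACCESS_KINDS = {"fetch", "read", "write"}


def classify_access(address, kind):
    """Return the MEM mechanism that would flag this access, or None if legal."""
    if kind not in ACCESS_KINDS:
        raise ValueError("unknown access kind: " + kind)
    if address > TOP_OF_MEMORY:
        return "AIM"                      # memory not physically present
    for name, start, end in REGIONS:
        if start <= address < end:
            if name == "code" and kind != "fetch":
                return "ECS"              # code segment used as data
            if name in ("data", "stack") and kind == "fetch":
                return "EDS"              # instruction fetch from data
            return None                   # access consistent with the map
    return "AUM"                          # inside memory, but unused


legal_fetch = classify_access(0x8000, "fetch")
write_to_code = classify_access(0x8000, "write")
fetch_from_data = classify_access(0x9000, "fetch")
unused = classify_access(0x10000, "read")
unimplemented = classify_access(0x100000, "read")
```

The small size of the used regions relative to the address space is what makes this kind of check effective: a corrupted address is far more likely to land in unused or unimplemented memory than inside the code or data segment.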
The memory map will be essentially the same in the succeeding experiments since software checks in general involve only a slight memory overhead.

Figure 2-Target System memory map (not to scale). Address space: 4 x 16 Mbyte; total memory: 768 Kbyte; code segment: 3588 bytes; data segment: 2198 bytes; stack: 1024 bytes. Address marks: 400h, 8000h, 8e04h, 969ah, 14c00h, 15000h, FFFFFh (top of memory). Regions: Interrupt Vectors, Code, Data, Stack, each labelled R/W or R/O.

3-Fault Injection

Experimental evaluation by physical fault injection has become an attractive way of assessing the effect of faults in computers and validating specific fault tolerance mechanisms. Among the various techniques available for physical fault injection (heavy-ion radiation [Gunneflo89], power supply disturbances [Miremadi92], and pin-level [Arlat89]) we decided to use pin-level fault injection as it enables great control over the fault injection process and allows the injection of repetitive faults (the same fault always causes the same impact on the target system). This latter feature is very important as it is often necessary to repeat the injection of a specific set of faults in order to understand the impact of some "odd" faults on the target system. In this research we used the RIFLE pin-level fault injector. This tool was already presented in detail in [Madeira94], thus only a very short description is provided here.

The RIFLE Fault Injection Tool

RIFLE is a pin-level fault injector capable of injecting faults in relatively complex processors. The leading idea of RIFLE is to combine trigger and tracing techniques traditionally used in digital logic analyzers with the logic required for the pin-level fault insertion. The result is a system able to inject practically all types of pin-level faults, and capable of recording extensive information on the target processor (and system) behavior after the injection of each fault. This tracing information is used for the complete