
A Solution to Single Point of Failure Using Voter Replication and Disagreement Detection

A. Patooghy (1), S. Gh. Miremadi (2), A. Javadtalab (1), M. Fazeli (1), N. Farazmand (1)
Department of Computer Engineering, Sharif University of Technology
Azadi Ave., Tehran, Iran
(1) {patooghy, javadtalab, m_fazeli, farazmand}@ce.sharif.edu
(2) miremadi@sharif.edu

Abstract

This paper proposes a method, called distributed voting, to overcome the single point of failure in a TMR system used in robotics and industrial control applications. The method uses time redundancy and is based on TMR with a disagreement detector. It masks faults occurring in the voter so that the TMR system can continue to function properly. The method has been evaluated by injecting faults into Xilinx Virtex-II Pro and Virtex-4 FPGAs. An analytical evaluation is also performed. The results of both evaluations show that the proposed method improves the reliability and the mean time to failure (MTTF) of a TMR system by at least a factor of $(2 - R_V(t))$, where $R_V(t)$ is the reliability of the voter.

1. Introduction

The use of TMR as one of the most popular hardware redundancy methods for reaching high reliability has long been recognized in many papers [1, 2, 10, 11, 14] and applications [9, 12, 13]. In a TMR system, a module is replicated three times and the outputs of the three modules are voted on to reach a majority decision, which masks an incorrect output from any single module. One serious problem in TMR systems is the single point of failure in the voter: if the voter fails, the whole system fails. Several works have tried to increase the reliability of the TMR system.
However, all of these works have been based on at least one of the following assumptions, which simplify the problem and make it unrealistic:

- Non-occurrence of CMF (common mode failure), i.e., the occurrence of the same fault with the same effect in more than one module, leading to more than one incorrect module output and finally to an incorrect voter output.
- Non-occurrence of single point of failure, i.e., failures occurring in the voter are neglected.

The reliability of a TMR system can be modeled as $R_{TMR} = R_M^3 + 3R_M^2(1 - R_M)$, where $R_M$ is the reliability of each module [3]. But this model is too unrealistic because of the above two assumptions.

Some works have tried to mask CMFs by using word-by-word voting [4], or to minimize the occurrence of CMFs by using the concept of design diversity [5, 6]. With design diversity, three different implementations, designs, or technologies are used for the TMR modules, so that if one module is sensitive to a specific type of fault, the other two modules may be robust against it. An example is the control system of the Boeing 777, which uses three different microprocessors, namely a Motorola, an AMD, and an Intel processor, to build the TMR system [7, 8]. A method called TMR with disagreement detector (TMRDD) [10] combines the concept of hybrid redundancy with spare modules to increase the number of faults the system can tolerate.

Although all of the above works try to increase the reliability of TMR, they rely on the two assumptions above, especially the second one. Note that the reliability of the whole TMR system is meaningful if and only if the reliability of the voter is higher than the reliability of the three parallel modules in the TMR system, i.e., $R_{Voter} > R_{TMR}$ [3]. One way to solve the single point of failure problem is to triplicate the voter, which is discussed in more detail in the next section.
However, this solution is applicable only to some specific applications [14].

This paper presents a method to mask permanent and transient faults occurring in the voter of a TMR system used in industrial and robotic control applications. The method combines the concept of TMRDD [10] with time redundancy and the use of n spare voters. The number of faults that can be tolerated depends on the number of spare voters used. The method is experimentally evaluated using Xilinx Virtex-II Pro and Virtex-4 FPGAs. An analytical evaluation also shows an improvement in the reliability and the MTTF of the TMR system.

The rest of the paper is organized as follows. Section 2 reviews related work. Modified TMRDD (MTMRDD) is presented in Section 3. Sections 4 and 5 are devoted to the analytical and experimental evaluations, and the conclusions are given in the last section.

Proceedings of the 2nd IEEE International Symposium on Dependable, Autonomic and Secure Computing (DASC'06), 0-7695-2539-3/06 $20.00 © 2006

2. Related work

One approach to eliminating or minimizing the single point of failure problem in a TMR system is to triplicate the voter. This approach has been used in the design of the fault tolerant multiprocessor system (FTMP) [11, 12, 13], shown in figure 1, and works well when data are transferred between a processor and memory. As figure 1 shows, both the processor and the memory are triplicated. The processors write their results to the memories via one triple voter and read the results from the memories via another triple voter. In this way a fault in a processor, a memory, or a voter can be masked. In effect, the single point of failure in each voter is eliminated by two extra voters that work concurrently with it in the same layer, together with additional majority voting in the other layer.
Although this method works well for data transmission between memory and processor, it cannot be used in other applications such as industrial and robotic control, where only a final output is required. For example, some applications involve only one actuator, where triplicating the voter is not very useful.

Figure 1. FTMP block diagram.

A detailed look at TMRDD is needed at this stage to highlight the single point of failure problem. The block diagram of TMRDD is shown in figure 2 [10]. In this system the module outputs are sent to the voter via a switch, and the voter chooses the majority value as the final output. The disagreement detector finds the faulty module (by comparing the output of every module with the output of the voter) and replaces it with a spare one by sending a suitable command to the switch, hence detecting and masking the failure.

3. Modified TMRDD

As mentioned before, even TMRDD suffers from a single point of failure (in the voter). MTMRDD tries to solve this problem by distributing the decision about fault occurrence in the voter or the disagreement detector, and also tries to overcome CMFs by using design diversity. The block diagram of the proposed method is shown in figure 3.

Figure 2. Block diagram of TMR with disagreement detector.

As in TMRDD, the outputs of the modules are sent to the voter via a switch, but the duty of the disagreement detector is modified. The voter output is sent to a delay unit and, after a specific delay, to the final output. During the delay, the voter output is compared with the module outputs and any faulty module is replaced. (Note that the voter output is digital and the delay unit functions like a shift register.)

Figure 3. MTMRDD block diagram when it can tolerate one failure in the voter or DD.
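As a rough illustration of the TMRDD mechanism described above, the following Python sketch votes on three module outputs and reports which modules disagree with the voter. The function names `majority_vote` and `disagreeing_modules`, and the use of integer output words, are assumptions made for this example; they do not come from the paper or from [10].

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(outputs: Sequence[int]) -> int:
    """Return the value produced by a strict majority of the modules.

    Raises ValueError when no value has a strict majority, a case the
    voter cannot mask.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if 2 * count <= len(outputs):
        raise ValueError("no majority among module outputs")
    return value


def disagreeing_modules(outputs: Sequence[int], voter_out: int) -> List[int]:
    """Zero-based indices of modules whose output differs from the voter."""
    return [i for i, out in enumerate(outputs) if out != voter_out]


# Hypothetical example: the second module produces a wrong word.
outputs = [0x5A, 0x3C, 0x5A]
voted = majority_vote(outputs)
faulty = disagreeing_modules(outputs, voted)
```

In a TMRDD system the indices in `faulty` would drive the switch that swaps in a spare module. Note that this comparison trusts the voter output, which is exactly the single point of failure that MTMRDD addresses.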
If the number of disagreements between the voter output and the module outputs exceeds half the number of modules, the disagreement detector suspects a failure in the voter. It then compares the voter output with the module outputs again, in order to mask a transient fault in the voter. If the number of disagreements exceeds half the number of modules for the second time, the disagreement detector replaces the voter with a spare one; the output of the new voter goes directly to the final output without passing through the delay unit, and the system operates without delay thereafter. If the same problem (more disagreements than half the number of modules) is detected immediately, in the next clock cycle, the disagreement detector concludes that it has itself failed and is replaced. Figure 4 shows the pseudo code of this method. Note that the variable First is initialized to true at the beginning, and N_0 is the number of spares for the voter and DD.

Figure 4. MTMRDD pseudo code.

The advantage of this method is that it solves the single point of failure problem by distributing the decision between two subsystems (the voter and the disagreement detector). It is taken for granted here that the spare components are intact, which is a common assumption [10]. The delay introduced by the delay unit equals the time needed to repeat the check and to replace the voter, which can be neglected in most applications because it is very short.
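The decision procedure described above can be sketched in Python as follows. This is a minimal model under stated assumptions, not the authors' implementation: the names `DetectorState`, `majority_disagrees` and `detection_step`, the string action labels, and the callables used to read outputs are all hypothetical. Re-arming `first` after a clean step is an assumption added here; the text only describes what happens when the problem reappears in the next clock cycle.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class DetectorState:
    """Mirrors the variable First from the text; True at start-up."""
    first: bool = True


def majority_disagrees(module_outs: Sequence[int], voter_out: int) -> bool:
    """True when more than half of the modules differ from the voter."""
    mismatches = sum(1 for out in module_outs if out != voter_out)
    return 2 * mismatches > len(module_outs)


def detection_step(
    state: DetectorState,
    read_modules: Callable[[], Sequence[int]],
    read_voter: Callable[[], int],
) -> str:
    """Run one detection step and return the action that was chosen."""
    if not majority_disagrees(read_modules(), read_voter()):
        state.first = True  # assumption: a clean step re-arms the voter check
        return "no_voter_fault"
    # Second comparison: masks a transient fault in the voter.
    if not majority_disagrees(read_modules(), read_voter()):
        return "transient_masked"
    if state.first:
        state.first = False
        return "replace_voter"
    state.first = True
    return "replace_detector"


# Hypothetical scenario: the modules agree on 7 but the voter reads 0.
state = DetectorState()
first_action = detection_step(state, lambda: [7, 7, 7], lambda: 0)
second_action = detection_step(state, lambda: [7, 7, 7], lambda: 0)
```

In this scenario the first step replaces the voter, and the second step, which sees the same disagreement immediately afterwards, replaces the disagreement detector, matching the order of suspicion described in the text.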
Of course, the number of spare voters and disagreement detectors may be increased as desired, which is a trade-off between cost and reliability. Figure 5 shows a system capable of tolerating two failures in the voter or disagreement detector. What gives this method the ability to mask voter and disagreement detector failures is the delay imposed on the user at the beginning (the instant the system is turned on), which remains in the system as time redundancy to be used when a failure occurs. Thus the method is suitable for applications such as robotics (where delays on the order of milliseconds exist in the step motors) but not for real-time applications.

4. Analytical evaluation

4.1. Discussion

Assuming exponential distributions with mean values $1/\lambda_M$, $1/\lambda_D$ and $1/\lambda_V$ for failure occurrence in each module, the disagreement detector and the voter, respectively, and with mean value $1/\mu$ for the replacement of modules, figures 6 and 7 show the Markov models of the TMRDD and MTMRDD systems. (Figure 7 is the model of a system with one spare voter and disagreement detector.) The parameters used are:

- $\lambda_V$: voter failure rate
- $\lambda_D$: disagreement detector failure rate
- $\lambda_M$: failure rate of each TMR module
- $\mu$: replacement rate of each TMR module

In both figures 6 and 7, the name of each state is an ordered triple whose elements represent, respectively, the number of healthy modules, voters and disagreement detectors in that state. The reliability of both systems is $1 - p_f$, where $p_f$ is the total probability of being in the colored states. The transient analytical solution of these two Markov models involves tedious manipulations and is outside the scope of this work. However, a comparison of figures 6 and 7 reveals that MTMRDD has four times as many intact states, while the number of failed states has not increased as steeply, implying an improvement in reliability. In the next step, the reliability of the two systems is compared using probabilistic methods.
Figures 8 and 9 sketch the serial/parallel models of the TMRDD and MTMRDD systems, with the reliability of each block shown on it. As before, $\lambda_V$, $\lambda_D$ and $\lambda_M$ denote the failure rates of the voter, the disagreement detector and each of the three modules, respectively. The reliabilities of the three subsystems are given by

$R_M(t) = e^{-\lambda_M t}$, $R_V(t) = e^{-\lambda_V t}$, $R_D(t) = e^{-\lambda_D t}$ [10],

so that the reliabilities of the two systems are

$R_{TMRDD}(t) = \left[ R_M^3(t) + \binom{3}{2} R_M^2(t)\,(1 - R_M(t)) \right] R_V(t)\, R_D(t)$

$R_{MTMRDD}(t) = \left[ R_M^3(t) + \binom{3}{2} R_M^2(t)\,(1 - R_M(t)) \right] \left[ R_V^2(t) + \binom{2}{1} R_V(t)\,(1 - R_V(t)) \right] \left[ R_D^2(t) + \binom{2}{1} R_D(t)\,(1 - R_D(t)) \right]$

Noting that $0 \le R_X(t) \le 1$, the following relations are deduced:

$R_V^2(t) + \binom{2}{1} R_V(t)\,(1 - R_V(t)) \ge R_V(t)$

$R_D^2(t) + \binom{2}{1} R_D(t)\,(1 - R_D(t)) \ge R_D(t)$

and hence $R_{MTMRDD}(t) \ge R_{TMRDD}(t)$.

MTMRDD pseudo code (Figure 4):

    Begin
      FirstTime := 0;
    again:
      MisMatchCount := 0;   { restart the count on every comparison pass }
      For i := 1 to N0 do
        If Module(i).out <> Voter.out then
          MisMatchCount := MisMatchCount + 1;
      If (MisMatchCount >= 2) and (FirstTime = 0) then
      begin
        FirstTime := 1;     { compare once more to mask a transient voter fault }
        Goto again
      End;
      If (MisMatchCount >= 2) and (FirstTime = 1) then
        If (First) then
        begin
          FirstTime := 0;
          First := False;
          Replace Voter with NewVoter;
          Send NewVoter.out to System.out Without delay;
        End
        Else
        begin
          Replace DisAgreement with NewDisAgreement;
          First := True;
        End;
    End.

Figure 5. MTMRDD block diagram when it can tolerate two failures in the voter or DD.

Figure 6. MTMRDD Markov model.

Figure 7. TMRDD Markov model.

Figure 8. Serial parallel model of TMRDD.

Figure 9. Serial parallel model of MTMRDD.

4.2. Results

Figure 10 compares the reliability of MTMRDD with one or two spares for the voter and DD against traditional TMRDD, obtained from the numerical solution of the Markov models of these systems. As figure 10 shows, the area under the MTMRDD curve is larger, which implies an improvement in MTTF. Figure 10 also shows that, at any instant, MTMRDD with N_0 = 2 has the highest reliability and TMRDD the lowest. Similarly, for a fixed reliability level, MTMRDD sustains it for a longer period of time, which is desirable in long-life applications. The difference between the reliability of TMRDD and MTMRDD grows as the number of spare voters and DDs increases.

Figure 10. Reliability of TMRDD and MTMRDD with N_0 = 1, 2 spare(s) for voter and disagreement detector. (Four panels; x-axis: Time Cycles, logarithmic; y-axis: Reliability, 0 to 1; curves: TMRDD, MTMRDD N0=1, MTMRDD N0=2.)

5. Experimental evaluation

The proposed system has also been evaluated using Xilinx Virtex-II Pro and Virtex-4 FPGAs. In these two experiments, an 8-bit counter and a Z80 microprocessor were used as the modules. The HDL code of each system with N_0 = 0, 1, 3 and 7, i.e., TMRDD and MTMRDD with 1, 3 and 7 spares for the voter and DD, was synthesized, and faults were injected randomly into the voter, DD or modules by a fault injector module included in the system. It is assumed that each injected fault results in a failure of the voter, DD or module. Faults are assumed to occur exponentially distributed in time with mean rate $\lambda$, which is implemented in hardware by an exponential random generator added to the HDL code of the system. The relation between uniformly and exponentially distributed random numbers is

$x = -\frac{\ln(1 - U)}{\lambda}$

where $U$ is a uniform random number and $x$ is the exponential random number. When the exponential random generator module activates the injection signal, a fault is injected into the voter, DD or modules. The signal also commands the exponential random generator module to wait until the generated time has elapsed and then to activate the injection signal again, provided at least one other spare remains. For example, with N_0 = 3 the exponential random generator module generates 4 exponential random numbers. The MTTF can be calculated as the sum of these times. This experiment was repeated 1000 times for each N_0, and the final results are the averages over these runs. Figure 11 shows the MTTF for different working conditions and several values of $\lambda$. The hardware implementation results confirm the analytical results. We also logged the hardware overhead of the system during synthesis; tables 1 and 2 report it as a measure of the efficiency of the method.
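The fault-timing part of this experiment can be mimicked in software. The Python sketch below uses the inverse-transform relation between uniform and exponential random numbers given above, and sums N_0 + 1 inter-fault times per run. It is a simplified model under an explicit assumption: every injected fault consumes one spare, so the system fails after N_0 + 1 faults, as in the N_0 = 3 example. The function names, the seed handling and the default of 1000 runs are choices made for this illustration, not details of the HDL fault injector.

```python
import math
import random


def exponential_sample(rate: float, rng: random.Random) -> float:
    """Inverse-transform sampling: x = -ln(1 - U) / rate, U uniform on [0, 1)."""
    u = rng.random()
    return -math.log(1.0 - u) / rate


def time_to_failure(rate: float, n_spares: int, rng: random.Random) -> float:
    """Sum of n_spares + 1 exponential inter-fault times for one run."""
    return sum(exponential_sample(rate, rng) for _ in range(n_spares + 1))


def estimate_mttf(rate: float, n_spares: int, runs: int = 1000, seed: int = 0) -> float:
    """Average time to failure over a number of independent runs."""
    rng = random.Random(seed)
    total = sum(time_to_failure(rate, n_spares, rng) for _ in range(runs))
    return total / runs


# Hypothetical settings: fault rate 0.003 per cycle, three spares.
mttf_estimate = estimate_mttf(rate=0.003, n_spares=3)
```

Under this simplification the expected time to failure is $(N_0 + 1)/\lambda$, because it is a sum of $N_0 + 1$ independent exponential times, so the estimate should approach that value as the number of runs grows.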
Figure 11. MTTF of MTMRDD with N_0 = 1, 3 and 7 spares and different fault rates. (x-axis: Lambda, 0 to 0.012; y-axis: MTTF, logarithmic, 1.0E+00 to 1.0E+06; curves: TMRDD, MTMRDD N0=1, MTMRDD N0=3, MTMRDD N0=7.)

Table 1. Synthesis results of an 8-bit counter, using MTMRDD on a Xilinx Virtex-II Pro FPGA.

  Number of spares (N_0)          0 (TMRDD)    1            3            7
  FPGA flip flops used (1)        24 (0%)      38 (1%)      68 (2%)      136 (4%)
  4-input LUTs used (2)           52 (1%)      91 (3%)      141 (5%)     286 (10%)
  Time overhead (cycles)          0            2            6            14

  (1) Total number of FPGA flip flops is 2816.
  (2) Total number of 4-input LUTs is 2816.

Table 2. Synthesis results of a Z80, using MTMRDD on a Xilinx Virtex-4 FPGA.

  Number of spares (N_0)          0 (TMRDD)    1            3            7
  Slice flip flops used (1)       805 (7%)     737 (6%)     707 (6%)     693 (6%)
  4-input LUTs used (2)           6938 (63%)   6746 (61%)   6814 (62%)   6712 (61%)
  Time overhead (cycles)          0            2            6            14

  (1) Total number of FPGA flip flops is 10944.
  (2) Total number of 4-input LUTs is 10944.
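Returning to the analytical comparison of Section 4.1, the serial/parallel reliability expressions can be evaluated numerically. The Python sketch below computes both system reliabilities for exponential failure laws and forms their ratio. The failure-rate values and all function and variable names are hypothetical choices for illustration; only the structure of the formulas follows the serial/parallel models of figures 8 and 9 for one spare voter and one spare disagreement detector.

```python
import math


def r_exp(rate: float, t: float) -> float:
    """Reliability of a unit with constant failure rate: exp(-rate * t)."""
    return math.exp(-rate * t)


def two_of_three(r_m: float) -> float:
    """Module core: all three modules work, or exactly two of them do."""
    return r_m**3 + 3 * r_m**2 * (1 - r_m)


def one_of_two(r: float) -> float:
    """Unit with one spare: both work, or exactly one of the two does."""
    return r**2 + 2 * r * (1 - r)


def r_tmrdd(t: float, lam_m: float, lam_v: float, lam_d: float) -> float:
    """TMRDD: module core in series with one voter and one detector."""
    return two_of_three(r_exp(lam_m, t)) * r_exp(lam_v, t) * r_exp(lam_d, t)


def r_mtmrdd(t: float, lam_m: float, lam_v: float, lam_d: float) -> float:
    """MTMRDD with one spare voter and one spare detector."""
    core = two_of_three(r_exp(lam_m, t))
    return core * one_of_two(r_exp(lam_v, t)) * one_of_two(r_exp(lam_d, t))


# Hypothetical failure rates per time cycle, evaluated at t = 10**4 cycles.
LAM_M, LAM_V, LAM_D = 1e-4, 1e-5, 1e-5
t = 1e4
ratio = r_mtmrdd(t, LAM_M, LAM_V, LAM_D) / r_tmrdd(t, LAM_M, LAM_V, LAM_D)
```

Since $R^2 + 2R(1 - R) = R(2 - R)$, the ratio equals $(2 - R_V(t))(2 - R_D(t))$, which is never smaller than $(2 - R_V(t))$. This agrees with the improvement factor stated in the abstract.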