Hierarchical Modelling of Complex Control Systems: Dependability Analysis of a Railway Interlocking

International Journal of Computer Systems Science & Engineering, vol 16 no 4, July 2001
Comput Syst Sci & Eng (2001) 4: 249–261. © 2001 CRL Publishing Ltd

Andrea Bondavalli*, Manuela Nelli†¶, Luca Simoncini‡ and Giorgio Mongardi§

* DIS, University of Florence, Via Lombroso 6/17, I-50134 Firenze, Italy. Email: a.bondavalli@dsi.unifi.it
† CNUCE Istituto del CNR, Via Vittorio Alfieri 1, 56010 Ghezzano (Pisa), Italy
‡ ANSALDO Segnalamento Ferroviario, Piossasco, Torino, Italy
¶ Manuela Nelli is now with ATOS SpA, Via Gonin 55, 20147 Milano, Italy

This paper reports on our experience in building a model and analysing the dependability of an actual railway station interlocking control system. Although our analysis was restricted to the Safety Nucleus subsystem, mastering complexity and size required considerable effort. We identified a modelling strategy based on a modular, hierarchical decomposition that allows different methods and tools to be used for modelling at the various levels of the hierarchy. This multi-layered modelling methodology led to an accurate representation of the system behaviour and allowed us (i) to keep the size of the models within the different levels under control, so that they could be easily managed by the automatic tools, and (ii) to make changes in the model easily and cheaply. The paper also contains examples of the extensive analyses performed regarding the sensitivity of the dependability measures to variations of critical parameters and the validation of the assumptions made.

Keywords: computer based interlocking systems, analytical modelling and evaluation, hierarchical modelling methodology, unsafety, reliability, availability, sensitivity analysis

1. INTRODUCTION

Today critical systems like the railway station interlocking systems employed in all technologically advanced countries are controlled by computers, mainly to cope with the increasing complexity of the control operations, which is a source of failures. Several systems have been built [e.g. 1–6] and have been in use for a few years by those Railway Authorities wishing to have a good cost/benefit ratio. To ensure adequate dependability of the systems coming into operation, many standards exist (like the new ERTMS [7–9]) which establish targets on the dependability attributes, like safety, reliability and availability, and prescribe methodologies for system specification, design, verification and validation.

As a consequence, evaluations are necessary for system assessment and fault forecasting. These evaluations may be performed both during the design phase, in order to test whether the behaviour of the architecture under study is as expected or to compare different solutions, and during the verification phase, to assess whether the developed system satisfies the assigned targets. System assessment can be performed using several approaches, such as testing and fault injection, often combined with analytical models. The modelling approach is generally cheap for manufacturers and has proven to be useful in all the phases of the system life cycle. During the design phase, models allow us to compare different solutions and to select the most suitable one (among those obeying other design constraints), and to highlight problems within the design. In assessing an already built system, models make it possible to characterise specific aspects, to detect dependability bottlenecks and to suggest solutions to be adopted in future releases. Several papers in the literature address dependability analysis [e.g. 10–15]. In addition, many of the existing modelling techniques, like Markov Chains, Stochastic Petri Nets and Stochastic Activity Networks [16], are supported by tools like UltraSAN [17], SURF-2 [18] and others [19] that help in building and solving models. However, these computer controlled systems, compared to the old electromechanical ones, pose non-trivial problems in their design and analysis. The most difficult parts are those where the interactions between the redundant hardware and the application software have a critical impact on system safety. These interactions influence the modelling complexity, since they induce stochastic dependencies that must be taken into account in modelling the behaviour of components and their interactions.

Such complexity also exacerbates some modelling problems that require special care. The first problem is complexity itself: although building models of simple mechanisms may be easy, the overall description of critical complex systems, accounting at the same time for all the relevant aspects, is not trivial at all.
To master complexity, a modelling methodology is needed so that only the relevant aspects are detailed while still allowing numerical results to be effectively computed: for instance, if a model specifies too many details, the number of its states may explode, giving rise to processing problems.

Models may need many parameters (whose meaning is not always intuitive for designers), and determining the values to assign to them (usually by way of experimental tests) may be very difficult. In addition, simplifying hypotheses are very often necessary to keep the model manageable; the choice of such hypotheses is critical. Making assumptions, on the one hand, allows us to obtain simpler models but, on the other, leads to approximations of the system behaviour. The resulting error should always be estimated, either through sensitivity analyses or by comparing the results returned by the model containing the assumption with those of a model in which it has been released. A feasible modelling approach starts with simple models, which are then made more complex and detailed by releasing those assumptions that have an unacceptable impact on the results obtained.

We gained experience in building a model of an actual critical system. This system, developed by Ansaldo Trasporti, is a railway station interlocking control system: the ACC CIS (Computer Interlocking System) [4], made of redundant replicated hardware and redundant diverse software. In this paper we describe our modelling experience, taking into account all aspects of their interactions (including correlation between the diverse software variants) and the criticality of the various components. Our approach has been to build the system model in a structured way. This allows us to cope with complexity and to focus, where interesting, on specific behaviour for a more detailed analysis.
Structuring the model in different levels separated by well identified interfaces makes it possible to realise each level with a different methodology and to evaluate it with a different tool, without the need to modify the general structure of the model. Each level has been subdivided into several sub-levels for a finer analysis of some characteristics. Despite our effort to reduce the complexity of the individual levels, some of them remained complex: these have been realised using different methodologies in order to compare and validate the models used. The paper also contains examples of the extensive analyses performed regarding the sensitivity of the dependability measures to variations of critical parameters and the validation of the assumptions made.

The paper is organised as follows. Section 2 contains an overview of the Ansaldo ACC CIS system; Section 3 describes our assumptions, the meaning of the basic parameters used and our modelling approach. Sections 4 and 5 describe the various models for one execution and for the mission, respectively. Section 6 contains a few examples of the evaluations we performed: measurements of the sensitivity of the dependability attributes to variations of critical parameters and of the effects of releasing assumptions. Finally, Section 7 concludes the paper.

2. ANSALDO ACC CIS SYSTEM

The ACC CIS system [4] is built by Ansaldo for railway station signalling control. The CIS is structured in two subsystems, as shown in Figure 1: one devoted to vital functions, the other to supervision functions. The former, called the Vital Section, comprises the Safety Nucleus (SN) and a number of Trackside Units (TU), depending on the station size, that communicate the state of the station to the SN through a proprietary serial bus.
The latter performs Operations and Indications, and Alarm, Recording and Telecontrol functions (OI-ART); it is made of a number of processing nodes connected through a LAN and is located close to the SN in the Central Post. This subsystem allows continuous monitoring of the system state and the recording of events, which is useful for making estimations and finding the less reliable sections.

Figure 1 Structure of the Ansaldo ACC CIS system

The Safety Nucleus [20] is the core part of the system, and its structure is partially reported in Figure 2. It comprises six units with a separate power supply unit. The three Nsi sections represent three computers connected in a TMR configuration, i.e. working with a ‘2-out-of-3’ majority: three diverse software programs, iteratively performing the same tasks, run inside three identical hardware sections. The Exclusion Logic is a fail-safe circuit whose job is to electrically isolate the section that the TMR indicated should be excluded. The activation/de-activation unit is a device that switches on and controls the power supply units. The video switching unit controls the video images to be transmitted to the monitors of the operator terminal.

The system is designed to keep on running even after the failure of one section. In such a case the section is excluded and the system is reconfigured to run with only two sections with a ‘2-out-of-2’ majority, until the failed section is restored (usually after a few minutes). If a disagreement is found while all three sections are active, the disagreeing section recovers the correct value and participates in the next loop; if one section disagrees twice in a row, it is excluded. No disagreement is tolerated when only two sections are active.
The TMR sections carry out the same functions; the hardware, the basic software architecture and the operating environment are exactly the same, while ‘design diversity’ [21] was adopted in the development of the software application modules. Each section is composed of two physically separated units which carry out different functions in parallel:
• GIOUL (operator interface manager and logical unit): executes the actual processing and manages the interactions with the Operator Terminal and the OI-ART subsystem;
• GP (trackside manager): manages the communications with the Trackside Units and modifies, whenever necessary, the commands given by GIOUL.

The processing loops last 1 second for GIOUL and 250 msec for GP: as a result, the communications between the GIOUL and the GP belonging to the same TMR section are performed every second (GIOUL loop), i.e. once every four GP loops. The communications between units of the same type (between the three GIOUL units and, separately, between the three GP units) are instead carried out at every processing loop.

Each TMR unit (GIOUL and GP separately) votes on the state variables and processing results. If it finds a single inconsistency between its results when three sections are active, it can identify and recover a presumably correct state and continue processing. If one section disagrees twice in a row, it is excluded. No disagreement is tolerated when only two sections are active. Besides voting on software, each unit controls communications and tests the functionality of its internal boards. Based on hardware test results, a section can decide to exclude itself from the system. Diagnostic tests are carried out during normal unit operation; they are implemented so that they do not modify the operating environment. Each section is also able to detect malfunctions in its databases and, in that case, to decide to exclude itself.
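The voting and exclusion rules described above can be illustrated with a minimal Python sketch. It is not the Ansaldo implementation: the class name `TmrVoter`, the `SafeStop` exception and the choice to stop safely when no majority exists are assumptions made only for illustration.

```python
from collections import Counter


class SafeStop(Exception):
    """Assumed fail-safe reaction when no trustworthy majority exists."""


class TmrVoter:
    """Hypothetical 2-out-of-3 voter that degrades to 2-out-of-2."""

    def __init__(self, sections=(1, 2, 3)):
        self.active = set(sections)
        # Sections that disagreed during the previous loop.
        self.disagreed_last_loop = set()

    def vote(self, outputs):
        """outputs maps each active section id to its result for this loop."""
        if set(outputs) != self.active:
            raise ValueError("need exactly one output per active section")
        value, votes = Counter(outputs.values()).most_common(1)[0]
        if len(self.active) == 2:
            # No disagreement is tolerated with only two active sections.
            if votes < 2:
                raise SafeStop("disagreement in 2-out-of-2 configuration")
            return value
        if votes < 2:
            raise SafeStop("no 2-out-of-3 majority")
        dissenters = {s for s, v in outputs.items() if v != value}
        # A section that disagrees twice in a row is excluded.
        repeat_offenders = dissenters & self.disagreed_last_loop
        self.active -= repeat_offenders
        self.disagreed_last_loop = dissenters - repeat_offenders
        return value


voter = TmrVoter()
voter.vote({1: "a", 2: "a", 3: "b"})  # section 3 disagrees once, stays active
voter.vote({1: "a", 2: "a", 3: "c"})  # second disagreement in a row: excluded
print(sorted(voter.active))
```

A single disagreement only marks the section; the mark is cleared as soon as the section agrees again, which is why the model described later has to remember which section disagreed in the previous loop.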
In addition to these tasks, GIOUL has to manage the communications with the Operator Interface and to perform tests on keyboard inputs as well. If an error is detected, a signal is displayed.

The ACC CIS is a critical system whose failures or unavailability can have catastrophic consequences, both economic and in terms of human life. The constraints to be satisfied by the Safety Nucleus are a probability of catastrophic failure less than or equal to 1E-5 per year (according to IEC 1508 for SIL 4 systems) and no more than 5 minutes of down time over 8600 hours (i.e. availability higher than or equal to 0.99999).

3. ASSUMPTIONS AND MODELLING APPROACH

3.1 Assumptions and basic parameters

We restricted our modelling effort to the Safety Nucleus, the most relevant part of the system. Our model includes neither the OI-ART subsystem nor the Trackside Units, but represents the overall functionality of the Nucleus, including the main features and the interactions among the different components. The main components that must be considered in modelling the system are: the hardware, the software, the databases and, only for GIOUL, the acceptance test on the input from the Operator Terminal. Hardware aspects cover the internal boards and the physical characteristics of the communications, while software aspects cover the operating system and the software modules that are sequentially activated during the processing loops. The databases, whose control represents one of the ways of detecting errors in various modules, cover both hardware and software aspects: a database malfunction can be due either to corruption of memory cells or to an error of the managing software. One of the tasks that GIOUL has to perform is checking the correctness of the inputs issued by the operator terminal keyboard before transmitting them to the other modules for processing; this check is very important, since it can prevent the system from sending wrong commands to the Trackside Units.
For this reason the software module performing this check is not considered together with the other software modules of GIOUL. We also chose not to model in detail the system while an excluded section is being restored. More precisely, we account for the time required to restore a section but we neglect the particular configurations that GIOUL and GP can assume during that period.

Figure 2 Structure of the safety nucleus

Despite our decision to restrict the analysis to the Safety Nucleus, a considerable effort has been necessary in order to account for (i) the complexity of the system, (ii) a proper trade-off between detailing the relevant mechanisms and an explosion of the model, (iii) the need for simplifying assumptions and (iv) the evaluation of the errors introduced, with the consequent requirement to release some of the assumptions. We identified a modelling strategy based on a modular, hierarchical decomposition such that (i) different methods and tools may be used for modelling at the various levels of the hierarchy, selecting the method which appears to be the most appropriate, (ii) each model is small enough not to result in computational explosion, and (iii) specific aspects are confined to a few sub-models, so that modifications do not require the model to be completely redefined.
This has proven useful for analysing the impact of the various assumptions and for deciding which ones could be accepted and which had to be relaxed.

To start with, we built a simple model adopting the following simplifying assumptions:
1. ‘Compensation’ among errors never happens.
2. The Video Switching, the Activation/de-Activation and the (external fail-safe) Exclusion Logic units are considered reliable.
3. The module that performs majority voting within GIOUL and GP is considered reliable.
4. The Exclusion Management module within GIOUL and GP is considered reliable.
5. Identical erroneous outputs are produced only by correlated errors, while independent errors in the different units are always distinguishable by the voting.
6. Symmetry: the error probabilities of GIOUL and GP are the same for the three sections.
7. The hardware communication resources of the Nucleus are considered together with the other hardware aspects; the software dealing with communications is assumed reliable.
8. Errors affecting different components of the same unit (GIOUL or GP) are statistically independent.
9. During one execution, both GIOUL and GP may suffer from many errors, at most one for each component (software, hardware, databases and acceptance test for GIOUL).
10. The execution of each iteration is statistically independent of the others.
11. GIOUL units receive identical inputs from the keyboard.

Then, many of these assumptions were released (one at a time) to check their impact. In the rest of this paper we show our analyses concerning the release of assumptions 5 and 3. While assumption 5 has been completely released by simply considering the event ‘two or three wrong results caused by independent errors may constitute a wrong majority’, we have substituted assumption 3 with more pessimistic assumptions about the voter behaviour. First we considered assumption 3a and then assumption 3b, which is more conservative:
3a. When the voter fails, (i) if a majority exists, the voter does not recognise it; (ii) if a majority does not exist, the voter selects a wrong result.
3b. If the voter fails, independently of the existence of a majority, it selects a wrong result.

The definitions of the basic events we have considered and the symbols used to denote their probabilities are reported in Table 1.

3.2 Modelling approach

The modelling approach is strongly related to the type of system considered, in which we have synchronous periodic executions of the GIOULs and GPs. Each iteration can be modelled independently of the modelling of a complete mission. The model was conceived in a modular and hierarchical fashion, structured in layers. Each layer has been structured to produce some results while hiding implementation details and internal characteristics: output values from one layer may thus be used as parameters of the next higher layer. In this way the entire modelling effort can be handled simply. Further, different layers can be modelled using different tools and methodologies: this leads to flexible and changeable sub-models, so that one can vary the accuracy and detail with which specific aspects are studied. The specific structure of each sub-model depends both on the system architecture and on the measurements and evaluations to be obtained.
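To give a feel for how a one-execution result feeds a mission-level figure, the following back-of-the-envelope Python sketch translates the targets quoted in Section 2 into a per-execution budget. It is only an illustration under strong simplifications that the paper does not make: it relies on assumption 10 (independent executions), ignores the memory of previous disagreements and the two-section configuration, and takes a year to be the 8600 operating hours used in the availability requirement. All function names are hypothetical.

```python
import math

# Assumption: one year of operation equals the 8600 hours quoted for the
# availability requirement; one execution lasts 1 second.
OPERATING_HOURS = 8600
EXECUTIONS_PER_YEAR = OPERATING_HOURS * 3600


def per_execution_budget(yearly_target, executions=EXECUTIONS_PER_YEAR):
    """Largest per-execution failure probability p such that
    1 - (1 - p) ** executions <= yearly_target, for independent executions."""
    return -math.expm1(math.log1p(-yearly_target) / executions)


def mission_failure_probability(p, executions=EXECUTIONS_PER_YEAR):
    """Probability of at least one failure over the given number of
    independent executions, each failing with probability p."""
    return -math.expm1(executions * math.log1p(-p))


def availability_from_downtime(downtime_minutes, hours=OPERATING_HOURS):
    """Fraction of time the system is up, given the allowed down time."""
    return 1.0 - downtime_minutes / (hours * 60)


budget = per_execution_budget(1e-5)
print(f"per-execution catastrophic failure budget: {budget:.3e}")
print(f"availability with 5 minutes of down time: {availability_from_downtime(5):.7f}")
```

The `log1p`/`expm1` pair is used because the probabilities involved are far below the resolution of a naive `1 - (1 - p) ** n` evaluation in floating point.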
The model of the Safety Nucleus of the Ansaldo CIS system we have built, shown in Figure 3, can be split into two main parts: the first part deals with one execution and computes the probabilities of success or failure; the second one, building on this, allows the evaluation of the dependability attributes for an entire mission.

Table 1 Basic error types and symbols used to denote their probabilities

Error type (event) | Symbol (GIOUL) | Symbol (GP)
independent error in a unit caused by hardware fault | qhl | qhp
an error caused by hardware fault is not detected by the diagnostics | qhdl | qhdp
spurious error: the diagnostic errs detecting a (non-present) error due to independent hardware fault | qhndl | qhndp
an independent error in a database is detected | qbrl | qbrp
an independent error in a database is not detected | qbnrl | qbnrp
correlated error between three databases | q3bdl | q3bdp
correlated error between two databases | q2bdl | q2bdp
independent software error in a unit | qil | qip
correlated software error between three units | q3vl | q3vp
correlated software error between two units | q2vl | q2vp
independent error of the acceptance test in a unit (it may either accept a wrong input or reject a correct one) | qail | -
correlated error between the acceptance tests of three units accepting the same wrong input | q3al | -
correlated error between the acceptance tests of two units accepting the same wrong input | q2al | -
two independent software errors indistinguishable by the voter (only in the model where assumption 5 is released) | qdl | qdp
at least two of three independent software errors indistinguishable by the voter (only in the model where assumption 5 is released) | qdtl | qdtl
independent error of the voter (only in the model with the assumptions 3a and 3b) | qvl | qvp

In the previous section we explained that if a disagreement is found while all three sections are active, GIOUL and GP can identify and recover the correct value and participate in the next loop.
This holds for one single disagreement: if one GIOUL or GP disagrees twice in a row, the entire section (GIOUL and GP) is excluded at the end of the current loop. Therefore, in order to represent the actual system behaviour as closely as possible, we had to build several models that keep memory of a previous disagreement of one section at the beginning of the execution. To describe the GIOUL and GP TMR units at level 1 (Figure 3) we defined:
• five sub-models of the behaviour of GP in configurations 3h, 3h.1, 3h.2, 3h.3, 2h (3h.x means that section x (1, 2 or 3) disagreed during the previous loop);
• five sub-models of the behaviour of GIOUL (3h, 3h.1, 3h.2, 3h.3, 2h).

One system execution (level 2) lasts 1 second; it is composed of one GIOUL (1 second) and four GP (250 msec) iterations and could also be considered as a brief one-second mission. Due to the need to keep memory of previous disagreements, the system can be in 17 different states, and 17 different models for one execution have been defined: one models the system when only two sections are active, while the remaining ones describe the system with three active sections (level 2 in Figure 3):
• 3h/3h: GIOUL and GP are correctly working at the beginning of the execution.
• 3h.x/3h: the GIOUL of section x (1, 2 or 3) disagreed during the previous loop, while GP is correctly working.
• 3h/3h.y: the GIOUL units are correctly working, while the GP of section y (1, 2 or 3) disagreed during the previous (GP) execution.
• 3h.x/3h.y: both the GIOUL of section x and the GP of section y disagreed during the previous loop (x and y can represent the same section or different ones).
• 2h/2h: execution begins with only two active sections.

Each of the 17 models uses, in different combinations and sequences, the same basic objects of level 1 and describes the essential characteristics of the Safety Nucleus.
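The count of 17 configurations follows directly from the list above: each of GIOUL and GP can start a three-section execution in one of four states (3h, 3h.1, 3h.2, 3h.3), giving 16 combinations, plus the single two-section configuration. A short Python sketch of the enumeration (the function names are illustrative, not taken from the paper):

```python
from itertools import product

SECTIONS = (1, 2, 3)


def unit_states():
    """States of one TMR unit (GIOUL or GP) with three active sections:
    no pending disagreement, or section x disagreed in the previous loop."""
    return ["3h"] + [f"3h.{section}" for section in SECTIONS]


def one_execution_configurations():
    """Initial configurations GIOUL/GP of the one-execution models."""
    three_sections = [f"{gioul}/{gp}"
                      for gioul, gp in product(unit_states(), repeat=2)]
    return three_sections + ["2h/2h"]


configurations = one_execution_configurations()
print(len(configurations))  # 17
print(configurations[:5])
```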
The models of level 2 are conceived to compute (and to provide to level 3) the following probabilities:
• probability of safe failure of ‘one-execution’: the probability that the system fails during one execution and stops, avoiding catastrophic damage (this is ensured by the ACC system, which is designed so that it stops when malfunctions occur, forcing devices and subsystems to lock in a safe state).
• probability of catastrophic failure of ‘one-execution’: the probability that the Nucleus, failing, keeps on sending erroneous commands, causing serious damage.
• probability of success of ‘one-execution’: the probability that the system performs an entire one-second mission correctly. This implies that the system is ready to start the next execution. It is actually composed of the probabilities of reaching each of the 17 possible configurations.

The mission model (level 3), which accounts for all the models of the system related to one execution, describes the system behaviour over time. Once both the single-execution and the mission models had been constructed, we focused on which kinds of measurements are required. For our highly critical system, reliability, safety and availability have been evaluated. While reliability and safety can both be obtained by computing the probabilities of catastrophic and safe failure at time t, defined as the duration of the mission, availability required the definition of a specific availability model.

4. MODELS FOR ONE EXECUTION

Two methodologies have been adopted to build the models for ‘one-execution’: discrete time Markov chains, which were drawn manually and evaluated using ‘Mathematica’, and Stochastic Activity Networks, which were solved directly using the software tool UltraSAN [17]. Since Markov chains are often impractical, even if they provide symbolic results, UltraSAN has been adopted in order to avoid building 17 repetitive models using only Markov chains.
Only two models (3h/3h and 2h/2h) have been completely built using Markov chains, in order to test and validate the results obtained with UltraSAN. This redundancy in building models has been very useful: some errors that occurred during the model development phase were detected. UltraSAN has been a good choice, since we could develop one single model that allowed us to compute the results for the 17 different scenarios. In fact, by assigning different values to the variables of the model, thus representing different initial markings, we could represent the different states of the system and account for previous failures of the various sub-components. The model is also able to distinguish the various configurations without having to replicate the unchanged aspects. Only two of the seventeen models were checked using Markov chains, but those two models are the most relevant ones and cover all the scenarios that need to be represented. The results obtained by UltraSAN and Markov chains, using the same values for the parameters, were in

Figure 3 High level model of the Ansaldo CIS safety nucleus
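The way level 2 results feed the mission model of level 3 can be sketched as a discrete time Markov chain with two absorbing failure states. The Python sketch below is only an illustration: it collapses the 17 configurations into two (3h/3h and 2h/2h), and every probability in it is a made-up placeholder, not a value from the paper. Reliability is taken as the probability of no failure of either kind and safety as the probability of no catastrophic failure, which is one common reading of the definitions given above.

```python
STATES = ["3h/3h", "2h/2h", "safe_failure", "catastrophic_failure"]

# Hypothetical one-execution transition probabilities (NOT from the paper).
# Each entry is the probability of moving from the row state to another
# state during a single one-second execution.
OFF_DIAGONAL = {
    "3h/3h": {"2h/2h": 1e-4, "safe_failure": 1e-7, "catastrophic_failure": 1e-10},
    "2h/2h": {"3h/3h": 1e-2, "safe_failure": 1e-5, "catastrophic_failure": 1e-9},
    "safe_failure": {},          # absorbing
    "catastrophic_failure": {},  # absorbing
}


def build_matrix(states, off_diagonal):
    """Build a row-stochastic matrix; the self-loop takes the remaining mass."""
    matrix = []
    for src in states:
        row = [off_diagonal[src].get(dst, 0.0) for dst in states]
        row[states.index(src)] = 1.0 - sum(row)
        matrix.append(row)
    return matrix


def mission_distribution(matrix, start, executions):
    """State distribution after the given number of executions."""
    size = len(matrix)
    dist = [0.0] * size
    dist[start] = 1.0
    for _ in range(executions):
        dist = [sum(dist[i] * matrix[i][j] for i in range(size))
                for j in range(size)]
    return dist


def reliability_and_safety(dist):
    """Reliability: no failure at all; safety: no catastrophic failure."""
    safe, catastrophic = dist[2], dist[3]
    return 1.0 - safe - catastrophic, 1.0 - catastrophic


P = build_matrix(STATES, OFF_DIAGONAL)
one_hour = mission_distribution(P, STATES.index("3h/3h"), 3600)
reliability, safety = reliability_and_safety(one_hour)
print(f"reliability after one hour: {reliability:.9f}")
print(f"safety after one hour:      {safety:.12f}")
```

In the actual model each row would come from one of the 17 one-execution models, so changing a level 1 parameter only requires recomputing the level 2 outputs, not redrawing the mission chain.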