Byzantium: Byzantine-Fault-Tolerant Database Replication Providing Snapshot Isolation

Nuno Preguiça¹, Rodrigo Rodrigues², Cristóvão Honorato³, João Lourenço¹
¹ CITI/DI-FCT-Univ. Nova de Lisboa
² Max Planck Institute for Software Systems (MPI-SWS)
³ INESC-ID and Instituto Superior Técnico

Abstract

Database systems are a key component behind many of today’s computer systems. As a consequence, it is crucial that database systems provide correct and continuous service despite unpredictable circumstances, such as software bugs or attacks. This paper presents the design of Byzantium, a Byzantine fault-tolerant database replication middleware that provides snapshot isolation (SI) semantics. SI is very popular because it allows increased concurrency when compared to serializability, while providing similar behavior for typical workloads. Thus, Byzantium improves on existing proposals by allowing increased concurrency and not relying on any centralized component. Our middleware can be used with off-the-shelf database systems and it is built on top of an existing BFT library.

1 Introduction

Transaction processing database systems form a key component of the infrastructure behind many of today’s computer systems, such as e-commerce websites or corporate information systems. As a consequence, it is crucial that database systems provide correct and continuous service despite unpredictable circumstances, which may include hardware and software faults, or even attacks against the database system.

Applications can increase their resilience against faults and attacks through Byzantine-fault-tolerant (BFT) replication. A service that uses BFT can tolerate arbitrary failures from a subset of its replicas. This not only encompasses nodes that have been attacked and become malicious, but also hardware errors, or software bugs.
In particular, a recent study [13] showed that the majority of bugs reported in the bug logs of three commercial database management systems would cause the system to fail in a non-crash manner (i.e., by providing incorrect answers, instead of failing silently). This supports the claim that BFT replication might be a more adequate technique for replicating databases, when compared to traditional replication techniques that assume replicas fail by crashing [2].

In this paper we propose the design of Byzantium, a Byzantine-fault-tolerant database replication middleware. Byzantium improves on existing BFT replication for databases both by having no centralized components (on whose correctness the integrity of the system would depend) and by allowing increased concurrency, which is essential to achieve good performance.

The main insight behind our approach is to aim for weaker semantics than traditional BFT replication approaches. While previous BFT database systems tried to achieve strong semantics (such as linearizability or 1-copy serializability [2]), Byzantium only ensures snapshot isolation (SI), which is a weaker form of semantics that is supported by most commercial databases (e.g., Oracle, Microsoft SQL Server). Our design minimizes the number of operations that need to execute the three-phase agreement protocol that BFT replication uses to totally order requests, and allows concurrent transactions to execute speculatively in different replicas, to increase concurrency.

1.1 Related Work

The vast majority of proposals for database replication assume the crash failure model, where nodes fail by stopping or omitting some steps (e.g., [2]). Some of these works also focused on providing snapshot isolation to improve concurrency [11, 10, 5].
Assuming replicas fail by crashing simplifies the replication algorithms, but does not allow the replicated system to tolerate many of the faults caused by software bugs or malicious attacks.

There are few proposals for BFT database replication. The schemes proposed by Garcia-Molina et al. [7] and by Gashi et al. [8] do not allow transactions to execute concurrently, which inherently limits the performance of the system. We improve on these systems by showing how ensuring weaker semantics (snapshot isolation) and bypassing the BFT replication protocol whenever possible allows us to execute transactions concurrently.

HRDB [13] is a proposal for BFT replication of off-the-shelf databases which uses a trusted node to coordinate the replicas. The coordinator chooses which requests to forward concurrently, in a way that maximizes the amount of parallelism between concurrent requests. HRDB provides good performance, but requires trust in the coordinator, which can be problematic if replication is being used to tolerate attacks. Furthermore, HRDB ensures 1-copy serializability, whereas our approach provides weaker (yet commonly used) semantics to achieve higher concurrency and good performance.

1.2 Paper Outline

The remainder of the paper is organized as follows. Section 2 presents an overview of the system. Section 3 describes its design. Section 4 discusses correctness. Section 5 addresses some implementation issues, and Section 6 concludes the paper.

2 Byzantium Overview

2.1 System model

Byzantium uses the PBFT state machine replication algorithm [3] as one of its components, so we inherit the system model and assumptions of this system. Thus, we assume a Byzantine failure model where faulty nodes (clients or servers) may behave arbitrarily. We assume the adversary can coordinate faulty nodes but cannot break the cryptographic techniques used.
We assume at most f nodes are faulty out of n = 3f + 1 replicas.

Our system guarantees safety properties in any asynchronous distributed system where nodes are connected by a network that may fail to deliver messages, corrupt them, delay them arbitrarily, or deliver them out of order. Liveness is only guaranteed during periods where the delay to deliver a message does not grow indefinitely.

2.2 Database model

In a database, the state is modified by applying transactions. A transaction is started by a BEGIN followed by a sequence of read or write operations, and ends with a COMMIT or ROLLBACK. When issuing a ROLLBACK, the transaction aborts and has no effect on the database. When issuing a COMMIT, if the commit succeeds, the effects of write operations are made permanent in the database.

Different semantics (or isolation levels) have been defined for database systems [1], allowing these systems to provide improved performance when full serializability is not a requirement. Byzantium provides the snapshot isolation (SI) level. In SI, a transaction logically executes in a database snapshot. A transaction can commit if it has no write-write conflict with any committed concurrent transaction. Otherwise, it must abort.

SI allows increased concurrency among transactions when compared with serializability. For example, when enforcing serializability, if a transaction writes some data item, any concurrent transaction that reads the same data item cannot execute (depending on whether the database uses a pessimistic or optimistic concurrency control mechanism, the second transaction will either block until the first one commits or will have to abort due to serializability problems at commit time). With SI, as only write-write conflicts must be avoided, both transactions can execute concurrently.
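The SI commit rule just described can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `Txn` record, the logical timestamps, and the function names are hypothetical, and `conservative_can_commit` is only a simplified optimistic stand-in for a serializability check (it aborts on any read-write or write-write overlap), not a full serializability test.

```python
from dataclasses import dataclass, field

@dataclass
class Txn:
    start_ts: int                                  # logical time of the snapshot (hypothetical)
    read_set: set = field(default_factory=set)     # items read by the transaction
    write_set: set = field(default_factory=set)    # items written by the transaction

def si_can_commit(txn, committed):
    """SI rule: abort only on a write-write conflict with a transaction
    that committed after txn took its snapshot.
    committed is a list of (commit_ts, write_set) pairs."""
    for commit_ts, write_set in committed:
        concurrent = commit_ts > txn.start_ts
        if concurrent and (write_set & txn.write_set):
            return False
    return True

def conservative_can_commit(txn, committed):
    """Simplified optimistic stand-in for serializability: also abort
    when a concurrent committed write overlaps the read set."""
    for commit_ts, write_set in committed:
        if commit_ts > txn.start_ts and write_set & (txn.read_set | txn.write_set):
            return False
    return True

# A transaction that wrote item "x" committed at logical time 5.
committed = [(5, {"x"})]

reader = Txn(start_ts=3, read_set={"x"}, write_set={"y"})   # reads x concurrently
writer = Txn(start_ts=3, write_set={"x"})                   # writes x concurrently
later = Txn(start_ts=6, write_set={"x"})                    # started after the commit

reader_ok_si = si_can_commit(reader, committed)
reader_ok_conservative = conservative_can_commit(reader, committed)
writer_ok_si = si_can_commit(writer, committed)
later_ok_si = si_can_commit(later, committed)
```

In this sketch the concurrent reader commits under the SI rule but not under the conservative check, the concurrent writer aborts under SI, and the transaction that started after the commit is not concurrent and therefore commits.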
This difference not only allows increased concurrency for transactions accessing the same data items, but it is also beneficial for read-only transactions, since they can always execute without ever needing to block or to abort.

The SI level is very popular, as many commercial database systems implement it and it has been shown that for many typical workloads (including the most widely used database benchmarks, TPC-A, TPC-B, TPC-C, and TPC-W), the execution under SI is equivalent to strict serializability [4]. Additionally, it has been shown how to transform a general application program so that its execution under SI is equivalent to strict serializability [6].

2.3 System Architecture

Byzantium is built as a middleware system that provides BFT replication for database systems. The system architecture, depicted in Figure 1, is composed of a set of n = 3f + 1 servers and a finite number of clients.

Figure 1: System Architecture. (Diagram labels: each client node contains the Client, a JDBC interface, the Byzantium client proxy, and the BFT client proxy; each of the 3f+1 replicas contains the BFT replica proxy, the Byzantium replica proxy, and a DB.)

Each server is composed of the Byzantium replica proxy, which is linked to the PBFT replica library [3], and a database system. The database system maintains a full copy of the database. The replica proxy is responsible for controlling the execution of operations in the database system.
The replica proxy receives inputs from the PBFT replication library (in particular, it provides the library with an execute upcall that is called after client requests run through the PBFT protocol and are ready to be executed at the replicas), and it also receives messages directly from the Byzantium clients (which are not serialized by the PBFT protocol).

The database system used in each server can be different, to ensure a lower degree of fault correlation, in particular if these faults are caused by software bugs [12, 13]. The only requirement is that they all must implement the snapshot isolation semantics and support savepoints¹, which is common in most database systems.

     1  function db_begin() : trxHandle
     2    uid = generate new uid
     3    coord_replica = select random replica
     4    opsAndHRes = new list
     5    BFT_exec(<BEGIN, uid, coord_replica>)
     6    trxHandle = new trxHandle(uid, coord_replica,
     7                              opsAndHRes)
     8    return trxHandle
     9  end function
    10
    11  function db_op(trxHandle, op) : result
    12    result = replica_exec(trxHandle.coord_replica,
    13                          <trxHandle.uid, op>)
    14    trxHandle.opsAndHRes.add(<op, H(result)>)
    15    return result
    16  end function
    17
    18  function db_commit(trxHandle)
    19    result = BFT_exec(<COMMIT, trxHandle.uid,
    20                       trxHandle.opsAndHRes>)
    21    if (result == true)
    22      return
    23    else
    24      throw ByzantineExecutionException
    25    endif
    26  end function

Figure 2: Byzantium client proxy code.

User applications run in client nodes and access our system using a standard database interface (in this case, the JDBC interface). Thus, applications that access conventional database systems using a JDBC interface can use Byzantium with no modification. The JDBC driver we built is responsible for implementing the client side of the Byzantium protocol (and thus we refer to it as the Byzantium client proxy).
Some parts of the client side protocol consist of invoking operations that run through the PBFT replication protocol, and therefore this proxy is linked with the client side of the PBFT replication library.

In our design, PBFT is used as a black box. This enables us to easily switch this replication library with a different one, provided both offer the same guarantees (i.e., state machine replication with linearizable semantics) and have a similar programming interface.

3 System Design

3.1 System operation

In this section, we describe the process of executing a transaction. We start by assuming that clients are not Byzantine and address this problem in the next section. The code executed by the client proxy is presented in Figure 2 and the code executed by the replica proxy is presented in Figure 3. We omitted some details (such as error and exception handling) from the code listing for simplicity.

¹ A savepoint allows the programmer to declare a point in a transaction to which it can later roll back.

     1  upcall for BFT_exec(<BEGIN, uid, coord_replica>)
     2    DB_trx_handle = db.begin()
     3    openTrxs.put(uid, <DB_trx_handle, coord_replica>)
     4  end upcall
     5
     6  upcall for BFT_exec(<COMMIT, uid, cltOpsAndHRes>)
     7      : boolean
     8    <DB_trx_handle, coord_replica> = openTrxs.get(uid)
     9    openTrxs.remove(uid)
    10    if (coord_replica != THIS_REPLICA)
    11      execOK = exec_and_verify(DB_trx_handle,
    12                               cltOpsAndHRes)
    13      if (NOT execOK)
    14        DB_trx_handle.rollback()
    15        return false
    16      endif
    17    endif
    18    if (verifySIProperties(DB_trx_handle))
    19      DB_trx_handle.commit()
    20      return true
    21    else
    22      DB_trx_handle.rollback()
    23      return false
    24    endif
    25  end upcall
    26
    27  upcall for replica_exec(<uid, op>) : result
    28    <DB_trx_handle, coord_replica> = openTrxs.get(uid)
    29    return DB_trx_handle.exec(op)
    30  end upcall

Figure 3: Byzantium replica proxy code.

The approach taken to maximize concurrency and improve performance is to restrict the use of the PBFT protocol to only the operations that need to be totally ordered among each other. Other operations can execute speculatively in a single replica (that may be faulty and provide incorrect replies) and we delay validating these replies until commit time.

The application program starts a transaction by executing a BEGIN operation (function db_begin, Figure 2, line 1). The client starts by generating a unique identifier for the transaction and selecting a replica responsible to speculatively execute the transaction; we call this the coordinator replica for the transaction or simply coordinator. Then, the client issues the corresponding BFT operation to execute in all replicas (by calling the BFT_exec(<BEGIN,...>) method from the PBFT library, which triggers the corresponding upcall at all replicas, depicted in Figure 3, line 1). At each replica, a database transaction is started. Given the properties of the PBFT system, and as both BEGIN and COMMIT operations execute serially as PBFT operations, this guarantees that the transaction is started in the same (equivalent) snapshot of the database in every correct replica.

After executing BEGIN, an application can execute a sequence of read and write operations (function db_op, Figure 2, line 11). Each of these operations executes only in the coordinator of the transaction (by calling replica_exec, which triggers the corresponding upcall at the coordinator replica, depicted in Figure 3, line 27).
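As a rough illustration of this per-operation step (db_op in Figure 2), the following Python sketch forwards an operation to the coordinator only and records the pair of operation and result hash on the client side. It is a sketch under stated assumptions, not the authors' implementation: the paper only asks for a secure hash, so the use of SHA-256 over `repr(result)`, the `TrxHandle` class, and the `toy_replica_exec` stand-in for the coordinator are all hypothetical.

```python
import hashlib

def H(result):
    # Assumption: SHA-256 over a textual encoding of the result.
    return hashlib.sha256(repr(result).encode("utf-8")).hexdigest()

class TrxHandle:
    """Hypothetical client-side transaction handle."""
    def __init__(self, uid, coord_replica):
        self.uid = uid
        self.coord_replica = coord_replica
        self.ops_and_hres = []          # list of (op, H(result)) pairs

def db_op(handle, op, replica_exec):
    # The operation runs at the coordinator only; it does not go through PBFT.
    result = replica_exec(handle.coord_replica, handle.uid, op)
    # Remember what was asked and a hash of what was answered,
    # so that every replica can check it at commit time.
    handle.ops_and_hres.append((op, H(result)))
    return result

def toy_replica_exec(replica_id, uid, op):
    # Hypothetical coordinator that just echoes the operation.
    return "rows-for:" + op

handle = TrxHandle(uid="t1", coord_replica=2)
first_result = db_op(handle, "SELECT 1", toy_replica_exec)
```

The list accumulated in `handle.ops_and_hres` corresponds to the argument that the client later sends with the COMMIT request.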
The client proxy stores a list of the operations and corresponding results (or a secure hash of the result, if it is smaller).

The transaction is concluded by executing a COMMIT operation (function db_commit, Figure 2, line 18). The client simply issues the corresponding BFT operation that includes the list of operations of the transaction and their results. At each replica, the system verifies if the transaction execution is valid before committing it (by way of the BFT_exec(<COMMIT,...>) upcall, Figure 3, line 6).

To validate a transaction prior to commit, the following steps are executed. All replicas other than the coordinator have to execute the transaction operations and verify that the returned results match the results previously obtained in the coordinator. Given that the transaction executes in the same snapshot in every replica (as explained in the BEGIN operation), if the coordinator was correct, all other correct replicas should obtain the same results. If the coordinator was faulty, the results obtained by the replicas will not match those sent by the client. In this case, correct replicas will abort the transaction and the client throws an exception signaling Byzantine behavior. In Section 5, we discuss some database issues related to this step.

Additionally, all replicas, including the coordinator, need to verify if the SI properties hold for the committing transaction. This verification is the same that is executed in non-Byzantine database replication systems (e.g., [5]) and can be performed by comparing the write set of the committing transaction with the write sets of transactions that have previously committed after the beginning of the committing transaction. As this process is deterministic, every correct replica will consequently either commit or abort the transaction.

A transaction can also end with a ROLLBACK operation. A straightforward solution is to simply abort transaction execution in all replicas.
We discuss the problems of this approach and propose an alternative in Section 3.4.

3.2 Tolerating Byzantine clients

The system needs to handle Byzantine clients that might try to cause the replicated system to deviate from the intended semantics. Note that we are not trying to prevent a malicious client from using the database interface to write incorrect data or delete entries from the database. Such attacks can be limited by enforcing security/access control policies and maintaining additional replicas that can be used for data recovery [9].

As we explained, PBFT is used by the client to execute operations that must be totally ordered among each other. Since PBFT already addresses the problem of Byzantine client behavior in each individual operation, our system only needs to address the validity of the operations that are issued to the database engines running in the replicas.

First, replicas need to check if they are receiving a valid sequence of operations from each client. Most checks are simple, such as verifying if a BEGIN is always followed by a COMMIT/ROLLBACK and if the unique identifiers that are sent are valid.

There is one additional aspect that could be exploited by a Byzantine client: the client first executes operations in the coordinator and later propagates the complete sequence of operations (and results) to all replicas. At this moment, the coordinator does not execute the operations, as it has already executed them.
A Byzantine client could exploit this behavior by sending a sequence of operations during the COMMIT PBFT request that is different from the sequence of operations that were previously issued to the coordinator, leading to divergent database states at the coordinator and the remaining replicas.

To address this problem, while avoiding a new round of messages among replicas, we have decided to proceed with transaction commitment using the latest sequence of operations submitted by the client.

The code executed by the replica proxy for supporting Byzantine clients is presented in Figure 4. To be able to compare if the sequence of operations submitted initially is the same that is submitted at commit time, the coordinator also logs the operations and their results as they are executed (line 42). At commit time, if the received list differs from the log, the coordinator discards executed operations in the current transaction and executes operations in the received list, as any other replica.

For discarding the executed operations in the current transaction, we rely on a widely available database mechanism, savepoints, that enables rolling back all operations executed inside a running transaction after the savepoint is established. When the BEGIN operation executes, a savepoint is created in the initial database snapshot (line 3). Later, when it is necessary to discard executed operations but still use the same database snapshot, the transaction is rolled back to the savepoint previously created (line 17). This ensures that all replicas, including the coordinator, execute the same sequence of operations in the same database snapshot, guaranteeing a correct behavior of our system.

3.3 Tolerating a faulty coordinator

A faulty coordinator can return erroneous results or fail to return any results to the clients. The first situation is addressed by verifying, at commit time, the correctness of results returned to all replicas, as explained previously.
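The commit-time checks described so far (re-execution with result comparison, and the coordinator's rollback to the initial savepoint when the client's list differs from its log) can be sketched in Python as follows. This is a toy model under explicit assumptions: `ToyTrx` is a hypothetical in-memory stand-in for a database transaction handle, the savepoint is modeled as a copy of the snapshot, SHA-256 over `repr(result)` stands in for the unspecified secure hash, and the SI write-write check is left out.

```python
import hashlib

def H(result):
    # Assumption: SHA-256 over a textual encoding of the result.
    return hashlib.sha256(repr(result).encode("utf-8")).hexdigest()

class ToyTrx:
    """Hypothetical in-memory stand-in for a database transaction handle."""
    def __init__(self, snapshot):
        self.state = dict(snapshot)        # working copy of the snapshot
        self.savepoint = dict(snapshot)    # "savepoint" taken at BEGIN

    def exec(self, op):
        kind, key, *rest = op
        if kind == "write":
            self.state[key] = rest[0]
            return "ok"
        return self.state.get(key)         # any other kind is treated as a read

    def rollback_to_savepoint(self):
        self.state = dict(self.savepoint)

def exec_and_verify(trx, clt_ops_and_hres):
    # Re-execute each operation and compare against the hash the client reported.
    for op, expected_hash in clt_ops_and_hres:
        if H(trx.exec(op)) != expected_hash:
            return False
    return True

def validate_at_commit(trx, is_coordinator, logged, clt_ops_and_hres):
    has_to_exec = not is_coordinator
    if is_coordinator and logged != clt_ops_and_hres:
        # The client's list differs from what this replica executed:
        # discard the speculative work and behave like any other replica.
        trx.rollback_to_savepoint()
        has_to_exec = True
    if has_to_exec:
        return exec_and_verify(trx, clt_ops_and_hres)
    return True

snapshot = {"seats": 3}

# Correct coordinator: another replica re-executes and the hashes match.
honest = [(("read", "seats"), H(3)), (("write", "seats", 2), H("ok"))]
replica_trx = ToyTrx(snapshot)
honest_ok = validate_at_commit(replica_trx, False, [], honest)

# Faulty coordinator: it told the client that no seats were left.
forged = [(("read", "seats"), H(0))]
forged_ok = validate_at_commit(ToyTrx(snapshot), False, [], forged)

# Byzantine client: sends a different list at commit time than it executed.
coord_trx = ToyTrx(snapshot)
coord_trx.exec(("write", "seats", 2))
coord_log = [(("write", "seats", 2), H("ok"))]
submitted = [(("write", "seats", 1), H("ok"))]
coord_ok = validate_at_commit(coord_trx, True, coord_log, submitted)
```

In the third case the coordinator ends up with the state produced by the list submitted at commit time, which is the same state every other replica computes from the shared snapshot.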
This guarantees that correct replicas will only commit transactions for which the coordinator has returned correct results for every operation.

     1  upcall for BFT_exec(<BEGIN, uid, coord_replica>)
     2    DB_trx_handle = db.begin()
     3    DB_trx_handle.setSavepoint('init')
     4    opsAndHRes = new list
     5    openTrxs.put(uid, <DB_trx_handle, coord_replica,
     6                       opsAndHRes>)
     7  end upcall
     8
     9  upcall for BFT_exec(<COMMIT, uid, cltOpsAndHRes>)
    10      : boolean
    11    <DB_trx_handle, coord_replica, opsAndHRes> =
    12        openTrxs.get(uid)
    13    openTrxs.remove(uid)
    14    hasToExec = coord_replica != THIS_REPLICA
    15    if (coord_replica == THIS_REPLICA)
    16      if (different_list(cltOpsAndHRes, opsAndHRes))
    17        DB_trx_handle.rollbackToSavepoint('init')
    18        hasToExec = true
    19      endif
    20    endif
    21    if (hasToExec)
    22      execOK = exec_and_verify(DB_trx_handle,
    23                               cltOpsAndHRes)
    24      if (NOT execOK)
    25        DB_trx_handle.rollback()
    26        return false
    27      endif
    28    endif
    29    if (verifySIProperties(DB_trx_handle))
    30      DB_trx_handle.commit()
    31      return true
    32    else
    33      DB_trx_handle.rollback()
    34      return false
    35    endif
    36  end upcall
    37
    38  upcall for replica_exec(<uid, op>) : result
    39    <DB_trx_handle, coord_replica, opsAndHRes> =
    40        openTrxs.get(uid)
    41    result = DB_trx_handle.exec(op)
    42    opsAndHRes.add(<op, H(result)>)
    43    return result
    44  end upcall

Figure 4: Byzantium replica proxy code, supporting Byzantine clients.

If the coordinator fails to reply to an operation, the client selects a new coordinator to replace the previous one and starts by re-executing all previously executed operations of the transaction in the new coordinator. If the obtained results do not match, the client aborts the transaction by executing a ROLLBACK operation and throws an exception signaling Byzantine behavior.
If the results match, the client proceeds by executing the new operation.

At commit time, a replica that believes itself to be the coordinator of a transaction still verifies that the sequence of operations sent by the client is the same that the replica has executed. Thus, if a coordinator that was replaced is active, it will find out that additional operations have been executed. As explained in the previous section, it will then discard operations executed in the current transaction and it will execute the list of received operations, as any other replica. This ensures a correct behavior of our system, as all replicas, including replaced coordinators, execute the same sequence of operations in the same database snapshot.

3.4 Handling aborted transactions

When a transaction ends with a ROLLBACK operation, a possible approach is to simply abort the transaction in all replicas without verifying if previously returned results were correct (e.g., this solution is adopted in [13]). In our system, this could be easily implemented by executing a BFT operation that aborts the transaction in each replica.

This approach does not lead to any inconsistency in the replicas as the database state is not modified. However, in case of a faulty coordinator, the application might have observed an erroneous database state during the course of the transaction, which might have led to the spurious decision of aborting the transaction. For example, consider a transaction trying to reserve a seat in a given flight with available seats. When the transaction queries the database for seat availability, a faulty coordinator might incorrectly return that no seats are available. As a consequence, the application program may decide to end the transaction with a ROLLBACK operation.
If no verification of the results that were returned was performed, the client operation would have made a decision to rollback based on an incorrect database state.

To detect this, we decided to include an option to force the system to verify the correctness of the returned results also when a transaction ends with a ROLLBACK operation. When this option is selected, the execution of a rollback becomes similar to the execution of a commit (with the obvious difference that it is not necessary to check for write-write conflicts and that the transaction always aborts). If the verification fails, the ROLLBACK operation raises an exception.

4 Correctness

In this section we present a correctness argument for the design of Byzantium. We leave a formal correctness proof as future work.

Safety. Our safety condition requires that transactions that are committed on the replicated database observe SI semantics.

Our correctness argument relies on the guarantees provided by the PBFT algorithm [3], namely that the PBFT replicated service is equivalent to a single, correct server that executes each operation sequentially. Since both the BEGIN and the COMMIT operations run as PBFT requests, this implies that every correct replica will observe the same state (in terms of which transactions have committed so far) both when they begin a transaction and when they try to commit it. Furthermore, they decide on whether a transaction should commit or