1. Introduction
In the physical realm, paper fiat currencies are almost impossible to duplicate. As a result, a spent US Dollar bill cannot be concurrently used by the same payor in a different transaction. In digital space, one could also rule out double spending occurrences by setting up a central arbiter. In this case, the central authority (e.g., a bank) would decide on the fate of a transaction and enforce consensus. However, such central arbiters do not exist in decentralized structures. Up until Bitcoin, all decentralized attempts suffered from the possibility of duplicating digital units and spending them more than once.
Any decentralized solution to the double spending problem requires the relevant participants to reach consensus and agree on the ordering of transactions. This will ensure the recording of when digital unit(s) of money were spent and invalidate any attempt by their previous owner to reuse them. Bitcoin’s innovation lies in its ability to offer such a solution even when a minority of participants may act maliciously. The elements of the Bitcoin Consensus (also known as the Nakamoto Consensus) span transactions, blocks and the blockchain. We will discuss them in a subsequent post. In this chapter, we introduce the problem of reaching consensus in distributed systems, of which the Bitcoin network is an instance.
In section 2, we provide a brief introduction to these systems and highlight the intimate bond between a consensus problem and the underlying system parameters. The set of relevant parameters typically includes the network topology, the nodes configuration, the reliability of the communication channel, the synchronicity model, the types of messages exchanged, the failure regime of nodes, and whether consensus is achieved in a deterministic or a randomized way.
In section 3, we discuss the classical Byzantine Generals Problem (BGP) introduced by Lamport et al. [5], [6]. The classical BGP result is easy to state but its proof is not necessarily straightforward. Given its importance and historical value, we revisit the proof in the hope of making it easier to follow. The Byzantine Generals Problem became an allegorical representation of that of reaching consensus in distributed systems. It is commonly stated that “Bitcoin solves the BGP”. However, Bitcoin’s consensus problem is defined on a system whose parameters differ from those of the classical BGP. We will revisit this in a subsequent post.
In section 4, we look at a different class of system models which includes fully asynchronous distributed systems over which consensus must be achieved deterministically. We state and prove the seminal result that such a consensus is impossible to achieve in the presence of even a single faulty node. This is known as the FLP impossibility result in reference to its authors Michael J. Fischer, Nancy Lynch, and Mike Paterson.
2. Distributed systems
We define a distributed system to be a set of nodes spread-out across space. Each node runs a distinct process and can communicate with other nodes. For all practical matters, one can think of a node as a separate computer and the act of running-a-process as that of executing a specific task or computation. Moreover, a client that uses the distributed system does not perceive its nodes as separate entities but rather as part of a unit. In this unit, processes are executed in order to achieve a common purpose.
It is conceivable for a given node to run a process while at the same time be in control of its rules of execution (e.g., mandate when to run a process). In general however, one cannot necessarily assume that execution and governance (i.e., the particular model of control or ownership) are carried out by the same entity. A centralized governance is one where the ruling over the system is concentrated (e.g., in an individual, an organization, a state). On the other hand, the ruling in a decentralized system is spread over multiple entities.
An important implication is that a distributed system can be centralized. For example, Facebook runs a centralized model where decision-making power is concentrated within the organization. It remains nevertheless a distributed system where different servers and computers implement different processes. Bitcoin on the other hand, is an example of a distributed system that is also decentralized for anyone can join or leave the network and run an independent node.
In what follows, we describe some of the merits of distributed systems. We also showcase the importance of reaching agreement in such structures and highlight some of the challenges of doing so in the presence of faults. We finally introduce the notion of consensus and the parameters that characterize its associated system model.
The merits of a distributed system – In order to better appreciate the value of a distributed system, we mention three of its potential advantages over its non-distributed counterpart:
- Better scaling: In a scenario where a particular node receives excessive traffic, there may be a threshold beyond which the node’s performance becomes noticeably impacted. One could upgrade the processing power of the node but the merits of this vertical scaling are bound to reach a limit. A more suitable alternative would be to distribute the workload by adding more nodes to the system.
- Higher resilience: In a single-node system, any failure could be severely damaging. In order to mitigate the risk of a single-point failure and increase the level of tolerance for faulty behavior, one can create more redundancy by adding more nodes.
- Lower latency: If the system’s clients were spread across the globe, information would have to travel for longer distances resulting in longer latencies. This can be improved by geographically distributing a richer set of nodes.
The need for agreement in a distributed system – In a distributed system where different nodes run their own processes, communicate with each other, and alter their perceptions of the state of the system accordingly, these nodes may end up having different concurrent views of the system. The need to agree on a common view is imperative in certain cases as highlighted by the following examples:
1) The distributed Transaction-Commit problem: A transaction gets divided into processes run by different nodes. The objective is to decide whether or not to commit the transaction to a given database. The important consideration is that if any node rejects it, then all nodes must do so too. Otherwise, the system’s view will be inconsistent as some nodes agree to include it while others don’t. A commitment of the transaction must occur if and only if all relevant nodes agree to do so.
2) State Machine Replication (SMR) systems: A state machine reflects the state of a system at a given point in time. It takes a set of inputs or commands, performs a set of operations (collectively defining a transition function), and then computes an output used to update the state of the system. An SMR distributed system consists of various nodes that are all supposed to run the same transition function. In order to ensure a consistent view of the system’s state, there needs to be agreement on the inputs to the transition function i.e., the current state of the system as well as inputs used to alter it.
A client may send a number of sequential requests to an SMR system. The ordering of these requests is paramount and any two nodes executing them out of order will have two conflicting views of the state of the system. This is known as the log replication problem (it is a reference to the idea that the sequence of commands is stored in a log). Assuming that all nodes operate the same deterministic transition function, an agreement in this context corresponds to an alignment among all nodes on the sequencing of the commands.
One important example of an SMR is the Bitcoin ledger. The state of the system at a given time corresponds to the set of Unspent Transaction Outputs (UTXO) (the reader can refer to Bitcoin Transactions (pre-segwit) for an introduction to UTXOs). Simply stated, this set corresponds to all public keys holding unspent satoshis. Inputs that alter the state of the ledger consist of valid Bitcoin transactions. Transactions however must be executed in a well-defined sequence agreed upon by all nodes. Otherwise, a Bitcoin transaction considered as valid by one node could be invalidated by another. We will discuss the building blocks and details of the Bitcoin consensus protocol in a later post.
3) Clock synchronization: In order for a system’s nodes to execute certain processes in a well-defined order, they need to share a common view of time. The challenge is that the internal clocks of nodes differ in the way they count the passage of time. The difference is due to clock drift, usually caused by relativistic effects. Clock synchronization is the problem of coordinating the clocks of various nodes at regular time intervals to ensure ordered execution of events. This problem can be equivalently stated as one of reaching agreement on a common value of time between various nodes.
The challenge of reaching agreement in the presence of faults – In light of the above examples, it becomes clear that some distributed systems must ensure that their nodes reach agreement. In a perfect world where nodes relay information truthfully, agreement could be easily achieved. For example, each node could be requested to relay its information to peers and then have all nodes apply a common function. Nodes however, may not be truthful all the time. In general, one assumes that a certain maximal number of them can be faulty. The behavior of faulty nodes is specified by a pre-defined failure model which may consist of:
- Crash failure: In this model, a node can either be fully operational or out of order. In particular, a node may fail in the middle of an execution. As a result, it could have sent information to only a small subset of its peers before crashing.
- Omission failure: Information sent by a node may not be received by a peer. This can be due to various factors including transmission problems or buffer overflow.
- Byzantine failure: Byzantine faults are the weakest form of failures in the sense that faulty nodes can behave arbitrarily without abiding by specific constraints. In a byzantine regime, a faulty node can act maliciously vis-a-vis one of its peers at a certain time instance and honestly at another. In this context, malicious behavior is to be understood in its general form including e.g., communicating wrong information to peers or abstaining from sending or relaying any information. These faults are particularly important in a decentralized setting.
Consensus in distributed systems – The real challenge with distributed systems is to reach agreement in the presence of faulty behavior. More formally, the act of reaching agreement is encapsulated in the notion of achieving consensus. An algorithm is said to achieve consensus in a distributed system if it guarantees that the following three criteria are met:
- Agreement: All non-faulty nodes (also known as correct nodes) must agree on the value (or array of values) that they compute. In other words they must all share the same value(s) after the algorithm is executed.
- Validity: In the absence of any constraint, non-faulty nodes could agree on trivial values irrespective of the nature of the problem. In order for them to be meaningful, agreed-upon values must satisfy more stringent constraints. The validity criterion ensures that non-faulty nodes decide on “acceptable” value(s) for some notion of “acceptable”. Different validity requirements lead to different types of consensus.
- Termination: All non-faulty nodes must eventually decide on a value (or array of values).
The above consensus criteria are usually expressed in terms of safety and liveness properties. Informally, safety is a property that must be continuously observed by the system in order to ensure that no “bad” outcome occurs. Liveness on the other hand, guarantees that a “good” outcome will eventually take place. Liveness properties do not need to be continuously observed but must eventually be met:
- The Agreement criterion ensures that non-faulty nodes never diverge in their decision making. It is thus considered a safety property.
- The Validity criterion guarantees that non-faulty nodes never choose an inadequate value. As a result, it is also considered a safety property.
- The Termination criterion on the other hand, guarantees that eventually every non-faulty node will decide on a value. It is hence a liveness property.
The aforementioned Termination criterion requires that for each and every iteration of the consensus algorithm, non-faulty nodes decide on a value (or array of values). This definition characterizes a class of consensus algorithms known as deterministic. Termination could also be defined stochastically, leading to the class of randomized consensus algorithms. In this case, it becomes:
- Termination: All non-faulty nodes must eventually decide on a value (or array of values) with probability 1.
In other words, some executions of the algorithm may fail to terminate as long as the probability of it happening approaches 0 when the number of executions tends to infinity.
System model specification – The characterization of a distributed system requires specifying a number of system parameters. They include:
- Nodes configuration: A system may consist of a pre-defined set of static nodes that never changes over the course of execution. For instance, nodes could be geographically spread servers deployed by an organization to service its global client base. Configurations could also be dynamic (e.g., Bitcoin) with different nodes joining or leaving at various points in time.
- Network topology: Nodes may be connected in various ways. For instance, a node can be linked to a select set of peers or to every other node as part of a complete graph topology.
- Communication channel reliability: In addition to specifying the failure regime of nodes, a full description of a distributed system requires defining the reliability of its underlying communication channel. For all practical purposes, we will assume that the infrastructure is reliable and limit faulty behavior to nodes.
- Communication delay: A system can be classified as synchronous, partially synchronous or fully asynchronous. In a synchronous network, messages sent are guaranteed to be delivered to peers within a fixed delay of seconds known a priori. This presupposes that nodes have a common reference time against which is measured and is typically achieved through clock synchronization at regular intervals called rounds. One advantage of synchronous systems is that nodes can recognize if a message has not been sent by waiting seconds from the beginning of a specific round.
A more realistic model is that of an asynchronous network where no guarantees are imposed on message delivery delay except for the assurance that messages sent will eventually be delivered. Contrary to the synchronous case, asynchronous networks do not rely on a notion of a common reference time. An important result in distributed systems theory is the impossibility of achieving deterministic consensus in a fault-tolerant asynchronous setting. This is the FTP impossibility result [4] that we will discuss in section 3. The result ceases to hold if the deterministic constraint is replaced by its randomized counterpart [1], underscoring as such the importance of specifying the system parameters prior to solving for consensus.
A model that lies midway between these two extremes, is the partially synchronous one [2]. Partial synchrony comes in different flavors. One version assumes the existence of a not known a priori upper-bound on the delay to deliver a message from one node to a peer. Another version assumes that the bound is known a priori but only guaranteed to apply starting at an unknown time instance. - Message authentication: Two types of messages could affect the process of reaching consensus in distributed systems. Unauthenticated or oral messages can be tampered with. A malicious node could modify the content of a message it received before it relays the altered version to a peer. It could also create a message and claim that it received it from a peer. Authenticated or signed messages on the other hand, are tamper-proof and forgery attempts will be detected with overwhelming probability. As a result, solving for consensus with signed messages is generally easier because the arsenal of malicious weapons does not include forgery.
In summary, consensus in distributed systems depends on a number of parameters. In order to specify a consensus problem, one needs to define:
- The system parameters including the nodes configuration and topology, reliability of the channel, synchronicity model, and types of messages.
- The faulty nodes failure regime (e.g., byzantine).
- The nature of the Termination criterion (i.e., deterministic or randomized).
- The consensus problem as defined by the relevant validity criterion.
3. The classical Byzantine Generals Problem
The Byzantine Generals Problem (BGP) introduced by Lamport et al. in 1982 [5] describes how a distributed system can operate effectively even if some nodes fail under a byzantine fault regime. It portrays the system as an army whose generals need to agree on a common action plan (e.g., attack or withdraw) and where some may be traitors, sending conflicting messages to peers. In essence, the BGP is an allegorical representation of the problem of reaching consensus in distributed systems and is defined as follows:
1) System parameters:
- Nodes configuration: The system consists of a set of pre-defined and static nodes (i.e., addition or removal of nodes is not allowed). Each node has a device (e.g., a sensor) that runs a process (e.g., a sensor measurement) and computes a private value (e.g., a reading from sensor measurement).
- Network topology: The network is modeled as a complete communication digraph with nodes, where each two nodes are linked by a bidirectional communication channel or edge.
- Communication channel reliability: The edges in are assumed to be fail-safe i.e., truthful with no error in communication.
- Communication delay: The edges in exhibit negligible communication delay. More importantly, the network is assumed to be synchronous.
- Message authentication: Messages are assumed to be unauthenticated but the identity of the sender is always known to the receiver. Note that in [5], the authors also consider a variant of the problem with signed messages instead.
2) Failure regime: Although the communication channel over is assumed to be fail-safe, a subset of may be faulty. We assume that at most out of the nodes could be faulty under a byzantine failure regime.
3) Termination criterion: The model assumes a deterministic termination rule.
4) Agreement and validity criteria: Each non-faulty node in computes an –vector whose entry is a value it calculates for the node such that:
- Agreement: All non-faulty nodes compute the same -vector
- Validity: If node is non-faulty and its private value is then the entry of computed by all non-faulty nodes is . In other words,
These consensus criteria are known as the Interactive Consistency (IC) formulation of the classical BGP [6]. Note that they do not require specifying which nodes are faulty. Furthermore, the elements of corresponding to faulty nodes may be arbitrary.
It turns out that the (IC) formulation can be equivalently expressed in two other ways: A Byzantine Generals (BG) formulation and a Consensus (C) one. The (BG) formulation introduced in [5] states that a General in the Byzantine army must send a value to his lieutenants such that:
- Agreement: Honest lieutenants (i.e., non-faulty nodes) agree on a value
- Validity: If the General is honest (i.e., source node is non-faulty), then
In the (C) formulation [3], each node is endowed with an initial value and the Agreement and Validity criteria become:
- Agreement: All non-faulty nodes agree on the same single value
- Validity: If all non-faulty nodes share the same initial value then their agreed upon value must be
BGP consensus formulations equivalence: In what follows we prove the equivalence of all three consensus formulations. More specifically, we show that an algorithm that can solve one of the problems can also be used to solve the other two. We denote by and any algorithms that respectively solve the (C), (BG), and (IC) formulations of the classical BGP.
1) If there exists an then there exists an : Without loss of generality, assume that the initial state of the (BG) formulation consists of general communicating his private value to his lieutenants. Conduct one round of communication and let be the value received by lieutenant Set it as node ‘s initial value. Clearly, we also have that node ‘s initial value is Now run on these initial states:
- Since the Agreement criteria of ensures that all non-faulty nodes agree on the same single value all honest lieutenants will certainly agree on the same value This guarantees the Agreement criteria of (BG).
- Now suppose that the general is honest (i.e., node is non-faulty). Then all non-faulty lieutenants will share the same initial value (i.e., the general’s private value). The Validity criteria of (C) would then ensure that their agreed upon value is This proves that the Agreement criteria of (BG) is satisfied.
2) If there exists an then there exists an : For each non-faulty node let denote its private value and associate with it an -dimensional vector whose entries are all initialized to except for the entry whose value is set to In other words, is initially set to For each node run with node acting as general. Upon termination, update the entry of each with the resulting value computed by node :
- If were a non-faulty node, then the (BG) Agreement and Validity criteria will ensure that all non-faulty lieutenants agree on the same value As a result, the entry of each will be the same and equal to
- If were a faulty node, then the (BG) Agreement criterion will ensure that all non-faulty lieutenants agree on some common value. As a result, the entry of each will be the same.
3) If there exists an then there exists an : For each non-faulty node let denote its private value. Without loss of generality, suppose that the first nodes are non-faulty (i.e., Run to obtain an interactive consistecy vector Note that the values () are arbitrary as they correspond to faulty nodes. Let each non-faulty node pick the first entry of (i.e., ). This ensures that the Agreement and Validity criteria of (C) are met:
- All non-faulty nodes agree on the same single value, namely
- If all non-faulty nodes shared the same initial value then
An impossibility result for the classical BGP: It is not always possible to achieve consensus in a classical BGP setting. In [5] and [6], the authors showed that a necessary and sufficient condition for this to happen is for the total number of nodes to strictly exceed three times the number of faulty ones (i.e., ). We will lean on the (IC) formulation to demonstrate that this condition is necessary by showing that it is impossible to reach consensus if We then rely on the equivalent (BG) formulation to prove that the condition is sufficient by describing an algorithm that achieves consensus whenever the condition is met [5].
We first start by formalizing the description of some of the system’s parameters introduced earlier. Recall that the underlying communication network is a digraph G with nodes, at most of which can be faulty. We succinctly denote this set-up by the triplet (). It is common to attach a processor to node and let be the set For all practical matters, the terms processor and node can be freely interchanged. Each processor has a private value (or initial state value) drawn from a set We let denote the private value of
The objective is to devise an algorithm that can reach consensus irrespective of which processors are faulty, as long as there are at most of them. A particular instance of () is called a system and is specified by:
- The subset of non-faulty processors. Note that
- The behavior of the processors as defined by the value that processor receives for processor when the transmission happens over some path in Clearly, if all processors were non-faulty, would receive the exact value sent by Faulty processors on the other hand, may behave maliciously and their behavior may vary from one processor to another.
We denote the system associated with a given subset and behavior by More formally, is defined as the map:
where is the set of all non-empty strings over (i.e., paths in ) and is an appropriate set of initial state values. We require that this map satisfies the following:
- Initial state specification: In other words, maps each processor to its private value.
- Behavior: For any path let be interpreted as
“ told that told that .. that told that its value was “.
Note that if then and we expect to be equal to Indeed, by definition, a non-faulty must truthfully communicate whatever it receives. A behavior that ensures this condition is said to be consistent with
We rely on this formalism to define the notion of interactive consistency. Let be the space of all allowable systems on () i.e., any system with:
- A set of non-faulty processors satisfying and;
- A behavior such that is consistent with .
In what follows, it is understood that a system is defined on () and we write instead of
Define the map to be:
where for an allowable system the output corresponds to the value of processor computed by processor in the (IC) formulation. If the output is taken to be ‘s private value. The consistency vector computed by is then the -dimensional vector:
Note that is calculated based on one or more pieces of information available to processor Each such piece of information is received by over some path in and is hence of the form where We denote the restriction of to paths in starting with by
We say that solves the (IC) formulation if the following consensus conditions hold:
1) Agreement condition:
Intuitively, this condition requires that any two non-faulty processors share the same consistency vector. This is the Agreement criterion of the (IC) formulation.
2) Validity condition:
Intuitively, this condition requires that the entry corresponding to a non-faulty processor in the consistency vector computed by a non-faulty processor be ‘s private value. This is the Validity criterion of the IC formulation.
We can now formally state and prove the classical BGP’s impossibility result:
and that solves the (IC) formulation of BGP.
The proof is a reductio ad absurdum. Suppose that given and one were able to find such an (i.e., an that achieves consensus on any allowable system Our objective is to construct three systems whose coexistence would contradict the Agreement criterion needed for to be an acceptable solution.
Since one can partition into three non-empty subsets and such that
max()
Furthermore, since such that
Consider the system where and some behavior consistent with Then since Similarly, we can consider the two other systems and in where is some behavior consistent with and with
Suppose that in addition to being respectively consistent with and , behaviors and also satisfied the following constraints:
- For any and are indistinguishable, i.e., (this refers to the restriction of a behavior to paths in starting with
- behaviors and are indistinguishable, i.e.,
We could then reach the desired contradiction as follows:
- solely depends on And since it is equal to
- by the Validity criterion of
- by design of the behaviors and
- by the Validity criterion of
- solely depends on And since it is equal to
- As a result, This contradicts the Agreement criterion of (IC) since is consistent with and
Consequently, all that is needed to complete the proof is to construct and satisfying these constraints. Note that elements of can be of three types:
1) Strings that don’t end with a processor in In this case, let
2) Strings of length 1 or 2 that end with a processor in let
and
3) Strings of length greater than 2 that end with a processor in For any string ending with a processor in and let
and
and
and
Note that by defining the action of the various behaviors on a string of length in terms of the action of one of these maps on a string of length one can easily compute the actual values recursively as they have been previously defined for the cases and
Clearly, behavior is consistent with Indeed, (i.e., is of the form or ), and and we have and
Similarly, behavior is consistent with since (i.e., is of the form or ), and and we have and
Finally, behavior is consistent with since (i.e., is of the form or ), and and we have and
Next we show that behaviors and are indistinguishable (i.e., ) and behaviors and are indistinguishable (i.e., ).
First, note that not ending in a processor in the construction mandates that In particular this holds true for such strings that start with a processor in and so In addition, this holds true for such strings that start with a processor in and so
To show it for strings ending in a processor in we proceed by induction on the length of If is of length 1, i.e., the construction mandates that and so and are indistinguishable over elements of Similarly, the construction mandates that and so and are indistinguishable over elements of
Now suppose that the result holds true for strings of length that end in a processor in Relevant strings of length must be of the form or (). We must show that:
- and
- and
We will show it only for 1. as 2. can be done in exactly the same way:
- (by construction), which is equal to (by induction), which in turn is equal to (by construction).
- (by construction), which is equal to (by induction), which in turn is equal to (by construction).
- (by construction), which is equal to (by construction).
Here is a summary of the three systems for the case and
The intuition is as follows:
- From the point of view of processor systems and are indistinguishable because and are identical when restricted to strings starting with As a result, cannot tell whether is faulty (i.e., system is applicable) or is (i.e., system is applicable). In order not to violate the Validity condition in is then forced to register for the value
- Similarly, from the point of view of processor systems and are indistinguishable because and are identical when restricted to strings starting with As a result, cannot tell whether is faulty (i.e., system is applicable) or is (i.e., system is applicable). In order not to violate the Validity condition in is then forced to register for the value
- But in order not to violate the Agreement condition in system processors and must both register the same value for processor However, this is not the case since registered while registered
Note that this proof fails if This is because any 3-subset partition of would have at least one subset with This would cause system to be not allowable (i.e., ).
Solving the classical BGP for : We now show that the necessary condition is also sufficient. We do so by describing an algorithm that achieves consensus in the (BG) formulation.
For a given allowable system in and processor acting as general, we make explicit the dependence of on and and write We define the map to be:
where the output corresponds to the value that processor computes for We say that solves the (BG) formulation if the following consensus conditions hold:
1. Agreement:
Intuitively, non-faulty lieutenants must compute the same value for general .
2. Validity: If then
Intuitively, this requires that the value that a non-faulty lieutenant computes for a non-faulty general be ‘s private value.
To devise such a map, we introduce a recursive algorithm over that takes three inputs: a subset a processor and an iteration variable such that
- Base case When is 0, processor sends its value to every other processor who receives value and attributes it to
- General case for
i) Processor sends its value to every other
ii) Processor receives value A new instance of algorithm is then executed for each with an iteration counter set to and a processor set Each such iteration sends to the remaining processors This step runs an instance of for each totaling instances.
iii) let denote the value that computed for under algorithm in step ii). Subsequently, computes the following value and assigns it to
We can represent the above logic in pseudo-code as follows:
Define
If is equal to
For each do the following:
receives
assigns the value to
Else, if :
For each do the following:
receives and sets it as its private value
Run and store the resulting vector where denotes the value that computed for and where
For each does the following:
Assign to where the index
Return the vector where
Algorithm invokes algorithms of order namely, . Similarly, each algorithm of order invokes others of order The lowest order ones have and are called times. Finally, each algorithm of order 0 sends messages, resulting in a total of messages and a complexity of
we now define the map as follows:
where is the appropriate component of the vector returned by with an iteration count set to (the maximal number of faulty processors allowed).
We claim that this map solves the (BG) formulation of the classical BGP whenever Before we prove its correctness, we look at two clarifying examples (we will drop the superscript for ease of notation)
Example 1, Let and be faulty. There are two cases depending on whether the general is faulty or not. We will refer to processors by their indices and enclose received values in brackets and computed values in parentheses:
We describe the case of a faulty general (the other one can be analyzed similarly):
- Algorithm is invoked and sends its value to every lieutenant
- Lieutenant receives value Let and Subsequently, each acts as general and runs a new instance of algorithm to send to the remaining two lieutenants. More specifically:
-
- Under sends to lieutenant and to lieutenant
- Under sends to lieutenant and to lieutenant
- Finally, under sends to lieutenants and to lieutenant
- Since the algorithm is running instances with it must be that lieutenants and compute a value equals to under Similarly, lieutenants and compute under while lieutenants and compute under
- Finally, the value that lieutenants and computes for under is equal to:
Example 2, Let and be faulty. Here too, there are two cases depending on whether the general is faulty or not. We treat the case of a faulty general (the other case can be analyzed similarly) and follow the convention of enclosing received values in brackets and computed values in parentheses:
- Algorithm is invoked and processor sends its value to every lieutenant
- Lieutenant receives value Let and Subsequently, each acts as general and runs to send to the other five lieutenants.
- The next step is to compute the action of We illustrate it for where processor acts as general and sends its value to the remaining five lieutenants . In this case, lieutenant receives Subsequently, each acts as general and runs a new instance of algorithm to send to the remaining four lieutenants:
-
- Under processor acts as general and sends its value to lieutenants Lieutenant receives
- Under processor acts as general and sends its value to lieutenants Lieutenant receives
- Under processor acts as general and sends its value to lieutenants Lieutenant receives
- Under processor acts as general and sends its value to lieutenants Lieutenant receives
- Under faulty processor acts as general and sends some unknown value(s) to lieutenants Each lieutenant receives an unknown value that we denote by a question mark (?).
- Since each received values also serves as the computed value that the relevant processor attributes to We can now compute the value that the non-faulty lieutenants and compute for under
-
- Lieutenant computes:
-
- Lieutenant computes:
-
- Lieutenant computes:
-
- Lieutenant computes:
- Similarly, one can evaluate
-
- For we find that the values that the non-faulty lieutenants and compute for are all equal to
- For we find that the values that the non-faulty lieutenants and compute for are all equal to
- For we find that the values that the non-faulty lieutenants and compute for are all equal to
- For we find that the values that the non-faulty lieutenants and compute for are all equal to
- For we find that the values that the non-faulty lieutenants and compute for are all equal to where denotes the value that communicates to processor under These values may be different from each other since is faulty.
- Finally, the value that the non-faulty lieutenants compute for under are as follows:
- Lieutenant computes
- Lieutenant computes
- Lieutenant computes
- Lieutenant computes:
Proof of the algorithm’s correctness: We wrap up this section with a correctness proof for the aforementioned algorithm whenever .
Let be a set of processors, with acting as general for some . Furthermore, assume that at most out of processors can be faulty, with We claim that the vector returned by satisfies the Agreement and Validity conditions of the (BG) consensus formulation.
We will prove this by induction on and Note that serves as the iteration count in as well as the maximal number of faulty processors in .
Base case: Given any subset such that and such that all processors in are non-faulty (this is possible since there are at most faulty processors), algorithm satisfies the Validity and Agreement conditions This should be rather clear since when is executed, each receives and registers the value As a result, all lieutenants agree on ‘s private value, causing the Validity and Agreement conditions to be upheld.
Induction step: Suppose that and that satisfies the Agreement and Validity conditions whenever Now assume that Our objective is to prove that also satisfies both conditions. Without loss of generality, we assume that the first processors are non-faulty and consider the two cases corresponding to a faulty or non-faulty general
The case of a faulty general : When is executed, general sends a value to each lieutenant These values may be arbitrary and different than ‘s private value given the general’s faulty nature.
The next step is for the algorithm to execute for each lieutenant First note that since we have . We can then use the induction hypothesis and assume that satisfies the Agreement and Validity conditions.
- If is non-faulty (i.e., ), its resulting vector will be of the form where the first entries are all equal to by virtue of ‘s Validity condition.
- If is faulty, its resulting vector must have the first entries all equal. Indeed, these are the values computed by the non-faulty lieutenants on behalf of the faulty processor and must all be equal by virtue of ‘s Agreement condition.
The subsequent majority function applied at the level of each non-faulty processor will then have the same set of inputs and as a result, compute the same output. This guarantees that satisfies the Agreement condition. The Validity condition is futile in this case since the general is known to be faulty.
The case of a non-faulty general : When is executed, general sends a value to each lieutenant They are all equal to ‘s private value.
is subsequently executed for each lieutenant Since we have . As a result, we can invoke the induction hypothesis and assume that satisfies the Agreement and Validity conditions.
- If is non-faulty (i.e., ), its resulting vector will be of the form where the first entries are all equal to by virtue of ‘s Validity condition.
- If is faulty, its resulting vector must have the first entries all equal. Indeed, these are the values computed by the non-faulty lieutenants on behalf of the faulty processor and must all be equal by virtue of ‘s Agreement condition.
Since and it must be that The majority of the lieutenants are thus non-faulty. The last step in the execution of will then guarantee that all non-faulty lieutenants compute the same value for ensuring as such that the Validity and Agreement conditions are observed.
4. FLP impossibility result
We now consider a different class of consensus problems for which no algorithm can always reach consensus in finite time. This was first stated and proved in [4] and came to be known as the FLP impossibility result. We start by defining the relevant consensus problem before we state and prove this seminal result.
System model: For this class of consensus problems, we consider systems with arbitrary network topologies consisting of a pre-defined set of static nodes or processors for some integer The underlying communication channel is assumed to be reliable and any faulty behavior is modeled at the level of the processor as we describe later under the node failure regime. No constraints are imposed on the nature of the messages which could be oral or signed. Most importantly, the class of systems considered are fully asynchronous.
In what follows, we introduce numerous definitions to help formalize the system model:
- Processors communicate by sending each-other messages. A message is defined to be a pair where is the destination processor and a message value destined to taken from a fixed message set
- A message system is a buffer of messages that have been sent but not yet received by their destined processor. Adding a message to is achieved by executing a send function:
which places in
- Removing a message from requires the execution of a receive function:
which does one of two things:
-
- Returns i.e., leaves unchanged, or
- Returns a message value taken from the subset of all messages in intended to and deletes from We say that message has been delivered.
- The function is subject to the condition that if is performed infinitely may times, every message intended to gets eventually delivered.
- The notion of asynchronicity is embedded within the definition of the receive function. Indeed, the function acts in a non-deterministic way by having the right to return a finite number of times in response to even though an intended message exists in Note that if this right were granted an infinite number of times, the aforementioned condition would fail to hold.
- Each processor is characterized by a set of attributes consisiting of:
- An input register whose value is a single bit.
- An internal storage unit of infinite capacity that we denote .
- A program counter that we refer to as .
- An output register that can take values from where denotes a value other than or .
- At any point in time, we can concisely represent the state of processor by the four-tuple We refer to it as the internal state of at time At each processor starts at an initial state characterized by an empty input register and output register set to
inititial state internal state
- By exchanging messages, processors change their internal states. A primitive step by processor consists of two phases:
- Call method and obtain a value
- Depending on ‘s internal state and on enters a new internal state and sends a finite number of messages to other processors (i.e., places them in by executing the function).
- The change of ‘s internal state is dictated by a deterministic transition function The only constraint on is that it cannot change the value of ‘s output register once reaches a decision (i.e., when ). In other words, the output register is write once. More formally, we can let denote the state space of i.e., the space of all four-tuples We let denote a discrete unit of time corresponding to when primitive step # was applied. The transition function can be generically defined as:
such that
- At any given time the system will be in a certain configuration which corresponds to the internal states of all processors in along with the content of the message buffer at time
- At the initial configuration of the system corresponds to the initial states and initial input register values of each processor as well as an empty message buffer
- Moving from configuration to occurs after the execution of primitive step # which is fully determined by a pair We refer to the receipt of by following primitive step # as the event Recall that could be as per the definition of the function. We say that one moves from to by applying event and write:
- The event can always be applied to any configuration and so it is always possible for a processor to take another step.
- We say that a configuration has decision value if some processor is in a decision state with This definition does not impose any restriction on the number of decision values that a configuration may have. Indeed, it is conceivable for different processors in a configuration to have reached different decision values. We will however impose a restriction when we later define the Agreement criterion of the consensus problem.
- A schedule starting at configuration is a finite or infinite sequence of events that can be sequentially applied to The associated sequence of steps that generates these specific events is called a run. A finite schedule of length starting at results in another configuration such that:
In this finite-length case, we say that is reachable from A configuration that is reachable from some initial configuration is said to be accessible.
Failure regime: The nodes are assumed to operate under a crash failure regime where a given processor can either be operational or dead. More specifically, we say that a processor is non-faulty in a given run if it can take infinitely many steps. This is a weaker version than the byzantine regime we considered in section 3. The justification for this choice lies in the fact that impossibility results that hold in a relatively basic failure regime would also hold in a stronger one including the byzantine model.
Consensus problem: We are now in a position to specify what is meant for an algorithm to reach consensus for this class of system models. To do so, we describe the Agreement, Validity and Termination criteria that an algorithm must observe if it were to solve the consensus problem:
1. Agreement: No accessible configuration can have more than one decision value.
2. Validity: some accessible configuration has decision value In other words, this criterion ensures that there are no trivial solutions to the consensus problem.
3. Termination: Before stating the Termination criterion, we define what is meant by an admissible and deciding run:
- A run is admissible if at most one processor is faulty and if all messages destined to non-faulty processors are eventually received.
- A run is deciding if some processor reaches a decision state in that run.
The Termination criterion requires every admissible run to be a deciding run. Note that this criterion only requires that some processor makes a decision rather than all processors deciding. Here too, an impossibility result that holds in this weaker context will certainly hold in the stronger setting that requires all processors to decide. An important observation is that the Termination criterion must hold deterministically i.e., every time the consensus algorithm is executed.
In [4], the authors refer to a consensus prototcol or algorithm that satisfies the Agreement and Validity conditions as partially correct. If it also satsfies the Termination criterion, then it is said to be totally correct in spite of one fault. The FLP impossibility result can then be stated as follows:
No consensus protocol is totally correct in spite of one fault
In order to prove this, the authors demonstrate that every partially correct protocol has some admissible run that is not a deciding run. In other words, if the Agreement and Validity conditions were respected then the Termination criterion would fail. We now turn to the reductio ad absurdum proof articulated in [4].
Proof of the FLP impossibility result: The gist of the proof consists in showing that if all three criteria are uphelp, then one could still find an admissible run that avoids taking any decision at all times, violating as such the Termination criterion. To do so, we proceed in two steps:
- We first show that there exists at least one initial configuration that admits at least two schedules leading to two different decision values. Such a characteristic is referred to as bivalency.
- We then show that given any bivalent configuration, there exists a schedule that leads to another bivalent configuration.
Intuitively, a bivalent configuration is one whose decision is not known a priori. Creating an inifnite chain of such configurations will clearly violate the Termination criterion.
Lemma A: In a totally correct consensus protocol in spite of one fault, there exists a bivalent initial configuration.
Let be a configuration at some time and let be the set of decision values of all configurations reachable from Clearly. must be a subset of i.e.,
- If we say that is bivalent.
- If () we say that is -valent (-valent).
We first claim that To see why, note the following:
- There always exists an admissible run starting at This is because by assumption, we consider systems where at most one processor is faulty and such that for all non-faulty processors , the condition we imposed on the function ensures that all messages destined to get eventually delivered.
- Since the system is assumed to be totally correct, every admissible run must also be a deciding run. As a result, the set of decision values of all configurations reachable from cannot be the empty set.
We now now proceed with a reductio ad absurdum proof of Lemma A.
- Suppose that Lemma A does not hold, i.e., in a totally correct consensus prototcol in spite of one fault, there does not exist any bivalent initial configuration.
- We already established that for any configuration In particular, If furthermore no bivalent initial configuartion exists, then any initial configuration must either be 0-valent or 1-valent.
- This result, coupled with the Validity criterion shows that there exists distinct initial configurations and such that is 0-valent and 1-valent (i.e., and
- Next, note that any two initial configurations differ only in the initial value of a subset of their processors. In other words:
where such that
- Now observe that one can transform any initial into another initial through a sequence of adjacent configurations where each configuration in the sequence differs from its neighbor(s) in the initial value of a single processor. For example, starting at one can apply the following steps to get to :
Step : Replace with (leave everything else intact):
Step : Replace with (leave everything else intact):
Step : Replace with and get
- Since any initial configuration must either be -valent or -valent, and since and have different valencies, it must be that in the sequence of adjacent configurations leading from to there exists a -valent initial configuration adjacent to a -valent initial configuration ( where the two differ only in the initial value of
- Consider an admissible run starting at initial configuration and such that processor is the only faulty processor and such that it is assumed to have crashed prior to starting the run. By the total correctness assumption, this admissible run must also be a deciding one. Let be its corresponding schedule.
- Since and differ only in ‘s initial value, and since this value is irrelevant to in the context of this run ( is assumed to be a dead processor that takes no steps in the run), one can apply the same schedule on the initial configuration Furthermore, the deterministic transition functions will ensure that the two runs on and result in the same decision.
- If the decision is then this would contradict ‘s -valency. Otherwise ‘s -valency would be contradicted. Q.E.D.
Next, we show that given a totally correct consensus protocol in spite of one fault, we can always derive a bivalent configuration from another bivalent one by applying an adequate sequence of events.
Lemma B: Let be a bivalent configuration at time Let be an event applicable to Let be the set of all configurations reachable from without applying and the set and is applicable to . In a totally correct consensus protocol in spite of one fault, we claim that must contain at least one bivalent configuration.
To prove it, we lean on a number of sub-lemmas. In the proofs below we drop the explicit dependence of a configuration on a particular time instance since knowledge of the exact time or step when a configuration materializes is not necessary for our purposes:
Sub-lemma B.1: The event is applicable to every configuration
- The event is clearly applicable to configuration (by the assumption in Lemma B).
- Furthermore, messages could be delayed arbitrarily (due to the asynchronous nature of the system model).
- As a result, one could arbitrarily delay the receipt of message value by processor Q.E.D.
Sub-lemma B.2: If the set does not contain any bivalent configuration, then it must contain both a -valent and a -valent configuration.
- Since is a bivalent configuration (by the assumption in Lemma B), there exists a 0-valent and 1-valent configurations and reachable from We now show how to derive a 0-valent configuration from that is an element of We can replicate the same logic to derive a 1-valent configuration from
- Two cases arise depending on whether is an element of or not.
- If let be the configuration This is possible bySub-lemma B.1. Clearly, by the definition of the set
- If then the event was applied sometime before reaching configuration Let be the configuration immediately obtained after applying In this case, is reachable from .
- If has no bivalent configuration, then must be univalent (we’ve shown as part of Lemma A that it cannot be ):
- In the first case above, is reachable from . Since is 0-valent, then so must be
- In the second case, is reachable from If were 1-valent, then would also have to be 1-valent. Since is 0-valent, then is 0-valent. Q.E.D.
Sub-lemma B.3: Two configurations are said to be neighbors if one can be reached from the other through the application of a single event. If has no bivalent configurations, then there must exist two neighboring configurations and such that configuration is 0-valent and configuration is 1-valent.
- By Sub-lemma B.2 we know that must contain both a 0-valent configuration and a 1-valent configuration Let and be the two configurations in such that and
- Since all the elements of are reachable from it must be that and are reachable from Let be the last common configuration in the two distinct paths from to and as depicted below:
- Suppose that for any two neighboring configurations and and cannot have different valences. We’ve seen as part of Lemma A that and cannot have an empty set of decision values either. Furthermore, being elements of they cannot be bivalent by the condition in Sub-lemma B.3. As a result, and must have the same valence.
- In particular, since configurations and are linked by a sequence of neighbors, it must be that and share the same valence. Given that is 0-valent, it must be that is 0-valent. By a similar argument, and using the sequence of neighbors linking and we can also conclude that is 1-valent. In other words, is bivalent.
- But and so A bivalent contradicts the initial assumption that has no bivalent configurations. Q.E.D.
Sub-lemma B.4 (“Commutativity property of schedules”): Suppose that from some configuration schedules and lead to configurations and respectively, for some If the two sets of processors taking steps in and are disjoint, then the application of to and to will result in the same configuration , for some
Without loss of generality, suppose that the system’s processor set consists of two distinct processors We will prove the sub-lemma for the simple case where the two schedules are disjoint singletons, i.e., and The general case can be analyzed using the same logic.
- Let be initially applied to The event corresponds to the receipt of message value by Recall that the receive function deletes from the message buffer and then depending on ‘s internal state and on the message value enters a new internal state and sends a finite set of messages to other processors.
- Let
- At we can write:
),
where is a set of newly generated messages and processors pairs.
- At we can write:
),
where is a set of newly generated messages and processors pairs.
- One can easily see that applying to and then applying to the resulting configuration would yield the same configuration
We are now in a position to prove Lemma B. Suppose that has no bivalent configurations. By Sub-lemma B.3, there must exist two neighboring configurations such that is 0-valent and is 1-valent ( is the event ). By virtue of being neighbors, we can assume without loss of generality that for some event We have two cases to consider:
- Case : We have Since then the two processors taking steps in and are disjoint. We can thus apply Sub-lemma B.4 to get This is not possible since a 1-valent configuration cannot be reached from a 0-valent one.
- Case : Consider an admissible run starting at and such that processor is the only faulty process and such that it is assumed to have crashed prior to starting the run. By the total correctness assumption, this admissible run must also be a deciding one. Let be its corresponding schedule and let be the resulting configuration. Clearly, the set does not have any common processors with events included in We can thus invoke Sub-lemma B.4 as portrayed in the diagram below:
Since is 0-valent, it must be that is 0-valent too (we have previously shown as part of Lemma A that its decision set cannot be ). Similarly, since is 1-valent, so must be Now note that and are both reachable from and have different valencies. must hence be bivalent. But is the outcome of a deciding run (by construction) and hence cannot be bivalent.
In both cases we reached a contradiction, demonstrating that must contain at least one bivalent configuration.
In order to prove the FLP impossibility result, we now use Lemma A and Lemma B to build an admissible non-deciding run for any consensus protocol that is totally correct in spite of one fault. We first build a particular class of admissible runs as follows:
- Maintain a queue of processors, originally in arbitrary order.
- For any given configuration, let its associated message buffer be ordered according to the time the messages were sent, earliest first.
- Define a stage to be a collection of one or more steps. A stage is completed when the first processor in the queue takes a step. In this step, the processor receives the earliest message destined to it in the message buffer, or if no messages are available. The processor is then moved to the back of the queue.
Note that this construction ensures that in any infinite sequence of such stages, every non-faulty processor (i.e., one that can take infinitely many steps) will receive every message sent to it. Such a run is hence admissible. We now derive a particular instance of a non-deciding run that belongs to this class of admissible runs. Let be any bivalent initial configuration (its existence is guaranteed by Lemma A), and repeat the following procedure for each bivalent configuration
- Let be the processor heading the processors queue at the time corresponding to configuration and the earliest message value destined to in the message buffer (if there is no such message, then ). Let be the event
- Lemma B guarantees the existence of a bivalent configuration reachable from through the application of a schedule where is the last event applied.
The previous procedure is actually an infinite loop characterizing an admissible run with no decision ever reached. Q.E.D.
Before we wrap up this chapter, we stress one more time the importance of defining clearly the system model attributes. For example, it suffices to substitute the deterministic nature of the Termination criterion with its randomized counterpart for the FLP result to stop holding as was proven in [1].
5. References
[1] Michael Ben-Or. Another advantage of free choice: Completely asynchronous agreement protocols. ACM, 1983.
[2] Cynthia Dwork and Nancy Lynch. Consensus in the presence of partial synchrony. Journal of the Association for Computing Machinery, 35(2):288-323, April 1988.
[3] Michael J. Fischer, Nancy Lynch, and Michael Merritt. Easy impossibility proofs for distributed consensus problems. ACM, 1985.
[4] Michael J. Fischer, Nancy Lynch, and Michael S. Paterson. Impossibility of distributed consensus with one faulty process. Journal of the Association for Computing Machinery, 32(2):374-382, April 1985.
[5] Leslie Lamport, Robert Shostak, and Marchall Pease. The byzantine generals problem. ACM Transactions on programming Languages and Systems, 4(3):382-401,July 1982.
[6] M. pease, R. Shostak, and L. Lamport. Reaching agreement in the presence of faults. Journal of the Association for Computing Machinery, 27(2):228-234, April 1980.
Tags: byzantine generals, consensus, FLP
No comments
Comments feed for this article
Trackback link: https://delfr.com/consensus-in-distributed-systems/trackback/