Fault tolerance with redundancy and leader election #80

Merged · 33 commits · Jan 30, 2024

Commits
300f942
Added heartbeat bully leader election example
edwardalee Aug 13, 2023
8d9437b
Typo
edwardalee Aug 13, 2023
044fbaf
Another fault tolerance strategy
edwardalee Aug 15, 2023
f92d08d
Have switch failing
edwardalee Aug 15, 2023
ec1ae5d
Ping NRP
edwardalee Aug 15, 2023
d572b19
Regularized how to test failures
edwardalee Aug 16, 2023
60c4461
Formatted
edwardalee Aug 16, 2023
30e5ae0
Switch supports broadcast ping
edwardalee Aug 16, 2023
30bae7d
Request new NRP handled
edwardalee Aug 16, 2023
022ff27
Added detection of network partitioning and test showing that we don'…
edwardalee Aug 16, 2023
28898c9
Removed FIXME
edwardalee Aug 19, 2023
cb28a05
Updated FIXMEs and formatted
edwardalee Aug 19, 2023
573107d
Updated FIXMEs
edwardalee Aug 19, 2023
2ad8cd0
Refined leader election and fixed bug
edwardalee Aug 28, 2023
ef9269d
Fix reference and update FIXME about not running federated
edwardalee Sep 8, 2023
220d6d1
Merge branch 'main' into leader-election
edwardalee Sep 8, 2023
e10b187
Added times to printf and timeout
edwardalee Sep 8, 2023
85737db
Made example federated
edwardalee Sep 8, 2023
435aa26
Use federated execution
edwardalee Sep 13, 2023
22d2847
Merge branch 'main' into leader-election
edwardalee Sep 13, 2023
ae076cf
Formatted
edwardalee Sep 13, 2023
fe2c26a
Merge branch 'main' into leader-election
lhstrh Sep 14, 2023
fb24c19
Merge branch 'main' into leader-election
edwardalee Oct 16, 2023
650ac03
Use multiport and message server
edwardalee Oct 31, 2023
aef6305
Updated to use multiports and make federated
edwardalee Jan 15, 2024
358b7a2
Avoid overwriting heartbeat with ping
edwardalee Jan 15, 2024
a89e281
Format
edwardalee Jan 15, 2024
eccf08d
Tuned docs and formatting
edwardalee Jan 16, 2024
676ce07
Added READMEs
edwardalee Jan 16, 2024
ef261c8
Timeout pending
edwardalee Jan 24, 2024
e37c662
Merge branch 'main' into leader-election
edwardalee Jan 24, 2024
6f86704
Merge branch 'main' into leader-election
edwardalee Jan 30, 2024
39b70cb
Update to remove FIXMEs
edwardalee Jan 30, 2024
202 changes: 202 additions & 0 deletions examples/C/src/leader-election/HeartbeatBully.lf
/**
* This program models a redundant, fault-tolerant system in which the primary node, if and when it
* fails, is replaced by one of several backup nodes. The protocol is described in this paper:
*
* Bjarne Johansson; Mats Rågberger; Alessandro V. Papadopoulos; Thomas Nolte, "Heartbeat Bully:
* Failure Detection and Redundancy Role Selection for Network-Centric Controller," Proc. of the
* 46th Annual Conference of the IEEE Industrial Electronics Society (IECON), 18-21 October 2020.
* https://doi.org/10.1109/IECON43393.2020.9254494
*
* The program has a bank of redundant nodes where exactly one is the primary node and the rest are
* backups. The primary node is always the one with the highest bank index that has not failed. The
* primary sends a heartbeat message once per second (by default). When the primary fails, a leader
* election protocol selects a new primary which then starts sending heartbeat messages. The program
* is set up so that each primary fails after sending three heartbeat messages. When all nodes have
* failed, the program exits.
*
* This example is designed to be run as a federated program with decentralized coordination.
* However, as of this writing, bugs in the federated code generator cause the program to fail
* because all federates get the same bank_index == 0. This may be related to these bugs:
*
* - https://github.com/lf-lang/lingua-franca/issues/1961
* - https://github.com/lf-lang/lingua-franca/issues/1962
*
* When these bugs are fixed, then the federated version should operate exactly the same as the
* unfederated version except that it will become possible to kill the federates instead of having
* them fail on their own. The program should also be extended to include STP violation handlers to
* deal with the fundamental CAL theorem limitations, where unexpected network delays make it
* impossible to execute the program as designed. For example, if the network becomes partitioned,
* then it becomes possible to have two primary nodes simultaneously active.
*
* @author Edward A. Lee
* @author Marjan Sirjani
*/
target C {
timeout: 30 s
}

preamble {=
#include "platform.h" // Defines PRINTF_TIME
enum message_type {
heartbeat,
reveal,
sorry
};
typedef struct message_t {
enum message_type type;
int id;
} message_t;
=}

reactor Node(
bank_index: int = 0,
num_nodes: int = 3,
heartbeat_period: time = 1 s,
max_missed_heartbeats: int = 2,
primary_fails_after_heartbeats: int = 3) {
input[num_nodes] in: message_t
output[num_nodes] out: message_t

state heartbeats_missed: int = 0
state primary_heartbeats_counter: int = 0

initial mode Idle {
reaction(startup) -> reset(Backup), reset(Primary) {=
lf_print(PRINTF_TIME ": Starting node %d", lf_time_logical_elapsed(), self->bank_index);
if (self->bank_index == self->num_nodes - 1) {
lf_set_mode(Primary);
} else {
lf_set_mode(Backup);
}
=}
}

mode Backup {
timer t(heartbeat_period, heartbeat_period)
reaction(in) -> out, reset(Prospect) {=
int primary_id = -1;
for (int i = 0; i < in_width; i++) {
if (in[i]->is_present && in[i]->value.id != self->bank_index) {
if (in[i]->value.type == heartbeat) {
if (primary_id >= 0) {
lf_print_error("Multiple primaries detected!!");
}
primary_id = in[i]->value.id;
lf_print(PRINTF_TIME ": Node %d received heartbeat from node %d.", lf_time_logical_elapsed(), self->bank_index, primary_id);
self->heartbeats_missed = 0;
} else if (in[i]->value.type == reveal && in[i]->value.id < self->bank_index) {
// NOTE: This will not occur if the LF semantics are followed because
// all nodes will (logically) simultaneously detect heartbeat failure and
// transition to the Prospect mode. But we include this anyway in case
// a federated version experiences a fault.

// Send a sorry message.
message_t message;
message.type = sorry;
message.id = self->bank_index;
lf_set(out[in[i]->value.id], message);
lf_print(PRINTF_TIME ": Node %d sends sorry to node %d", lf_time_logical_elapsed(), self->bank_index, in[i]->value.id);
// Go to Prospect mode to send reveal to any higher-priority nodes.
lf_set_mode(Prospect);
}
}
}
// FIXME
// =} STP (0) {=
// FIXME: What should we do here.
// lf_print_error("Node %d had an STP violation. Ignoring heartbeat as if it didn't arrive at all.", self->bank_index);
=}

reaction(t) -> reset(Prospect) {=
if (self->heartbeats_missed > self->max_missed_heartbeats) {
lf_set_mode(Prospect);
}
// Increment the counter so if it's not reset to 0 by the next time,
// we detect the missed heartbeat.
self->heartbeats_missed++;
=}
}

mode Primary {
timer heartbeat(0, heartbeat_period)
reaction(heartbeat) -> out, reset(Failed) {=
if (self->primary_heartbeats_counter++ >= self->primary_fails_after_heartbeats) {
// Stop sending heartbeats.
lf_print(PRINTF_TIME ": **** Primary node %d fails.", lf_time_logical_elapsed(), self->bank_index);
lf_set_mode(Failed);
} else {
lf_print(PRINTF_TIME ": Primary node %d sends heartbeat.", lf_time_logical_elapsed(), self->bank_index);
for (int i = 0; i < out_width; i++) {
if (i != self->bank_index) {
message_t message;
message.type = heartbeat;
message.id = self->bank_index;
lf_set(out[i], message);
}
}
}
=}
}

mode Failed {
}

mode Prospect {
logical action wait_for_sorry
reaction(reset) -> out, wait_for_sorry {=
lf_print(PRINTF_TIME ": ***** Node %d entered Prospect mode.", lf_time_logical_elapsed(), self->bank_index);
// Send a reveal message with my ID in a bid to become primary.
// NOTE: It is not necessary to send to nodes that have a lower
// priority than this node, but the connection is broadcast, so
// we send to all.
message_t message;
message.type = reveal;
message.id = self->bank_index;
for (int i = self->bank_index + 1; i < self->num_nodes; i++) {
lf_print(PRINTF_TIME ": Node %d sends reveal to node %d", lf_time_logical_elapsed(), self->bank_index, i);
lf_set(out[i], message);
}
// The reveal message is delayed by heartbeat_period, and if
// there is a sorry response, it too will be delayed by heartbeat_period,
// so the total logical delay is twice heartbeat_period.
lf_schedule(wait_for_sorry, 2 * self->heartbeat_period);
=}

reaction(in) -> out {=
for (int i = 0; i < in_width; i++) {
if (in[i]->is_present && in[i]->value.type == reveal && in[i]->value.id < self->bank_index) {
// Send a sorry message.
message_t message;
message.type = sorry;
message.id = self->bank_index;
lf_set(out[in[i]->value.id], message);
lf_print(PRINTF_TIME ": Node %d sends sorry to node %d", lf_time_logical_elapsed(), self->bank_index, in[i]->value.id);
}
}
=}

reaction(wait_for_sorry) in -> reset(Backup), reset(Primary) {=
// Check for sorry messages.
// Sorry messages are guaranteed to be logically simultaneous
// with the wait_for_sorry event, so we just need to check for
// presence of sorry inputs.
int i;
for (i = 0; i < in_width; i++) {
if (in[i]->is_present && in[i]->value.type == sorry) {
// A sorry message arrived. Go to Backup mode.
lf_set_mode(Backup);
break;
}
}
if (i == in_width) {
// No sorry message arrived. Go to Primary mode.
lf_set_mode(Primary);
}
=}
}
}

federated reactor(num_nodes: int = 4, heartbeat_period: time = 1 s) {
nodes = new[num_nodes] Node(num_nodes=num_nodes, heartbeat_period=heartbeat_period)
nodes.out -> interleaved(nodes.in) after heartbeat_period
}