-
Notifications
You must be signed in to change notification settings - Fork 4.5k
Wen restart aggregate last voted fork slots #33892
Wen restart aggregate last voted fork slots #33892
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #33892 +/- ##
========================================
Coverage 81.7% 81.8%
========================================
Files 836 835 -1
Lines 224834 225690 +856
========================================
+ Hits 183847 184687 +840
- Misses 40987 41003 +16 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took an initial pass, will review the aggregate logic next.
I think if you wanted to you could split this PR into the repair changes that consume slots_to_repair_for_wen_restart
and then the part that updates it.
- Avoid waiting after first shred when repair is in wen_restart
…than shred_fetch_stage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took a first pass and have some suggestions.
core/src/repair/repair_service.rs
Outdated
@@ -630,7 +646,7 @@ impl RepairService { | |||
* timestamp().saturating_sub(slot_meta.first_shred_timestamp) | |||
/ 1_000; | |||
if ticks_since_first_insert | |||
< reference_tick.saturating_add(DEFER_REPAIR_THRESHOLD_TICKS) | |||
< reference_tick.saturating_add(defer_repair_threshold_ticks) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we bypass the reference_tick
comparison completely if add_delay_after_first_shred == false
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
core/src/repair/repair_service.rs
Outdated
@@ -613,7 +622,14 @@ impl RepairService { | |||
slot: Slot, | |||
slot_meta: &SlotMeta, | |||
max_repairs: usize, | |||
add_delay_after_first_shred: bool, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: maybe change name to throttle_requests_by_shred_tick
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
core/src/repair/repair_service.rs
Outdated
@@ -668,6 +684,7 @@ impl RepairService { | |||
slot, | |||
&slot_meta, | |||
max_repairs - repairs.len(), | |||
true, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: could you add the param name in a comment like /*add_delay_after_first_shred*/ true
for readability?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
core/src/repair/repair_service.rs
Outdated
@@ -213,6 +213,8 @@ pub struct RepairInfo { | |||
pub repair_validators: Option<HashSet<Pubkey>>, | |||
// Validators which should be given priority when serving | |||
pub repair_whitelist: Arc<RwLock<HashSet<Pubkey>>>, | |||
// A given list of slots to repair when in wen_restart | |||
pub slots_to_repair_for_wen_restart: Option<Arc<RwLock<Vec<Slot>>>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: s/slots_to_repair_for_wen_restart/wen_restart_repair_slots/
for brevity
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
core/src/tvu.rs
Outdated
@@ -205,6 +206,7 @@ impl Tvu { | |||
repair_whitelist: tvu_config.repair_whitelist, | |||
cluster_info: cluster_info.clone(), | |||
cluster_slots: cluster_slots.clone(), | |||
slots_to_repair_for_wen_restart: slots_to_repair_for_wen_restart.clone(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you need to clone()
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I don't now, removed.
let mut slots_stake_map = HashMap::new(); | ||
for slot in last_voted_fork_slots { | ||
if slot <= &root_slot { | ||
break; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is last_voted_fork_slots
always sorted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it's sorted in Gossip code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would be good to add an assert here that it's sorted to make the assumptions explicit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is that if it's not sorted, assertion doesn't help you because you exited early already. I'm letting it loop on all slots, a loop on vec should be really fast.
wen-restart/src/wen_restart.rs
Outdated
Ok(()) | ||
} | ||
|
||
fn aggregate_restart_last_voted_fork_slots( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think exit
from Validator:new
should be piped down to this level so the loop can be interrupted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
wen-restart/src/wen_restart.rs
Outdated
RestartState::Init => RestartState::LastVotedForkSlots, | ||
RestartState::LastVotedForkSlots => RestartState::HeaviestFork, | ||
RestartState::HeaviestFork => RestartState::Done, | ||
_ => panic!("Unexpected state {:?}", new_progress.state()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: s/_/RestartState::Done
so compilation will break if a new enum value is added.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
gossip/src/crds_value.rs
Outdated
pub fn new(from: Pubkey, now: u64, last_voted_hash: Hash, shred_version: u16) -> Self { | ||
Self { | ||
// This number is MAX_CRDS_OBJECT_SIZE - empty serialized RestartLastVotedForkSlots. | ||
const MAX_SPACE: usize = 824; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You may be able to define this in terms of the other constants and use const_assert_eq!()
to make this less fragile. There are some similar patterns in legacy.rs
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I'll take a look.
{ | ||
*wen_restart_repair_slots.write().unwrap() = filtered_slots; | ||
} | ||
write_wen_restart_records(wen_restart_path, progress)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this write handled by the caller in wait_for_wen_restart()
?
solana/wen-restart/src/wen_restart.rs
Line 220 in 1ceda56
increment_and_write_wen_restart_records(wen_restart_path, &mut progress)?; |
For less complicated code and more modular testing, I think:
- We leave all the record keeping into
pub fn wait_for_wen_restart()
, sowen_restart_path
never needs to be passed into these state transition functions. This also means for testing we don't have to continually test for all errors that records are written towen_restart_path
(we only need one test for that), we just have to test that the inner state transition function returns the expected state. - Each of the state transition functions called here
solana/wen-restart/src/wen_restart.rs
Lines 197 to 219 in 1ceda56
RestartState::Init => send_restart_last_voted_fork_slots( last_vote.clone(), blockstore.clone(), cluster_info.clone(), &mut progress, )?, RestartState::LastVotedForkSlots => aggregate_restart_last_voted_fork_slots( wen_restart_path, wait_for_supermajority_threshold_percent, cluster_info.clone(), bank_forks.clone(), wen_restart_repair_slots.clone().unwrap(), exit.clone(), &mut progress, )?, RestartState::HeaviestFork => { warn!("Not implemented yet, make it empty to complete test") } RestartState::FinishedSnapshot => return Err("Not implemented!".into()), RestartState::GeneratingSnapshot => return Err("Not implemented!".into()), RestartState::WaitingForSupermajority => return Err("Not implemented!".into()), RestartState::Done => return Ok(()), } aggregate_restart_last_voted_fork_slots()
should have anenum AggregateForksError()
. This achieves a few things:
- For debugging, we know where each of the errors originated
- For testing, we can just test that every error returns the expected state. Then from 1. above, we don't have to worry about testing that that it's written to
wen_restart_path
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I originally put all book keeping in wait_for_wen_restart, but then realized that AggregateLastVotedForkSlots might span several hours when we wait for validators to join the restart, then I decided we want to write received LastVotedForkSlots inside this loop so we can speed up internal state population in case the validator restarts.
What's your take on this? We can of course rely on others sending us the Gossip messages as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another reason we do want to keep record of the Gossip messages we received is that we can maybe debug later why the restart didn't work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of adding one error type per internal function, I added one WenRestartError for all functions in the crate, so we can do ?; everywhere. How does this look?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I originally put all book keeping in wait_for_wen_restart, but then realized that AggregateLastVotedForkSlots might span several hours when we wait for validators to join the restart, then I decided we want to write received LastVotedForkSlots inside this loop so we can speed up internal state population in case the validator restarts.
You're totally right, let's keep the record writing inside the function then, and test those records are properly written
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of adding one error type per internal function, I added one WenRestartError for all functions in the crate, so we can do ?; everywhere. How does this look?
Think it's better to still organize per state transition for testing, see comment below
wen-restart/src/wen_restart.rs
Outdated
#[error("Protobuf Decode error: {0}")] | ||
DecodeError(#[from] prost::DecodeError), | ||
#[error("Protobuf Encode error: {0}")] | ||
EncodeError(#[from] prost::EncodeError), | ||
#[error("Exiting")] | ||
Exiting, | ||
#[error("Invalid Last Vote Type: {0:?}")] | ||
InvalidLastVoteType(VoteTransaction), | ||
#[error("File IO error: {0}")] | ||
IoError(#[from] std::io::Error), | ||
#[error("Missing last voted fork slots")] | ||
MissingLastVotedForkSlots, | ||
#[error("Hash parse error: {0}")] | ||
ParseHashError(#[from] solana_sdk::hash::ParseHashError), | ||
#[error("Pubkey parse error: {0}")] | ||
PubkeyParseError(#[from] solana_sdk::pubkey::ParsePubkeyError), | ||
#[error("Gossip push RestartLastVotedForkSlotsError: {0}")] | ||
RestartLastVotedForkSlotsError(#[from] RestartLastVotedForkSlotsError), | ||
#[error("Unexpected state: {0:?}")] | ||
UnexpectedState(wen_restart_proto::State), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be better to organize these by state transition:
pub enum SendLastVotedForkError {
}
pub enum LastVotedForkSlotsError {
}
- Then
wait_for_wen_restart()
still returnsBox<dyn std::error::Error>
- For testing for each state transition, we make sure all of their particular errors write properly to the record and we can restart
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, when you say "write properly to the record", do you mean writing to the logs or to the protobuf as well? Since some of the errors might be transient, I didn't plan to write the errors to the protobuf.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I didn't mean write the error out. I mean the progress
data structure you write is what you expect after the error happens.
For every single error, for every single state transition
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did the change but it actually felt wrong IMHO, several reasons:
- complicating public API for the sake of testing
pub enum xxxError is public interface, if you think about how it's used, all people see from outside the module is wait_for_wen_restart, so it feels natural to have WenRestartError returned, it feels unnatural to have several Error enum returned, leaking our implementation details (which they shouldn't care about). - harder to write tests
I could of course test each function individually, but I do have to test they work in wait_for_wen_restart. If I only need to test is_err() then it's all fine, but if I want to test for specific error (e.g. file permission wrong), I had to do a downcast on dyn error, I had to constantly look back to error definitions to see what type of error I'm now expecting - common functions harder to implement and change
The fact that we do need to write the protobuf means I need to call write_wen_restart() from several places, but they will return different xxxError types, so I need to create yet another WriteWenRestartError and convert to other Error types, this is code duplication and unnecessary conversion. If we move part of the function to another internal method, we need to create yet another Error struct?
All in all, this feels wrong, maybe I'm misunderstanding what you described.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pub enum xxxError is public interface, if you think about how it's used, all people see from outside the module is wait_for_wen_restart, so it feels natural to have WenRestartError returned, it feels unnatural to have several Error enum returned, leaking our implementation details (which they shouldn't care about).
Here we're not writing a standalone library so I think a boxed error is fine. The purpose of this error is to make attribution of errors/debugging easier, which categorizing the errors does. We can now trace exactly which step in the restart went wrong. I think we could also introduce a top level WenRestartError
that wraps the errors like so:
enum WenRestartError {
SendLastVotedForkError(SendLastVotedForkError),
AggregateLastVotedForkSlotsError(AggregateLastVotedForkSlotsError),
...
}
I could of course test each function individually, but I do have to test they work in wait_for_wen_restart. If I only need to test is_err() then it's all fine, but if I want to test for specific error (e.g. file permission wrong), I had to do a downcast on dyn error, I had to constantly look back to error definitions to see what type of error I'm now expecting
Hmm I don't think you have to test every error case in wait_for_wen_restart()
. I think if the state transition functions are a-> b -> c
, then:
- You just have to test that each transition
a
,b
, andc
individually will run successfully to "completion" across all cases. "Completion" in this case means the ending output passes the verification for the initial input of the next state transition. Concretely this means testing in the happy case the state is correct, and that on all declared errors, they can recover and then run to completion. - From 1, inductively, you can then conclude that a -> b -> c will run correctly.
- All you need to test then in
wait_for_wen_restart()
is that the order of callsa -> b -> c
is correct and the arguments are being passed correctly. Concretely this means just testing one happy path.
The fact that we do need to write the protobuf means I need to call write_wen_restart() from several places, but they will return different xxxError types, so I need to create yet another WriteWenRestartError and convert to other Error types, this is code duplication and unnecessary conversion. If we move part of the function to another internal method, we need to create yet another Error struct?
I think a unique error for write_wen_restart
is fine, it is just a suberror of each state transition. This way we know which step the write failed in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about it more, I think I agree that failures should be tested on individual functions rather than wait_on_wen_restart(), so I changed accordingly.
The only thing I'm hesitating is that I still don't see how splitting WenRestartError into individual errors helps us. If the proto file is suddenly not writable, we need to solve that problem no matter what stage WenRestart is in, so I don't think replicating file IO errors everywhere really helps the user, it just complicates the public API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about this, we guarantee no errors on the state transitions 😄 . I.e. we guarantee the state going in and the state coming out of the records are always valid.
- All of the current errors are verification errors on the initial state:
solana/wen-restart/src/wen_restart.rs
Lines 88 to 92 in 645452f
return Err(WenRestartError::MissingLastVotedForkSlots); } } else { return Err(WenRestartError::InvalidLastVoteType(last_vote)); }
https://github.com/solana-labs/solana/pull/33892/files#diff-c32a5c15bdd23820489e6f8af8b305aee344228d0cb6789e554a988d57d54bd7R70-R71. These validations can be done in the beginning on entry into the state machine inwait_for_wen_restart
, as avalidate_state_1()
,validate_state_2()
, etc. These can then be unit tested. Ideally I think we would changeWenRestartProgress
to be an enum with a separate state for each step in this state machine, i.e.
enum WenRestartProgress {
Init,
SendForkSlots(SendForkSlots),
AggregateForkslots(AggregateForkSlots),
}
and then at the end of each state transition we just spit out the next iteration of the enum
- Then the only acceptable error is logging errors which we can then enforce is the only acceptable error.
- Then if we test that
write_wen_restart_records()
either writes a whole valid record OR doesn't write a record/i.e. errors, then we know the state must always be valid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can move the initialization errors into initialize(), but I can't guarantee there are no other errors after initialize(). For example, push_restart_last_voted_fork_slots() could return error. And we might want to throw out error in the future while running.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you're right on this 😃
If we don't want a separate enum per state transition, then at minimum I would like to be able to print out the backtrace of where that error occurred. Maybe this: https://stackoverflow.com/questions/75060346/how-can-i-get-file-and-line-number-where-an-error-was-returned
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to anyhow instead of thiserror for backtrace, also moved initialization errors into initialize().
This reverts commit 4c920cb.
wen-restart/src/wen_restart.rs
Outdated
} | ||
None => { | ||
// repair and restart option does not work without last voted slot. | ||
if let VoteTransaction::Vote(ref vote) = last_vote { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this error for validators who have a default tower at startup and haven't voted yet?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default tower has an empty slots list, so it will not fail on line 91, instead it will fail on line 88.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is that what we want or do we want validators with a default tower to be able to perform this restart as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, if you have no last vote information, not sure you can contribute anything to the restart. I guess those guys can copy the generated snapshot like today. Or we can try to use information from the ledger if tower is not available?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's fine, we should have a clear error message telling those guys they need to wait then.
I think it would be a better user experience if you verified the tower beforehand and printed a clear error message if the tower is not valid/missing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point, I added some error log there, see if this helps?
10d4afa
to
e3d0194
Compare
wen-restart/src/wen_restart.rs
Outdated
Cannot find last voted slot in the tower storage, it either means that this node has never \ | ||
voted or the tower storage is corrupted. Unfotunately, since WenRestart is a consensus protocol \ | ||
depending on each participant to send their last voted fork slots, your validator cannot participate.\ | ||
Please wait in the discord channel for the result of WenRestart, then generate a snapshot and use \ | ||
--wait-for-supermajority to restart the validator."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfotunately->Unfortunately
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
wen-restart/src/wen_restart.rs
Outdated
Cannot find last voted slot in the tower storage, it either means that this node has never \ | ||
voted or the tower storage is corrupted. Unfotunately, since WenRestart is a consensus protocol \ | ||
depending on each participant to send their last voted fork slots, your validator cannot participate.\ | ||
Please wait in the discord channel for the result of WenRestart, then generate a snapshot and use \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Please check discord for the conclusion of the WenRestart protocol, then generate a snapshot and use...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
wen-restart/src/wen_restart.rs
Outdated
) | ||
.err() | ||
.unwrap() | ||
.to_string() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we remove this .err().unwrap().to_string()
pattern from all the tests? We can just do an
assert_eq!(initialize(), Err())
or assert!(matches!(result, Err())
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately you can't directly use assert_eq or even assert_match on anyhow::Error:
https://users.rust-lang.org/t/issues-in-asserting-result/61198/5
I have to use a unwrap_error().downcast().unwrap(), would you prefer that instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wow so ugly 🤮
yeah, I think let's downcast and match on the actual error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Problem
Aggregate RestartLastVotedForkSlots in wen_restart and repair if necessary.
Summary of Changes
The Gossip changes are being reviewed separately in #33613.