Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wire unified scheduler into banking experimentally #3946

Draft
wants to merge 27 commits into
base: master
Choose a base branch
from

Conversation

ryoqun
Copy link
Member

@ryoqun ryoqun commented Dec 5, 2024

Problem

Currently, the unified scheduler can't be used as a block-production method.

Summary of Changes

(See Proposition/justification section at #3832 for justification of this pr in the first place)

This pr (and any spin-off ones) implements all the necessary plumbing from the cli to the innermost (SchedulingStateMachine) of the unified scheduler with minimal functionality. Note that, even after all the changes in this pr is merged, unified scheduler isn't fully functional nor performant nor secure for the banking stage in any sense.

as part of the plumbing, new thread management mechanism is introduced inside unified-scheduler-pool, parallel to the already-existing one for the unified scheduler in the block verification. As a quick recap, unified scheduler assigns independent thread pools to each active banks for ongoing forks while verifying blocks.

on the other hand, unified scheduler maintains one thread pool at most (as a singleton object) for the current working_bank of PohRecorder, when producing blocks. This means these two kinds of thread pools (one for block verification, other for block production) are managed differently. Other than that, most of core code (thread looping and SchedulingStateMachine) is still shared among them with limited number of branching (i.e. I'd say it's still unified lol).

This differentiated behavior at the higher layer is intentional to accommodate banking-stage specific requirements:

(1) existing relevant code base is strongly assuming non-concurrent block production even if there are multiple competing forks, which i think a justifiable design decision for impl simplicity. This means the channel receivers (nonvote_receiver, tpu_receiver, gossip_receiver) for incoming transactions are kind of a singleton set of resources and aren't fork-aware. Also, unified scheduler needs some preprocessing (SanitizedTransaction/TransactioView creation, alt resolution, and Pubkey->UsageQueue lookups). With all these considerations in mind, the channels are directly connected to unified-scheduler's handler thread main loop of a particular scheduler instance, in order to do the offloaded processing there multi-threadedly, while avoiding extra messaging pop for efficiency.

the actual integration are implemented with above-mentioned design goals. at validator startup, if BankingStage::new() is branched into new_unified_scheduler(), it does minimal setup even without spawning threads. Specifically, it just calls scheduler_pool.register_banking_stage(). its notable arg is the callback closure to convert BankingPacketBatches to SanitizedTransactions, which is actually run on handler threads.
The closure type is slightly involved because of the need to respawn thread pool from time to time to mitigate unbounded UsageQueueLoader growth. So, the closure is nested: the inner closure for convention is called BatchConverter. The outer closure to create the converter is called BatchConverterCreator. Along with the closure, the actual transaction-incoming channel and small monitor object (called BankingStageMonitor) are registered to the unified scheduler pool. These args are retained as-is, typed as BlockProductionSchedulerRespawner, internally. These separation incurs a single dynamic dispatch, but intentionally done for separation of concerns.

(2) transaction buffering (and prioritization-ordering) is desired to be run prior to the first bank creation and this buffer must be carried over to the first bank of leader slots and the next, etc. So, block-verifying unified scheduler's tight coupling of 1 Bank-to-1 thread pool isn't appropriate here.

towards that end, now thread pool can be started without a bank by making SchedulingContext::bank Option<_>-ed. Then, it enters into the buffering state immediately after the setup. Previously, this wasn't possible.

Also, scheduling session needed some adjustment to reach to safe suspension point quickly, so that next child bank could resume timely. As a quick recap, SchedulingStateMachine didn't previously buffer any non-conflicting ready-to-execute transactions. That meant these transactions are buffered inside crossbeam channels. And, its (potentially large) buffer has to be cleared before switching to the next bank, to maintain runtime invariant.

So, SchedulingStateMachine now starts to buffer transaction internally up-to fixed cap instead. And scheduler thread's main loop introduces new state for session, to be paired to the new buffering mechanism: session_pausing = true

Note that, this scheduling resumption mechanism doesn't work nicely when switching forks in the middle of leader slots. Specifically, it doesn't recover already-processed higher-paying txes, which are yet to be executed at the new fork. This behavior is also true for central scheduler.

As a bonus, session_resetting = true is also introduced to clear any remaining unprocessed buffered transactions after leader schedule.

The actual change list:
  1. preparations
  1. actual plumbing
  • Introduce SchedulingMode::{BlockVerification, BlockProduction} finally!
    • Make BankForks/InstalledScheduler aware of SchedulingMode.
  • Make SchedulingStateMachine buffer ready-to-execute tasks internally up to max_executing_task
  • Selectively gate the cli value of unified-scheduler for --block-production-method, with override flag (--enable-experimental-block-production-method)
  • Make PohRecorder::reset() return overwritten BankWithScheduler to reap now-unused scheduler from the bank wrapper.
  • Add minimal local-cluster test: test_randomly_mixed_block_production_methods_between_bootstrap_and_not
  • Make agave-ledger-tool simulate-block-production support --block-production-method unified-scheduler
  • Make solana-banking-bench simulate-block-production support --block-production-method unified-scheduler

Note that this pr is extracted from #2325

@ryoqun ryoqun force-pushed the unified-scheduler-in-banking-stage branch 7 times, most recently from 16c8b91 to ca24a20 Compare December 6, 2024 06:48
@ryoqun ryoqun force-pushed the unified-scheduler-in-banking-stage branch 4 times, most recently from 3af5124 to 9772f93 Compare December 10, 2024 13:07

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nothing new in sdk, please 🙏

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay, you spotted that. :) 723f765

Comment on lines 897 to 908
// A write lock for the poh recorder must be grabbed for the entire duration of inserting new
// tpu bank into the bank forks. That's because any buffered transactions could immediately be
// executed after the bank forks update, when unified scheduler is enabled for block
// production. If transactions were executed prematurely, the unified scheduler would be hit
// with false errors due to having no bank in the poh recorder otherwise.
let mut poh_recorder = poh_recorder.write().unwrap();

let tpu_bank = bank_forks
.write()
.unwrap()
.insert_with_scheduling_mode(SchedulingMode::BlockProduction, tpu_bank);
poh_recorder.set_bank(tpu_bank, track_transaction_indexes);
Copy link
Member Author

@ryoqun ryoqun Dec 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this (backref: #4123 (comment))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a concern here is that elsewhere we grab these locks in the opposite order.
I'm fairly certain we don't but given these are both so extensively used (smelly) I'm not convinced entirely that we do not.

Eventually there should be some other delivery system for TPU banks, other than locking poh. I'm not quite sure what that should look like though. Any thoughts?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a concern here is that elsewhere we grab these locks in the opposite order.

how about this? this removes the opposite locking order: d804e35 (this builds on top of revert: dbf4db4)

Eventually there should be some other delivery system for TPU banks, other than locking poh. I'm not quite sure what that should look like though. Any thoughts?

hmm, i don't think locking poh_recorder is so bad in general.. come to think of it, it's a bit odd that poh is being a singleton. i think we just can instantiate them on demand for new tpu bank (this most of time will end up there's one instance effectively given any time of leader schedule. but this should simplify or relax locking requirements).

Comment on lines 1442 to 1453
// bank_forks could be write-locked temporarily here, if unified scheduler is enabled for
// block production. That's because ReplayStage started inside Tvu could immediately try to
// insert a tpu bank into bank_forks, like with local development clusters consisting of a
// single staked node. Such insertion could be blocked with the lock held by unified
// scheduler until it's fully setup by the banking stage. This is intentional because
// completion of insertion needs to strictly correspond with the initiation of producing
// blocks in unified scheduler. Because Tpu setup follows Tvu setup, there's a corner case
// where the banking stage hasn't yet called register_banking_stage() to finish the unified
// scheduler setup.
// This means any setup which accesses bank_forks must be done before
// or after the short duration of Tvu and Tpu setups to avoid deadlocks. So,
// RootBankCache needs to be created here in advance as being one of such setups.
Copy link
Member Author

@ryoqun ryoqun Dec 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this (backref: #4123 (comment))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the above comment on the lock order - feel it may be worth investing in a better way to deliver working banks to tpu to avoid these weird edge cases where we need to be very careful.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems that that wall of text couldn't help and i crossed your code-smell threshold nevertheless.. ;)

how about this? 110c1bc (note: ci won't pass)

if it's okay for you that unified scheduler code breaks out of the BankingStage code organization more aggressively like that, i can remove this weird edge case: 9839ab8

feel it may be worth investing in a better way to deliver working banks to tpu to avoid these weird edge cases where we need to be very careful.

fyi, this is completely different from the update_bank_forks_and_poh_recorder_for_new_tpu_bank thing (tpu delivery as you said), which i'll plan to address later with a different commit.

@ryoqun ryoqun requested a review from apfitzge December 18, 2024 16:15
@ryoqun ryoqun force-pushed the unified-scheduler-in-banking-stage branch from 426c794 to bac313a Compare December 20, 2024 00:47
@ryoqun ryoqun force-pushed the unified-scheduler-in-banking-stage branch from 266b03e to 3df47ba Compare January 12, 2025 04:50
@ryoqun ryoqun force-pushed the unified-scheduler-in-banking-stage branch from 763587f to 5a4a928 Compare January 14, 2025 05:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants