GreedyScheduler #4688
base: master
Conversation
```rust
/// Receive completed batches of transactions without blocking.
/// Returns (num_transactions, num_retryable_transactions) on success.
fn receive_completed(
```
This function is an exact copy of `PrioGraphScheduler::receive_completed`. Duplicating to avoid refactors in a PR intended to be backported to 2.1.
```rust
}
}

impl<Tx: TransactionWithMeta> GreedyScheduler<Tx> {
```
Everything in this impl block is an exact copy from `PrioGraphScheduler`. Duplicating to avoid refactors in a PR intended to be backported to 2.1.
```rust
}
}

fn try_schedule_transaction<Tx: TransactionWithMeta>(
```
This function is slightly different from the `PrioGraphScheduler` version; we do not have strict priority guards for unschedulable transactions. I intend to add this as an option in a future PR, but am keeping this scheduler simpler for now.
What do you mean by strict priority guards?
Example:

```mermaid
graph LR;
A((A));
AC((AC));
BC((BC));
A --> AC --> BC;
```
Let's say `C` is write-locked on thread 0, and we schedule `A` to thread 1. `AC` cannot be immediately scheduled because it has conflicts with multiple threads. `BC` can be immediately scheduled because it only conflicts with the `C` lock on thread 0.

Right now, we would schedule `BC`. A strict priority guard would mark `A` and `C` as blocked locks once we saw `AC`, and stop other txs from taking those locks, since the highest-priority item (`AC`) can't be scheduled at this time.

There are trade-offs to both approaches: strictness is more profitable if you can still fill the blocks, but it reduces parallelism and reduces the chances of filling blocks. Non-strictness allows these chances of out-of-order prioritization, and a less consistent UX.
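For illustration, here is a minimal, self-contained sketch of the strict-guard idea (the `Tx` struct, the `strict_pass` helper, and the account letters are hypothetical, not the PR's actual types):

```rust
use std::collections::HashSet;

struct Tx {
    name: &'static str,
    accounts: Vec<char>,
}

/// Returns which transactions a strict guard would schedule, scanning in priority order.
fn strict_pass(txs: &[Tx], schedulable: impl Fn(&Tx) -> bool) -> Vec<&'static str> {
    let mut blocked_accounts: HashSet<char> = HashSet::new();
    let mut scheduled = Vec::new();
    for tx in txs {
        // Anything touching an account reserved by a higher-priority, unschedulable tx is skipped.
        if tx.accounts.iter().any(|a| blocked_accounts.contains(a)) {
            continue;
        }
        if schedulable(tx) {
            scheduled.push(tx.name);
        } else {
            // Strict guard: reserve this tx's locks so lower-priority txs (like BC)
            // cannot take them before the blocked tx (AC) gets its chance.
            blocked_accounts.extend(tx.accounts.iter().copied());
        }
    }
    scheduled
}

fn main() {
    // Priority order A > AC > BC; pretend AC is unschedulable (conflicts with multiple threads).
    let txs = vec![
        Tx { name: "A", accounts: vec!['A'] },
        Tx { name: "AC", accounts: vec!['A', 'C'] },
        Tx { name: "BC", accounts: vec!['B', 'C'] },
    ];
    let scheduled = strict_pass(&txs, |tx| tx.name != "AC");
    // With the guard only A is scheduled; without it (current behavior), BC would also go through.
    assert_eq!(scheduled, vec!["A"]);
}
```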
Ok was thinking something along these lines, thanks!
There are sure to be some backport conflicts due to the lack of the generic transaction type in 2.1; the diff of this PR appears large.
```rust
// If there is a conflict with any of the transactions in the current batches,
// we should immediately send out the batches, so this transaction may be scheduled.
```
Also commenting that this is only necessary until SIMD-0083.
```rust
    transaction_state,
    &pre_lock_filter,
    &mut self.account_locks,
    schedulable_threads,
```
Using `schedulable_threads` here so that only threads that are not at capacity are scheduled to.
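For illustration, a minimal sketch of that capacity check (a plain `u64` bitmask stands in for the actual thread-set type; the numbers are made up):

```rust
// Sketch only: a plain u64 bitmask stands in for the actual thread-set type.
fn schedulable_threads(in_flight_cus_per_thread: &[u64], target_cu_per_thread: u64) -> u64 {
    let mut set = 0u64;
    for (thread_id, &cus) in in_flight_cus_per_thread.iter().enumerate() {
        // Only threads still under their in-flight CU target remain schedulable.
        if cus < target_cu_per_thread {
            set |= 1u64 << thread_id;
        }
    }
    set
}

fn main() {
    // Thread 1 is at capacity, so only threads 0, 2, and 3 can receive new work.
    let set = schedulable_threads(&[1_000_000, 3_200_000, 0, 2_900_000], 3_000_000);
    assert_eq!(set, 0b1101);
}
```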
Backports to the beta branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule. Exceptions include CI/metrics changes, CLI improvements and documentation updates on a case by case basis.
Force-pushed from 21a83fd to 9d815d1.
What's the rollout plan? Make the scheduler opt-in for v2.2, default in v2.3? Any dreams of backporting to v2.1?
```rust
Err(SchedulerError::DisconnectedRecvChannel(_)) => {}
Err(SchedulerError::DisconnectedSendChannel(_)) => {
    warn!("Unexpected worker disconnect from scheduler")
```

```rust
if use_greedy_scheduler {
```
Possible to dedup any of this code?
EDIT: the original reply assumed you were referring to somewhere else in this PR because I didn't look...

Probably possible. Tried to keep things as dead-simple and straightforward as possible for now, since this is for a backport. Type differences come into play here, so we can't just assign a scheduler variable and go from there unless we introduce `dyn` trait objects.

I'll probably declutter this by just using a macro, since it is literally all the same once we create the scheduler object.
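A sketch of what the macro approach might look like (the trait, types, and macro name below are stand-ins, not the actual banking-stage code):

```rust
// Hedged sketch: stamp out the identical "construct scheduler, then run" code
// for each concrete scheduler type without introducing `dyn` dispatch.
macro_rules! run_with_scheduler {
    ($scheduler:expr) => {{
        let scheduler = $scheduler;
        // ...everything from here on is identical for both scheduler types...
        scheduler.run()
    }};
}

trait Run {
    fn run(self) -> &'static str;
}

struct GreedyScheduler;
impl Run for GreedyScheduler {
    fn run(self) -> &'static str { "greedy" }
}

struct PrioGraphScheduler;
impl Run for PrioGraphScheduler {
    fn run(self) -> &'static str { "prio-graph" }
}

fn main() {
    let use_greedy_scheduler = true;
    let result = if use_greedy_scheduler {
        run_with_scheduler!(GreedyScheduler)
    } else {
        run_with_scheduler!(PrioGraphScheduler)
    };
    println!("{result}");
}
```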
simple is good
Fully intend to backport to v2.1 - already have the de-conflicted branch ready to go as soon as this merges.
Force-pushed from a182cb3 to a515487.
```rust
impl Default for GreedySchedulerConfig {
    fn default() -> Self {
        Self {
            max_scheduled_cus: MAX_BLOCK_UNITS,
```
Are there actual use cases for it to be != MAX_BLOCK_UNITS (other than in tests)? Maybe we can save one config item.
```rust
    fn default() -> Self {
        Self {
            max_scheduled_cus: MAX_BLOCK_UNITS,
            max_scanned_transactions_per_scheduling_pass: 1000,
```
nit: why isn't `1_000` a const like the others?
```rust
        Self {
            max_scheduled_cus: MAX_BLOCK_UNITS,
            max_scanned_transactions_per_scheduling_pass: 1000,
            target_transactions_per_batch: TARGET_NUM_TRANSACTIONS_PER_BATCH,
```
`TARGET_NUM_TRANSACTIONS_PER_BATCH` is really an argument of Consumer, which the Scheduler already has. Could save another config item if the Scheduler can get this number from its Consumer.
`Consumer` doesn't determine batches though, the scheduler does. Consumer just executes what it's given.
`TARGET_NUM_TRANSACTIONS_PER_BATCH` is defined in Consumer though.
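For reference, a hedged sketch of what the config could look like with every default pulled into a named constant, per the nits above (field names come from this thread; the actual struct, constants, and values in the PR may differ):

```rust
// Hedged sketch only: values below are illustrative, not authoritative.
const MAX_BLOCK_UNITS: u64 = 48_000_000;
const MAX_SCANNED_TRANSACTIONS_PER_SCHEDULING_PASS: usize = 1_000;
const TARGET_NUM_TRANSACTIONS_PER_BATCH: usize = 64; // illustrative value

struct GreedySchedulerConfig {
    max_scheduled_cus: u64,
    max_scanned_transactions_per_scheduling_pass: usize,
    target_transactions_per_batch: usize,
}

impl Default for GreedySchedulerConfig {
    fn default() -> Self {
        Self {
            max_scheduled_cus: MAX_BLOCK_UNITS,
            max_scanned_transactions_per_scheduling_pass:
                MAX_SCANNED_TRANSACTIONS_PER_SCHEDULING_PASS,
            target_transactions_per_batch: TARGET_NUM_TRANSACTIONS_PER_BATCH,
        }
    }
}

fn main() {
    let config = GreedySchedulerConfig::default();
    assert_eq!(config.max_scheduled_cus, 48_000_000);
    assert_eq!(config.max_scanned_transactions_per_scheduling_pass, 1_000);
    assert_eq!(config.target_transactions_per_batch, 64);
}
```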
```rust
        container.remove_by_id(id.id);
    }
    Err(TransactionSchedulingError::UnschedulableConflicts) => {
        num_unschedulable += 1;
```
Suggested change:

```diff
-num_unschedulable += 1;
+saturating_add_assign!(num_unschedulable, 1);
```
This is limited by our buffer size; we cannot possibly store more than u64::MAX transactions in a buffer, so there's no risk of this overflowing. I prefer to avoid adding if statements everywhere in our code for something that will never happen. Even if it's not strictly guaranteed by the single line of code, it is by the context.
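For context, the difference between the two forms in isolation (plain integers, not the scheduler code):

```rust
fn main() {
    // saturating_add clamps at the type's max instead of overflowing...
    let mut near_max: u64 = u64::MAX - 1;
    near_max = near_max.saturating_add(5);
    assert_eq!(near_max, u64::MAX);

    // ...while a plain `+=` would panic (debug) or wrap (release) on overflow. The
    // argument above is that this counter is bounded far below u64::MAX by the
    // container size, so the plain form is fine in practice.
    let mut num_unschedulable: u64 = 0;
    num_unschedulable += 1;
    assert_eq!(num_unschedulable, 1);
}
```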
Force-pushed from a515487 to 5cbd7c0.
Can you elaborate more on why you think prio-graph didn't perform well and why greedy can avoid those issues? To me, it looks like greedy is nice because it limits in-flight CUs to 25% of the block space. Seems like we could do the same thing for prio-graph. Also seems like the prio-graph scheduler could scan more txs; it's only configured for 1000 right now in `max_scanned_transactions_per_scheduling_pass`.
```rust
    fn schedule<S: StateContainer<Tx>>(
        &mut self,
        container: &mut S,
        _pre_graph_filter: impl Fn(&[&Tx], &mut [bool]),
```
What do the filter metrics look like for the prio-graph scheduler? Seems like we should probably run the status cache checks at the very least?
The point of this filter was to prevent the graph from getting out of priority order. You can imagine:

```mermaid
graph LR;
A1((A1)) --> A2((A2)) --> A3((A3)) --> A4((A4)) --> A5((A5))
B1((B1)) --> B2((B2))
C1((C1)) --> C2((C2))
```

Let's say all the A txs have very high priority relative to B and C. All of that work will land on thread 1, and there's other stuff on the other threads.

The way prio-graph works, we'd pop `A1`, `B1`, `C1` from the top of the graph (no conflicts right now) and schedule them in a batch `(A1, B1, C1)`. But if `A1` fails status checks, `A2` is effectively delayed by one batch. So we run the checks before graph insertion, `A1` is dropped, and we schedule `(A2, B1, C1)`.

The greedy scheduler approaches this differently, and we don't make a hard attempt at batching like that - we just always try to schedule the highest priority. So the first batch would be `A1`, then `A2`. Even though `A2` is delayed, because `A1` wasn't batched with anything else we can just let the worker run the check.
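A stripped-down sketch of that "run the checks before graph insertion" idea (the `Tx` struct and the filter closure are illustrative stand-ins; the real filter runs status-cache and other checks over slices of transactions):

```rust
// Sketch: drop transactions that fail cheap pre-checks before they enter the
// priority graph, so a doomed high-priority tx doesn't delay the tx behind it.
struct Tx {
    id: u64,
    priority: u64,
}

fn filter_then_schedule(candidates: Vec<Tx>, pre_graph_filter: impl Fn(&Tx) -> bool) -> Vec<Tx> {
    let mut passed: Vec<Tx> = candidates
        .into_iter()
        .filter(|tx| pre_graph_filter(tx)) // "A1" dropped here if it fails status checks
        .collect();
    // Only filtered transactions are considered for graph insertion / batching.
    passed.sort_by(|a, b| b.priority.cmp(&a.priority));
    passed
}

fn main() {
    let txs = vec![
        Tx { id: 1, priority: 100 }, // "A1" - pretend it fails the status check
        Tx { id: 2, priority: 90 },  // "A2"
        Tx { id: 3, priority: 10 },  // "B1"
    ];
    let scheduled = filter_then_schedule(txs, |tx| tx.id != 1);
    assert_eq!(scheduled.iter().map(|t| t.id).collect::<Vec<_>>(), vec![2, 3]);
}
```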
Ah, this explanation makes a lot of sense, thank you! So for greedy we could be sending batches with transactions that would get filtered out, but that's fine because we've already queued up the next high-priority transaction in the worker's queue and the worker will just chew through its batches.
```rust
impl<Tx: TransactionWithMeta> GreedyScheduler<Tx> {
    pub(crate) fn new(
        consume_work_senders: Vec<Sender<ConsumeWork<Tx>>>,
```
nit: add some asserts that this vec cannot be empty and can't be longer than 64 elements
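Something like this hedged sketch of the suggested asserts (the helper name is made up; the cap of 64 matches a u64-backed thread-set bitmask, one bit per worker thread):

```rust
// Hedged sketch; `assert_valid_worker_count` stands in for checks inside
// `GreedyScheduler::new`.
const MAX_THREADS: usize = 64;

fn assert_valid_worker_count<T>(consume_work_senders: &[T]) {
    let num_threads = consume_work_senders.len();
    assert!(num_threads > 0, "scheduler requires at least one worker thread");
    assert!(
        num_threads <= MAX_THREADS,
        "scheduler supports at most {} worker threads",
        MAX_THREADS
    );
}

fn main() {
    let senders: Vec<()> = vec![(); 4]; // e.g. 4 worker channels
    assert_valid_worker_count(&senders);
}
```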
```rust
impl Default for GreedySchedulerConfig {
    fn default() -> Self {
        Self {
            max_scheduled_cus: MAX_BLOCK_UNITS / 4,
```
Why divide by 4? Is it so that we only have a quarter of a block's worth of transactions in flight at any given time? Can you add a comment explaining how you got to that constant or why you think it's reasonable?

Btw, we divide this number by the number of worker threads (default 4, but unbounded because the total number of banking threads can be overridden by the `SOLANA_BANKING_THREADS` env var), so by default that works out to 3M CUs per thread, which can fit two max-CU (1.4M) transactions. I think this is fine, but we should consider having a min value of 1.4M for `max_cu_per_thread`.

And a final nit: I think I would prefer a different name like `target_total_in_flight_cus` (and also `max_cu_per_thread` -> `target_in_flight_cu_per_thread`) so that it's more clear that it's a "target" for "in-flight" CUs which can be exceeded.
These are "soft" limits. We go until we exceed, rather than denying until the limit is hit. So if there are 2_999_999 CUs in flight, we would technically be able to schedule a 1.4M-CU tx for a "true" max of 4_399_999. This means that as long as the limit isn't 0, we will always be able to schedule at least one tx.

Yeah, renaming them to `target` is great; I agree `max` is probably misleading due to the softness of the limits.

48M/4 was chosen because we saw this number working well - we want to expose this as a knob for operators to tune, but that's not happening in the first version of this.

We want this number to be as small as possible while we still fill blocks. It can't be too small, otherwise the workers idle too much, but we don't want to make it so large that a very high-priority tx we receive during our block(s) can't be scheduled because we scheduled all our block space at the beginning.
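A small sketch of the soft-limit semantics and the arithmetic above (the constants are the defaults mentioned in this thread; the `can_schedule` helper is illustrative):

```rust
// Constants from the discussion above; helper is a stand-in, not the real code.
const MAX_BLOCK_UNITS: u64 = 48_000_000;
const MAX_TX_CUS: u64 = 1_400_000;

fn can_schedule(in_flight_cus: u64, target_cus: u64) -> bool {
    // Soft limit: keep scheduling while strictly under the target, so the target
    // can be exceeded by at most one transaction's worth of CUs.
    in_flight_cus < target_cus
}

fn main() {
    let target_per_thread = (MAX_BLOCK_UNITS / 4) / 4; // 12M total target / 4 threads
    assert_eq!(target_per_thread, 3_000_000);

    // 2_999_999 CUs in flight: still under the target, so a max-CU tx can be scheduled,
    // giving a "true" peak of 2_999_999 + 1_400_000 = 4_399_999 CUs in flight.
    assert!(can_schedule(2_999_999, target_per_thread));
    assert_eq!(2_999_999 + MAX_TX_CUS, 4_399_999);

    // At or above the target, nothing new is scheduled until in-flight work completes.
    assert!(!can_schedule(3_000_000, target_per_thread));
}
```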
```rust
}
}

/// Dead-simple scheduler that is efficient but **really** greedy.
```
Can you elaborate more on the fact that this scheduler will basically allocate up to X% (currently 25%) of the block space in a single pass with non-conflicting transactions? I think it's important to call out that it's greedy in terms of allocating work evenly to the consume workers, but not greedy in terms of only picking high pri transactions as one might assume. Then on each subsequent schedule operation, it will fill any space that was freed from completed work in order to keep workers at a consistent workload.
> but not greedy in terms of only picking high pri transactions as one might assume.

I need to change this comment - this WAS how it originally worked.
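A rough sketch of the greedy pass being described, under illustrative assumptions (a plain `BinaryHeap` of `(priority, cus)` pairs stands in for the real container; lock checks and thread selection are omitted):

```rust
use std::collections::BinaryHeap;

// Pop transactions in priority order and keep assigning work until the in-flight
// CU target is reached (or the scan limit is hit); later passes only top up
// whatever capacity completed work has freed.
fn greedy_pass(
    queue: &mut BinaryHeap<(u64 /* priority */, u64 /* cus */)>,
    in_flight_cus: &mut u64,
    target_in_flight_cus: u64,
    max_scanned: usize,
) -> Vec<(u64, u64)> {
    let mut scheduled = Vec::new();
    let mut scanned = 0;
    while *in_flight_cus < target_in_flight_cus && scanned < max_scanned {
        let Some((priority, cus)) = queue.pop() else { break };
        scanned += 1;
        // (lock checks / worker selection omitted in this sketch)
        *in_flight_cus += cus;
        scheduled.push((priority, cus));
    }
    scheduled
}

fn main() {
    let mut queue = BinaryHeap::from(vec![(30, 900_000), (20, 900_000), (10, 900_000)]);
    let mut in_flight = 0;

    // First pass: greedily fills up to (just past) the soft target.
    let first = greedy_pass(&mut queue, &mut in_flight, 1_500_000, 1_000);
    assert_eq!(first.len(), 2);

    // After some work completes, a later pass tops the workers back up.
    in_flight -= 900_000;
    let second = greedy_pass(&mut queue, &mut in_flight, 1_500_000, 1_000);
    assert_eq!(second.len(), 1);
}
```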
```rust
};

// If there is a conflict with any of the transactions in the current batches,
// we should immediately send out the batches, so this transaction may be scheduled.
```
This comment was a little confusing. Can you elaborate that we are sending batches immediately because we want to start processing any blocking transactions immediately so that we can get them completed asap and unblock this high-priority transaction?
Batches cannot have conflicting transactions (until SIMD-0083 activation). So we track which accounts are used in the batches we're currently building. If there's a conflict with those batches, we send the batches out to the workers now. That way the transaction that was conflicting with one of the batches could be inserted into a new batch.
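Roughly, the behavior being described, with illustrative types (not the PR's actual batching code):

```rust
use std::collections::HashSet;

// Track which accounts are locked by the in-progress batches, and flush the
// batches as soon as a new transaction conflicts with them so that transaction
// can start a fresh batch.
#[derive(Default)]
struct Batches {
    account_uses: HashSet<&'static str>,
    txs: Vec<&'static str>,
}

impl Batches {
    fn conflicts_with(&self, accounts: &[&'static str]) -> bool {
        accounts.iter().any(|a| self.account_uses.contains(a))
    }

    fn push(&mut self, tx: &'static str, accounts: &[&'static str]) {
        self.account_uses.extend(accounts);
        self.txs.push(tx);
    }

    fn send(&mut self) -> Vec<&'static str> {
        self.account_uses.clear();
        std::mem::take(&mut self.txs) // pretend these go to the workers
    }
}

fn main() {
    let mut batches = Batches::default();
    batches.push("tx1", &["alice", "bob"]);

    // "tx2" conflicts with "bob": send the current batches to the workers first...
    let accounts = ["bob", "carol"];
    if batches.conflicts_with(&accounts) {
        let sent = batches.send();
        assert_eq!(sent, vec!["tx1"]);
    }
    // ...then the conflicting transaction starts a new batch.
    batches.push("tx2", &accounts);
    assert_eq!(batches.txs, vec!["tx2"]);
}
```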
Yeah, I think we're on the same page here. Was just asking you to improve the code comment with my suggested elaboration (add yours too while you're at it 😉 )
I've added some commits to address review comments. I will rebase and fixup before merge, but am leaving them as separate commits so it is easier for you to see how the review comments were addressed. @jstarry @tao-stones lmk if you're happy with the fixes and I'll squash them into the appropriate commits (makes backporting easier).
```rust
self.send_batches(&mut batches)?;
assert_eq!(
    num_scheduled, num_sent,
    "number of scheduled and sent transactions must match"
);
```
Not sure why we need this assert. Also, shouldn't `num_sent` be incremented here?
Yeah looks like CI just failed here
`scheduled` means we put it into a batch. The `num_sent` assert is simply a sanity check to make sure we actually sent everything we scheduled to the workers, and didn't just drop it somewhere above.

Yes - I forgot to increment `num_sent` here, which caused a test to fail (as it should).

Fix: 98eaa98
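In outline, the invariant being asserted (illustrative, not the actual scheduler code):

```rust
// Count transactions as "scheduled" when they go into a batch and as "sent" when
// the batch is actually handed to a worker, then assert the two match at the end
// of the pass.
fn main() {
    let batches: Vec<Vec<&str>> = vec![vec!["tx1", "tx2"], vec!["tx3"]];

    let num_scheduled: usize = batches.iter().map(|b| b.len()).sum();

    let mut num_sent = 0usize;
    for batch in &batches {
        // send_batches(): pretend each batch is sent to its worker channel here.
        num_sent += batch.len(); // the missing increment fixed in 98eaa98
    }

    assert_eq!(
        num_scheduled, num_sent,
        "number of scheduled and sent transactions must match"
    );
}
```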
lgtm - thanks for explaining all the details
Force-pushed from 98eaa98 to 18115e8.
^ Recent push squashed all the commits that came from review into the appropriate original commits.
Force-pushed from 18115e8 to e861d97.
Problem

Summary of Changes

- `Scheduler` trait to abstract specific scheduling logic in the `SchedulerController`
- `GreedyScheduler` for new scheduling logic
- `GreedyScheduler` use to CLI via `--block-production-method central-scheduler-greedy`

Fixes #