
GreedyScheduler #4688

Open · wants to merge 8 commits into master

Conversation

apfitzge

Problem

  • PrioGraphScheduler performance struggles when network activity gets very high (see recent activity)

Summary of Changes

  • Add a Scheduler trait to abstract specific scheduling logic in the SchedulerController (a rough sketch of the trait shape is included below)
  • Add GreedyScheduler for new scheduling logic
  • Hook up GreedyScheduler to the CLI via --block-production-method central-scheduler-greedy

Fixes #
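For context, here is a rough sketch of the kind of trait shape this abstraction implies. This is illustrative only, assuming simplified placeholder types; the actual trait and signatures added by the PR differ in detail.

```rust
// Illustrative sketch only, not the trait added by this PR; TransactionContainer
// and SchedulingSummary are placeholder types so the sketch stands alone.
pub trait Scheduler<Tx> {
    /// Schedule transactions from the container onto worker threads,
    /// returning a summary of what was scheduled.
    fn schedule(&mut self, container: &mut TransactionContainer<Tx>) -> SchedulingSummary;

    /// Receive completed batches of transactions without blocking.
    /// Returns (num_transactions, num_retryable_transactions) on success.
    fn receive_completed(&mut self, container: &mut TransactionContainer<Tx>) -> (usize, usize);
}

pub struct TransactionContainer<Tx>(pub Vec<Tx>);

pub struct SchedulingSummary {
    pub num_scheduled: usize,
    pub num_unschedulable: usize,
}
```

With a trait like this, the SchedulerController can drive either PrioGraphScheduler or GreedyScheduler depending on the --block-production-method flag.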


/// Receive completed batches of transactions without blocking.
/// Returns (num_transactions, num_retryable_transactions) on success.
fn receive_completed(
apfitzge (Author)

This function is an exact copy of PrioGraphScheduler::receive_completed.
Duplicating to avoid refactors in a PR intended to be backported to 2.1.

}
}

impl<Tx: TransactionWithMeta> GreedyScheduler<Tx> {
apfitzge (Author)

Everything in this impl block is an exact copy from PrioGraphScheduler.
Duplicating to avoid refactors in a PR intended to be backported to 2.1.

}
}

fn try_schedule_transaction<Tx: TransactionWithMeta>(
apfitzge (Author)

This function is slightly different from the PrioGraphScheduler version; we do not have strict priority guards for unschedulable transactions.

I intend to add this as an option in a future PR, but keeping this scheduler simpler for now.


What do you mean by strict priority guards?

apfitzge (Author)

Example:

graph LR;
A((A));
AC((AC));
BC((BC));

A --> AC --> BC;

Let's say C is write-locked on thread 0, and we schedule A to thread 1.
AC cannot be immediately scheduled because it has conflicts with multiple threads.
BC can be immediately scheduled because it only conflicts with the C lock on thread 0.

Right now, we would schedule BC. A strict priority guard would mark A and C as blocked locks once we saw AC, and stop other txs from taking those locks since the highest priority item (AC) can't be scheduled at this time.
There are trade-offs to both approaches - strictness is more profitable if you can still fill the blocks, but it reduces parallelism and reduces the chances of filling blocks. Non-strictness leads to these chances of out-of-order prioritization, and a less consistent UX.
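To make the difference concrete, here is a tiny self-contained sketch of the two behaviors described above. The Guard type, try_take, and the account/thread bookkeeping are made-up illustrations, not the scheduler's actual code.

```rust
use std::collections::HashSet;

type Account = &'static str;

struct Guard {
    strict: bool,
    blocked: HashSet<Account>, // locks reserved for a blocked higher-priority tx
}

impl Guard {
    /// Returns true if the transaction may be scheduled right now.
    fn try_take(&mut self, accounts: &[Account], conflicting_threads: usize) -> bool {
        // A lock reserved for a blocked higher-priority tx blocks this tx too.
        if accounts.iter().any(|a| self.blocked.contains(a)) {
            return false;
        }
        // Conflicts with more than one thread => cannot be scheduled immediately.
        if conflicting_threads > 1 {
            if self.strict {
                // Strict guard: reserve these locks so lower-priority txs
                // cannot take them while the higher-priority tx is blocked.
                self.blocked.extend(accounts.iter().copied());
            }
            return false;
        }
        true
    }
}

fn main() {
    // From the example above: AC conflicts with two threads, BC only with thread 0.
    for strict in [false, true] {
        let mut guard = Guard { strict, blocked: HashSet::new() };
        let ac = guard.try_take(&["A", "C"], 2);
        let bc = guard.try_take(&["B", "C"], 1);
        // strict=false: BC is schedulable; strict=true: BC is blocked on C.
        println!("strict={strict}: AC schedulable={ac}, BC schedulable={bc}");
    }
}
```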


Ok was thinking something along these lines, thanks!

@apfitzge

apfitzge commented Jan 29, 2025

There are sure to be some backport conflicts due to the lack of the generic transaction type in 2.1.

The diff of this PR appears large:

  • ~250 lines are for the greedy scheduler structure and logic itself.
  • ~150 lines are duplicated from the PrioGraphScheduler (noted in review comments).
  • ~350 lines of tests (much of which is duplicated from PrioGraphScheduler).

// If there is a conflict with any of the transactions in the current batches,
// we should immediately send out the batches, so this transaction may be scheduled.
apfitzge (Author)

Also commenting that this is only necessary until SIMD-0083.

transaction_state,
&pre_lock_filter,
&mut self.account_locks,
schedulable_threads,
apfitzge (Author)

Using schedulable_threads here so that only threads that are not at capacity are scheduled to.

@apfitzge apfitzge added the v2.1 Backport to v2.1 branch label Jan 29, 2025
@apfitzge apfitzge self-assigned this Jan 29, 2025

mergify bot commented Jan 29, 2025

Backports to the beta branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule. Exceptions include CI/metrics changes, CLI improvements and documentation updates on a case by case basis.

@apfitzge apfitzge marked this pull request as ready for review January 29, 2025 21:10
@apfitzge apfitzge force-pushed the greedy_fuck_scheduler branch from 21a83fd to 9d815d1 Compare January 30, 2025 21:48

@bw-solana bw-solana left a comment


What's the rollout plan? Make the scheduler opt in for v2.2, default in v2.3? Any dreams of backporting to v2.1?

core/src/banking_stage.rs (outdated; resolved)
Err(SchedulerError::DisconnectedRecvChannel(_)) => {}
Err(SchedulerError::DisconnectedSendChannel(_)) => {
warn!("Unexpected worker disconnect from scheduler")
if use_greedy_scheduler {


Possible to dedup any of this code?

@apfitzge apfitzge Jan 31, 2025

EDIT: my original reply assumed you were referring to somewhere else in this PR because I didn't look...

Probably possible. Tried to keep things as dead-simple and straightforward as possible for now since this is for a backport.

Type differences come into play here, so we can't just assign a scheduler and go from there unless we introduce dyn dispatch.
I'll probably declutter this by just using a macro... since it is literally all the same once we create the scheduler object.


simple is good

@apfitzge

> What's the rollout plan? Make the scheduler opt in for v2.2, default in v2.3? Any dreams of backporting to v2.1?

Fully intend to backport to v2.1 - already have the de-conflicted branch ready to go as soon as this merges.
Opt-in for now.

@apfitzge apfitzge force-pushed the greedy_fuck_scheduler branch 2 times, most recently from a182cb3 to a515487 Compare January 31, 2025 22:52
impl Default for GreedySchedulerConfig {
fn default() -> Self {
Self {
max_scheduled_cus: MAX_BLOCK_UNITS,


are there actual use cases for it != MAX_BLOCK_UNITS (other than in tests)? Maybe we can save one config item.

fn default() -> Self {
Self {
max_scheduled_cus: MAX_BLOCK_UNITS,
max_scanned_transactions_per_scheduling_pass: 1000,


nit: why isn't 1_000 a const like the others?

Self {
max_scheduled_cus: MAX_BLOCK_UNITS,
max_scanned_transactions_per_scheduling_pass: 1000,
target_transactions_per_batch: TARGET_NUM_TRANSACTIONS_PER_BATCH,


TARGET_NUM_TRANSACTIONS_PER_BATCH is really an argument of Consumer, which Scheduler already has. Could save another config item if Scheduler can get this number from its Consumer.

apfitzge (Author)

Consumer doesn't determine batches though, scheduler does. Consumer just executes what it's given.


TARGET_NUM_TRANSACTIONS_PER_BATCH is defined in Consumer tho

core/src/banking_stage/transaction_scheduler/scheduler.rs (outdated; resolved)
container.remove_by_id(id.id);
}
Err(TransactionSchedulingError::UnschedulableConflicts) => {
num_unschedulable += 1;


Suggested change:
- num_unschedulable += 1;
+ saturating_add_assign!(num_unschedulable, 1);

apfitzge (Author)

this is limited by our buffer size. we cannot possibly store more than u64::MAX transactions in a buffer, so there's no risk of this overflowing.

Prefer to avoid adding if statements everywhere in our code for something that will never happen. Even if it's not strictly guaranteed by the single code line, it is by the context.

@apfitzge apfitzge force-pushed the greedy_fuck_scheduler branch from a515487 to 5cbd7c0 Compare February 3, 2025 15:44

@jstarry jstarry left a comment


Can you elaborate more on why you think prio-graph didn't perform well and why greedy can avoid those issues? To me, it looks like greedy is nice because it limits in-flight cu's to 25% of the block space. Seems like we could do the same thing for prio-graph. Also seems like prio-graph scheduler could scan more tx's.. only configured for 1000 right now in max_scanned_transactions_per_scheduling_pass.

fn schedule<S: StateContainer<Tx>>(
&mut self,
container: &mut S,
_pre_graph_filter: impl Fn(&[&Tx], &mut [bool]),

what do the filter metrics look like for the prio graph scheduler? Seems like we should probably run the status cache checks at the very least?

@apfitzge apfitzge Feb 4, 2025

the point of this filter was to prevent the graph from getting out of priority order - you can imagine:

graph LR;
A1((A1)) --> A2((A2)) --> A3((A3)) --> A4((A4)) --> A5((A5))
B1((B1)) --> B2((B2))
C1((C1)) --> C2((C2))

Let's say all the A txs have very high priority relative to B and C. All of them will land on thread 1, and there's other stuff on the other threads.

The way prio-graph works is we'd pop A1, B1, C1 from the top of graph (no conflicts rn), and schedule them in a batch (A1, B1, C1).
But if A1 fails status checks, A2 is effectively delayed one batch now.
So we run the checks before graph insertion, A1 is dropped, and we schedule (A2, B1, C1).

Greedy scheduler approaches this differently and we don't take the hard attempt at batching like that - we just always try to schedule the highest priority.
So the first batch would be A1, then A2. So even though A2 is delayed, because A1 wasn't batched with anything else we can just let the worker run the check.


Ah this explanation makes a lot of sense, thank you! So for greedy we could be sending batches with transactions that would get filtered out but that's fine because we already queued up the next high priority transaction in the worker's queue and the worker will just chew through its batches.


impl<Tx: TransactionWithMeta> GreedyScheduler<Tx> {
pub(crate) fn new(
consume_work_senders: Vec<Sender<ConsumeWork<Tx>>>,

nit: add some asserts that this vec cannot be empty and can't be longer than 64 elements


impl Default for GreedySchedulerConfig {
fn default() -> Self {
Self {
max_scheduled_cus: MAX_BLOCK_UNITS / 4,

Why divide by 4? Is it so that we only have a quarter of a block's worth of transactions in flight at any given time? Can you add a comment explaining how you got to that constant or why you think it's reasonable?

Btw we divide this number by the number of worker threads (default 4, but unbounded because the total number of banking threads can be overridden by the SOLANA_BANKING_THREADS env var) so by default max_scheduled_cus is 3M cu's.. which can fit 2 max cu (1.4M) transactions. I think this is fine but we should consider having a min value of 1.4M for max_cu_per_thread

And final nit: I think I would prefer a different name like target_total_in_flight_cus (and also max_cu_per_thread -> target_in_flight_cu_per_thread) so that it's more clear that it's a "target" for "in-flight" cu's which can be exceeded.

apfitzge (Author)

These are "soft" limits. We go until we exceed, not denying until hit. So if there's 2_999_999 in-flight, we would technically be able to schedule a 1.4m tx for a "true" max of 4_399_999. This means so long as it's not 0, we will always be able to schedule at least one tx.

Yeah renaming them to target is great, I agree max is probably misleading due to the softness of the limits.

48M/4 was chosen as we just saw this number working well - we want to expose this as a knob for operators to tune, but that's not happening in the first version of this.
We want this number to be as small as possible while we still fill blocks. It can't be too small, otherwise the workers idle too much, but we don't want to make it so large that any very high-priority tx we receive in our block(s) can't be scheduled because we scheduled all our blocks at the beginning.
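A minimal sketch of that soft-limit behavior, using made-up names and the per-thread numbers from this thread (not the scheduler's actual code):

```rust
// Illustrative only: we keep scheduling until in-flight CUs *exceed* the target,
// so any non-zero target always admits at least one more transaction.
struct ThreadState {
    in_flight_cus: u64,
    target_in_flight_cus: u64, // per-thread target, e.g. ~3M CUs with the defaults discussed here
}

impl ThreadState {
    fn can_schedule(&self) -> bool {
        // Checked before adding the next transaction's cost, not after.
        self.in_flight_cus < self.target_in_flight_cus
    }

    fn schedule(&mut self, tx_cus: u64) {
        self.in_flight_cus += tx_cus;
    }
}

fn main() {
    let mut thread = ThreadState {
        in_flight_cus: 2_999_999,
        target_in_flight_cus: 3_000_000,
    };
    if thread.can_schedule() {
        // A 1.4M-CU tx still fits: the "true" in-flight total briefly
        // reaches 4_399_999, above the soft target.
        thread.schedule(1_400_000);
    }
    println!("in-flight CUs: {}", thread.in_flight_cus);
}
```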

apfitzge (Author)

rename to target: f946950 + bd3fc09

}
}

/// Dead-simple scheduler that is efficient but **really** greedy.

Can you elaborate more on the fact that this scheduler will basically allocate up to X% (currently 25%) of the block space in a single pass with non-conflicting transactions? I think it's important to call out that it's greedy in terms of allocating work evenly to the consume workers, but not greedy in terms of only picking high pri transactions as one might assume. Then on each subsequent schedule operation, it will fill any space that was freed from completed work in order to keep workers at a consistent workload.

apfitzge (Author)

> but not greedy in terms of only picking high pri transactions as one might assume.

I need to change this comment - this WAS how it originally worked.


};

// If there is a conflict with any of the transactions in the current batches,
// we should immediately send out the batches, so this transaction may be scheduled.

This comment was a little confusing. Can you elaborate that we are sending batches immediately because we want to start processing any blocking transactions immediately so that we can get them completed asap and unblock this high-priority transaction?

apfitzge (Author)

Batches cannot have conflicting transactions (until SIMD-0083 activation). So we track which accounts are used in the batches we're currently building. If there's a conflict with those batches, we send the batches out to the workers now. That way the transaction that was conflicting with one of the batches could be inserted into a new batch.
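A minimal sketch of that flush-on-conflict behavior, with made-up names (this is not the PR's actual scheduling code):

```rust
use std::collections::HashSet;

// Illustrative only: while building batches, a transaction that conflicts with an
// in-progress batch forces the batches to be sent immediately, so the conflicting
// transaction can then be placed into a fresh batch.
struct Tx {
    name: &'static str,
    accounts: Vec<&'static str>,
}

fn build_batches(txs: &[Tx], target_batch_size: usize) -> Vec<Vec<&'static str>> {
    let mut batches = Vec::new();
    let mut batch: Vec<&'static str> = Vec::new();
    let mut batch_accounts: HashSet<&'static str> = HashSet::new();

    for tx in txs {
        // Batches may not contain conflicting transactions (until SIMD-0083),
        // so a conflict with the in-progress batch flushes it out now.
        if !batch.is_empty() && tx.accounts.iter().any(|a| batch_accounts.contains(a)) {
            batches.push(std::mem::take(&mut batch));
            batch_accounts.clear();
        }
        batch.push(tx.name);
        batch_accounts.extend(tx.accounts.iter().copied());
        if batch.len() >= target_batch_size {
            batches.push(std::mem::take(&mut batch));
            batch_accounts.clear();
        }
    }
    if !batch.is_empty() {
        batches.push(batch);
    }
    batches
}

fn main() {
    // A1 and A2 both write account "A"; B1 and C1 touch unrelated accounts.
    let txs = [
        Tx { name: "A1", accounts: vec!["A"] },
        Tx { name: "A2", accounts: vec!["A"] },
        Tx { name: "B1", accounts: vec!["B"] },
        Tx { name: "C1", accounts: vec!["C"] },
    ];
    // Prints [["A1"], ["A2", "B1", "C1"]]: the A1/A2 conflict sends the first
    // batch out early instead of holding A2 behind it in the same batch.
    println!("{:?}", build_batches(&txs, 64));
}
```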


Yeah, I think we're on the same page here. Was just asking you to improve the code comment with my suggested elaboration (add yours too while you're at it 😉 )


@apfitzge

apfitzge commented Feb 4, 2025

I've added some commits to address review comments. I will rebase and fixup before merge, but leaving as separate commits so it is easier for you to see how the review comments were addressed.

@jstarry @tao-stones lmk if happy with the fixes and I'll squash them into the appropriate commits (makes backporting easier).

Comment on lines 197 to 201
self.send_batches(&mut batches)?;
assert_eq!(
num_scheduled, num_sent,
"number of scheduled and sent transactions must match"
);

not sure why we need this assert. Also, shouldn't num_sent be incremented here?


Yeah looks like CI just failed here

apfitzge (Author)

scheduled means we put it into a batch.
num_sent assert is simply a sanity check to make sure we actually sent everything we scheduled to the workers, and didn't just drop them somewhere above.

Yes - forgot to increment num_sent here, which caused a test to fail (it should)

Fix: 98eaa98

tao-stones
tao-stones previously approved these changes Feb 5, 2025
@tao-stones tao-stones left a comment

lgtm - thanks for explaining all the details

@apfitzge apfitzge force-pushed the greedy_fuck_scheduler branch from 98eaa98 to 18115e8 Compare February 5, 2025 15:43
@apfitzge

apfitzge commented Feb 5, 2025

^recent push squashed all the commits that came from review into the appropriate original commits

jstarry
jstarry previously approved these changes Feb 5, 2025
@apfitzge apfitzge dismissed stale reviews from jstarry and tao-stones via e861d97 February 5, 2025 23:00
@apfitzge apfitzge force-pushed the greedy_fuck_scheduler branch from 18115e8 to e861d97 Compare February 5, 2025 23:00
Labels
v2.1 Backport to v2.1 branch