
GreedyScheduler #4688

Open · wants to merge 8 commits into master

Conversation

apfitzge

Problem

  • PrioGraphScheduler performance struggles when network activity gets very high (see recent activity)

Summary of Changes

  • Add a Scheduler trait to abstract specific scheduling logic in the SchedulerController (a rough sketch of the trait shape is included below)
  • Add GreedyScheduler for new scheduling logic
  • Hook up GreedyScheduler to the CLI via --block-production-method central-scheduler-greedy

Fixes #
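For context, here is a rough sketch of the kind of trait shape this abstraction implies. This is illustrative only, assuming simplified placeholder types; the actual trait and signatures added by the PR differ in detail.

```rust
// Illustrative sketch only, not the trait added by this PR; TransactionContainer
// and SchedulingSummary are placeholder types so the sketch stands alone.
pub trait Scheduler<Tx> {
    /// Schedule transactions from the container onto worker threads,
    /// returning a summary of what was scheduled.
    fn schedule(&mut self, container: &mut TransactionContainer<Tx>) -> SchedulingSummary;

    /// Receive completed batches of transactions without blocking.
    /// Returns (num_transactions, num_retryable_transactions) on success.
    fn receive_completed(&mut self, container: &mut TransactionContainer<Tx>) -> (usize, usize);
}

pub struct TransactionContainer<Tx>(pub Vec<Tx>);

pub struct SchedulingSummary {
    pub num_scheduled: usize,
    pub num_unschedulable: usize,
}
```

With a trait like this, the SchedulerController can drive either PrioGraphScheduler or GreedyScheduler depending on the --block-production-method flag.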


/// Receive completed batches of transactions without blocking.
/// Returns (num_transactions, num_retryable_transactions) on success.
fn receive_completed(
apfitzge (Author)

This function is an exact copy of PrioGraphScheduler::receive_completed.
Duplicating to avoid refactors in a PR intended to be backported to 2.1.

}
}

impl<Tx: TransactionWithMeta> GreedyScheduler<Tx> {
apfitzge (Author)

Everything in this impl block is an exact copy from PrioGraphScheduler.
Duplicating to avoid refactors in a PR intended to be backported to 2.1.

}
}

fn try_schedule_transaction<Tx: TransactionWithMeta>(
apfitzge (Author)

This function is slightly different from the PrioGraphScheduler version; we do not have strict priority guards for unschedulable transactions.

I intend to add this as an option in a future PR, but keeping this scheduler simpler for now.


What do you mean by strict priority guards?

apfitzge (Author)

Example:

graph LR;
A((A));
AC((AC));
BC((BC));

A --> AC --> BC;

Let's say C is write-locked on thread 0, and we schedule A to thread 1.
AC cannot be immediately scheduled because it has conflicts with multiple threads.
BC can be immediately scheduled because it only conflicts with the C lock on thread 0.

Right now, we would schedule BC. A strict priority guard would mark A and C as blocked locks once we saw AC, and stop other txs from taking those locks since the highest priority item (AC) can't be scheduled at this time.
There are trade-offs to both approaches - strictness is more profitable if you can still fill the blocks, but it reduces parallelism and reduces the chances of filling blocks. Non-strictness leads to these chances of out-of-order prioritization, and a less consistent UX.
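To make the difference concrete, here is a tiny self-contained sketch of the two behaviors described above. The Guard type, try_take, and the account/thread bookkeeping are made-up illustrations, not the scheduler's actual code.

```rust
use std::collections::HashSet;

type Account = &'static str;

struct Guard {
    strict: bool,
    blocked: HashSet<Account>, // locks reserved for a blocked higher-priority tx
}

impl Guard {
    /// Returns true if the transaction may be scheduled right now.
    fn try_take(&mut self, accounts: &[Account], conflicting_threads: usize) -> bool {
        // A lock reserved for a blocked higher-priority tx blocks this tx too.
        if accounts.iter().any(|a| self.blocked.contains(a)) {
            return false;
        }
        // Conflicts with more than one thread => cannot be scheduled immediately.
        if conflicting_threads > 1 {
            if self.strict {
                // Strict guard: reserve these locks so lower-priority txs
                // cannot take them while the higher-priority tx is blocked.
                self.blocked.extend(accounts.iter().copied());
            }
            return false;
        }
        true
    }
}

fn main() {
    // From the example above: AC conflicts with two threads, BC only with thread 0.
    for strict in [false, true] {
        let mut guard = Guard { strict, blocked: HashSet::new() };
        let ac = guard.try_take(&["A", "C"], 2);
        let bc = guard.try_take(&["B", "C"], 1);
        // strict=false: BC is schedulable; strict=true: BC is blocked on C.
        println!("strict={strict}: AC schedulable={ac}, BC schedulable={bc}");
    }
}
```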


Ok was thinking something along these lines, thanks!

@apfitzge

apfitzge commented Jan 29, 2025

There are sure to be some backport conflicts due to the lack of the generic transaction type in 2.1.

The diff of this PR appears large:

  • ~250 lines are for the greedy scheduler structure and logic itself.
  • ~150 lines are duplicated from the PrioGraphScheduler (noted in review comments).
  • ~350 lines of tests (much of which is duplicated from PrioGraphScheduler).

// If there is a conflict with any of the transactions in the current batches,
// we should immediately send out the batches, so this transaction may be scheduled.
apfitzge (Author)

Also commenting that this is only necessary until SIMD-0083.

transaction_state,
&pre_lock_filter,
&mut self.account_locks,
schedulable_threads,
apfitzge (Author)

Using schedulable_threads here so that only threads that are not at capacity are scheduled to.

@apfitzge apfitzge added the v2.1 Backport to v2.1 branch label Jan 29, 2025
@apfitzge apfitzge self-assigned this Jan 29, 2025

mergify bot commented Jan 29, 2025

Backports to the beta branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule. Exceptions include CI/metrics changes, CLI improvements and documentation updates on a case by case basis.

@apfitzge apfitzge marked this pull request as ready for review January 29, 2025 21:10
@apfitzge apfitzge force-pushed the greedy_fuck_scheduler branch from 21a83fd to 9d815d1 Compare January 30, 2025 21:48

@bw-solana bw-solana left a comment


What's the rollout plan? Make the scheduler opt in for v2.2, default in v2.3? Any dreams of backporting to v2.1?

core/src/banking_stage.rs (outdated; resolved)
Err(SchedulerError::DisconnectedRecvChannel(_)) => {}
Err(SchedulerError::DisconnectedSendChannel(_)) => {
warn!("Unexpected worker disconnect from scheduler")
if use_greedy_scheduler {


Possible to dedup any of this code?

@apfitzge apfitzge Jan 31, 2025

EDIT: my original reply assumed you were referring to somewhere else in this PR because I didn't look...

Probably possible. Tried to keep things as dead-simple and straightforward as possible for now since this is for a backport.

Type differences come into play here, so we can't just assign a scheduler and go from there unless we introduce dyn dispatch.
I'll probably declutter this by just using a macro... since it is literally all the same once we create the scheduler object.


simple is good

@apfitzge

> What's the rollout plan? Make the scheduler opt in for v2.2, default in v2.3? Any dreams of backporting to v2.1?

Fully intend to backport to v2.1 - already have the de-conflicted branch ready to go as soon as this merges.
Opt-in for now.

@apfitzge apfitzge force-pushed the greedy_fuck_scheduler branch 2 times, most recently from a182cb3 to a515487 Compare January 31, 2025 22:52
impl Default for GreedySchedulerConfig {
fn default() -> Self {
Self {
max_scheduled_cus: MAX_BLOCK_UNITS,


are there actual use cases for it != MAX_BLOCK_UNITS (other than in tests)? Maybe we can save one config item.

fn default() -> Self {
Self {
max_scheduled_cus: MAX_BLOCK_UNITS,
max_scanned_transactions_per_scheduling_pass: 1000,


nit: why isn't 1_000 a const like the others?

Self {
max_scheduled_cus: MAX_BLOCK_UNITS,
max_scanned_transactions_per_scheduling_pass: 1000,
target_transactions_per_batch: TARGET_NUM_TRANSACTIONS_PER_BATCH,


TARGET_NUM_TRANSACTIONS_PER_BATCH is really an argument of Consumer, which Scheduler already has. Could save another config item if Scheduler can get this number from its Consumer.

apfitzge (Author)

Consumer doesn't determine batches though, scheduler does. Consumer just executes what it's given.


TARGET_NUM_TRANSACTIONS_PER_BATCH is defined in Consumer tho

core/src/banking_stage/transaction_scheduler/scheduler.rs (outdated; resolved)
container.remove_by_id(id.id);
}
Err(TransactionSchedulingError::UnschedulableConflicts) => {
num_unschedulable += 1;


Suggested change:
- num_unschedulable += 1;
+ saturating_add_assign!(num_unschedulable, 1);

apfitzge (Author)

this is limited by our buffer size. we cannot possibly store more than u64::MAX transactions in a buffer, so there's no risk of this overflowing.

Prefer to avoid adding if statements everywhere in our code for something that will never happen. Even if it's not strictly guaranteed by the single code line, it is by the context.

@apfitzge apfitzge force-pushed the greedy_fuck_scheduler branch from a515487 to 5cbd7c0 Compare February 3, 2025 15:44

@jstarry jstarry left a comment


Can you elaborate more on why you think prio-graph didn't perform well and why greedy can avoid those issues? To me, it looks like greedy is nice because it limits in-flight cu's to 25% of the block space. Seems like we could do the same thing for prio-graph. Also seems like prio-graph scheduler could scan more tx's.. only configured for 1000 right now in max_scanned_transactions_per_scheduling_pass.

fn schedule<S: StateContainer<Tx>>(
&mut self,
container: &mut S,
_pre_graph_filter: impl Fn(&[&Tx], &mut [bool]),

what do the filter metrics look like for the prio graph scheduler? Seems like we should probably run the status cache checks at the very least?

@apfitzge apfitzge Feb 4, 2025

the point of this filter was to prevent the graph from getting out of priority order - you can imagine:

graph LR;
A1((A1)) --> A2((A2)) --> A3((A3)) --> A4((A4)) --> A5((A5))
B1((B1)) --> B2((B2))
C1((C1)) --> C2((C2))

Let's say all the A txs have very high priority relative to B and C. All of them will land on thread 1, and there's other stuff on the other threads.

The way prio-graph works is we'd pop A1, B1, C1 from the top of graph (no conflicts rn), and schedule them in a batch (A1, B1, C1).
But if A1 fails status checks, A2 is effectively delayed one batch now.
So we run the checks before graph insertion, A1 is dropped, and we schedule (A2, B1, C1).

Greedy scheduler approaches this differently and we don't take the hard attempt at batching like that - we just always try to schedule the highest priority.
So the first batch would be A1, then A2. So even though A2 is delayed, because A1 wasn't batched with anything else we can just let the worker run the check.


Ah this explanation makes a lot of sense, thank you! So for greedy we could be sending batches with transactions that would get filtered out but that's fine because we already queued up the next high priority transaction in the worker's queue and the worker will just chew through its batches.


impl<Tx: TransactionWithMeta> GreedyScheduler<Tx> {
pub(crate) fn new(
consume_work_senders: Vec<Sender<ConsumeWork<Tx>>>,

nit: add some asserts that this vec cannot be empty and can't be longer than 64 elements


impl Default for GreedySchedulerConfig {
fn default() -> Self {
Self {
max_scheduled_cus: MAX_BLOCK_UNITS / 4,

Why divide by 4? Is it so that we only have a quarter of a block's worth of transactions in flight at any given time? Can you add a comment explaining how you got to that constant or why you think it's reasonable?

Btw we divide this number by the number of worker threads (default 4, but unbounded because the total number of banking threads can be overridden by the SOLANA_BANKING_THREADS env var) so by default max_scheduled_cus is 3M cu's.. which can fit 2 max cu (1.4M) transactions. I think this is fine but we should consider having a min value of 1.4M for max_cu_per_thread

And final nit: I think I would prefer a different name like target_total_in_flight_cus (and also max_cu_per_thread -> target_in_flight_cu_per_thread) so that it's more clear that it's a "target" for "in-flight" cu's which can be exceeded.

apfitzge (Author)

These are "soft" limits. We go until we exceed, not denying until hit. So if there's 2_999_999 in-flight, we would technically be able to schedule a 1.4m tx for a "true" max of 4_399_999. This means so long as it's not 0, we will always be able to schedule at least one tx.

Yeah renaming them to target is great, I agree max is probably misleading due to the softness of the limits.

48M/4 was chosen as we just saw this number working well - we want to expose this as a knob for operators to tune, but that's not happening in the first version of this.
We want this number to be as small as possible while we still fill blocks. It can't be too small, otherwise the workers idle too much, but we don't want to make it so large that any very high-priority tx we receive in our block(s) can't be scheduled because we scheduled all our blocks at the beginning.
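A minimal sketch of that soft-limit behavior, using made-up names and the per-thread numbers from this thread (not the scheduler's actual code):

```rust
// Illustrative only: we keep scheduling until in-flight CUs *exceed* the target,
// so any non-zero target always admits at least one more transaction.
struct ThreadState {
    in_flight_cus: u64,
    target_in_flight_cus: u64, // per-thread target, e.g. ~3M CUs with the defaults discussed here
}

impl ThreadState {
    fn can_schedule(&self) -> bool {
        // Checked before adding the next transaction's cost, not after.
        self.in_flight_cus < self.target_in_flight_cus
    }

    fn schedule(&mut self, tx_cus: u64) {
        self.in_flight_cus += tx_cus;
    }
}

fn main() {
    let mut thread = ThreadState {
        in_flight_cus: 2_999_999,
        target_in_flight_cus: 3_000_000,
    };
    if thread.can_schedule() {
        // A 1.4M-CU tx still fits: the "true" in-flight total briefly
        // reaches 4_399_999, above the soft target.
        thread.schedule(1_400_000);
    }
    println!("in-flight CUs: {}", thread.in_flight_cus);
}
```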

apfitzge (Author)

rename to target: f946950 + bd3fc09

}
}

/// Dead-simple scheduler that is efficient but **really** greedy.

Can you elaborate more on the fact that this scheduler will basically allocate up to X% (currently 25%) of the block space in a single pass with non-conflicting transactions? I think it's important to call out that it's greedy in terms of allocating work evenly to the consume workers, but not greedy in terms of only picking high pri transactions as one might assume. Then on each subsequent schedule operation, it will fill any space that was freed from completed work in order to keep workers at a consistent workload.

apfitzge (Author)

> but not greedy in terms of only picking high pri transactions as one might assume.

I need to change this comment - this WAS how it originally worked.


};

// If there is a conflict with any of the transactions in the current batches,
// we should immediately send out the batches, so this transaction may be scheduled.

This comment was a little confusing. Can you elaborate that we are sending batches immediately because we want to start processing any blocking transactions immediately so that we can get them completed asap and unblock this high-priority transaction?

apfitzge (Author)

Batches cannot have conflicting transactions (until SIMD-0083 activation). So we track which accounts are used in the batches we're currently building. If there's a conflict with those batches, we send the batches out to the workers now. That way the transaction that was conflicting with one of the batches could be inserted into a new batch.
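A minimal sketch of that flush-on-conflict behavior, with made-up names (this is not the PR's actual scheduling code):

```rust
use std::collections::HashSet;

// Illustrative only: while building batches, a transaction that conflicts with an
// in-progress batch forces the batches to be sent immediately, so the conflicting
// transaction can then be placed into a fresh batch.
struct Tx {
    name: &'static str,
    accounts: Vec<&'static str>,
}

fn build_batches(txs: &[Tx], target_batch_size: usize) -> Vec<Vec<&'static str>> {
    let mut batches = Vec::new();
    let mut batch: Vec<&'static str> = Vec::new();
    let mut batch_accounts: HashSet<&'static str> = HashSet::new();

    for tx in txs {
        // Batches may not contain conflicting transactions (until SIMD-0083),
        // so a conflict with the in-progress batch flushes it out now.
        if !batch.is_empty() && tx.accounts.iter().any(|a| batch_accounts.contains(a)) {
            batches.push(std::mem::take(&mut batch));
            batch_accounts.clear();
        }
        batch.push(tx.name);
        batch_accounts.extend(tx.accounts.iter().copied());
        if batch.len() >= target_batch_size {
            batches.push(std::mem::take(&mut batch));
            batch_accounts.clear();
        }
    }
    if !batch.is_empty() {
        batches.push(batch);
    }
    batches
}

fn main() {
    // A1 and A2 both write account "A"; B1 and C1 touch unrelated accounts.
    let txs = [
        Tx { name: "A1", accounts: vec!["A"] },
        Tx { name: "A2", accounts: vec!["A"] },
        Tx { name: "B1", accounts: vec!["B"] },
        Tx { name: "C1", accounts: vec!["C"] },
    ];
    // Prints [["A1"], ["A2", "B1", "C1"]]: the A1/A2 conflict sends the first
    // batch out early instead of holding A2 behind it in the same batch.
    println!("{:?}", build_batches(&txs, 64));
}
```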


Yeah, I think we're on the same page here. Was just asking you to improve the code comment with my suggested elaboration (add yours too while you're at it 😉 )


@apfitzge

apfitzge commented Feb 4, 2025

I've added some commits to address review comments. I will rebase and fixup before merge, but leaving as separate commits so it is easier for you to see how the review comments were addressed.

@jstarry @tao-stones lmk if happy with the fixes and I'll squash them into the appropriate commits (makes backporting easier).

Comment on lines 197 to 201
self.send_batches(&mut batches)?;
assert_eq!(
num_scheduled, num_sent,
"number of scheduled and sent transactions must match"
);

not sure why we need this assert. Also, shouldn't num_sent be incremented here?


Yeah looks like CI just failed here

apfitzge (Author)

scheduled means we put it into a batch.
num_sent assert is simply a sanity check to make sure we actually sent everything we scheduled to the workers, and didn't just drop them somewhere above.

Yes - forgot to increment num_sent here, which caused a test to fail (it should)

Fix: 98eaa98

tao-stones
tao-stones previously approved these changes Feb 5, 2025
@tao-stones tao-stones left a comment

lgtm - thanks for explaining all the details

@apfitzge apfitzge force-pushed the greedy_fuck_scheduler branch from 98eaa98 to 18115e8 Compare February 5, 2025 15:43
@apfitzge

apfitzge commented Feb 5, 2025

^recent push squashed all the commits that came from review into the appropriate original commits

jstarry
jstarry previously approved these changes Feb 5, 2025
@apfitzge apfitzge dismissed stale reviews from jstarry and tao-stones via e861d97 February 5, 2025 23:00
@apfitzge apfitzge force-pushed the greedy_fuck_scheduler branch from 18115e8 to e861d97 Compare February 5, 2025 23:00
Labels
v2.1 Backport to v2.1 branch