Skip to content
This repository has been archived by the owner on Jan 13, 2025. It is now read-only.

Introduce primitive threading in unified scheduler #34676

Merged
merged 14 commits into from
Jan 24, 2024

Conversation

ryoqun
Copy link
Contributor

@ryoqun ryoqun commented Jan 6, 2024

(no need to start to review this at weekend).

Problem

There's no threading support in the unified scheduler, which will exhibit an extremely minimized overhead for unbatched processing.

Summary of Changes

Add minimum threading model and wire it, which still lacks suspending (and resuming) threads and aborting them on transaction error in dead blocks. Also, current impl is still full of .unwrap()s.

also, note that the actual scheduling code will come in later prs. after that, the code in this pr is ready for multi threaded with proper scheduling code. yet, it's still single threaded at the time of this pr. all this breakdowns are done for bite-sized pr reviewing.

Also, patching crossbeam is planned but yet not included in this pr as well.

Context

extracted from: #33070

@ryoqun ryoqun requested a review from apfitzge January 6, 2024 14:32
@@ -31,6 +35,7 @@ use {
atomic::{AtomicU64, Ordering::Relaxed},
Arc, Mutex, Weak,
},
thread::{self, JoinHandle},
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi, thread. welcome to the unified scheduler land. lol

// minimum at the cost of a single heap allocation per switching for the sake of Box-ing the Self
// type to avoid infinite mem::size_of() due to the recursive type structure. Needless to say, such
// an allocation can be amortized to be negligible.
enum ChainedChannel<P1, P2> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

originally, i used condvar, but later i noticed this trick and never looked back. :) hope i explained it well in the above comment...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Appreciate the detailed comment here; I think the chained channel idea makes sense. My only real gripe is the visibility to internals within this code, the chained channel is used for synchronization but could easily be screwed up by a change in the code of the handler thread.

It seems possible we can isolate this type and offer an interface for recving that will internally mutate the receiving channel; which will guarantee each receiver can only receive it once (which is currently depending on the handler to overwrite its' receiver).

(pseudo-code with removed generics)

struct ChainedChannelReceiver<...> {
    receiver: Receiver<...>,
    context: Context,
}

impl ChainedChannelReceiver {
    /// Blocks to receive the next Payload.
    /// If a context-switch is recveived, the context and channel are mutated,
    /// and the next payload message will be blocked on.
    pub fn recv(&mut self) -> Result<Payload> {
        loop {
            match self.receiver.recv()? {
                Payload(p) => return Ok(p),
                PayloadAndChannel((p, new_channel)) => {
                    self.receiver = new_channel;
                    self.context = p;
                }
            }
        }
    }
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure my suggestion works if the handler threads will receive from other channels that would actually require a select! though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice suggestion :) 9b5e338

Not sure my suggestion works if the handler threads will receive from other channels that would actually require a select! though.

this is good point as well. i did my best in the commit.

@@ -577,8 +966,10 @@ mod tests {

#[derive(Debug)]
struct AsyncScheduler<const TRIGGER_RACE_CONDITION: bool>(
PooledScheduler<DefaultTaskHandler>,
Copy link
Contributor Author

@ryoqun ryoqun Jan 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now stop using the impl by wrapping as the impl is getting mature, meaning it's harder to use for this race condition test

}

#[derive(Debug)]
pub struct PooledSchedulerInner<S: SpawnableScheduler<TH>, TH: TaskHandler> {
id: SchedulerId,
thread_manager: ThreadManager<S, TH>,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note that ThreadManager is planned to be wrapped in Arc<Mutex<_>> soon. so, i didn't flatten it into PooledSchedulerInner.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well well well, I ended up not doing so with different impl.

@@ -295,7 +677,7 @@ impl<TH: TaskHandler> InstalledScheduler for PooledScheduler<TH> {
}

fn pause_for_recent_blockhash(&mut self) {
// not surprisingly, there's nothing to do for this min impl!
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:byebye:

move || loop {
let mut is_finished = false;
while !is_finished {
select! {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

select!'s directive order is logically off intentionally for upcoming use of select_biased!..

Comment on lines 411 to 469
// High-level flow of new tasks:
// 1. the replay stage thread send a new task.
// 2. the scheduler thread accepts the task.
// 3. the scheduler thread dispatches the task after proper locking.
// 4. the handler thread processes the dispatched task.
// 5. the handler thread reply back to the scheduler thread as an executed task.
// 6. the scheduler thread post-processes the executed task.
// 7. the scheduler thread send the executed task to the accumulator thread.
// 8. the accumulator thread examines the executed task's result and accumulate its timing,
// finally dropping the transaction inside the executed task.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

across the new code, i used the term task instead of transaction for envisioning bundles as well along with transactions under the generic struct of Task.....

);
}

fn start_threads(&mut self, context: &SchedulingContext) {
Copy link
Contributor Author

@ryoqun ryoqun Jan 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

closures in this method is pretty full of unwraps on crossbeam channels. most of them will be addressed in later prs.

that's said, these unwraps should be acceptable even for production. That's because once scheduler is spawn, it never stops any of the backing threads in this pr.

Comment on lines 316 to 384
scheduler_thread: Option<JoinHandle<()>>,
handler_threads: Vec<JoinHandle<()>>,
accumulator_thread: Option<JoinHandle<()>>,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note that these join handles aren't used properly yet in this pr.

Copy link

codecov bot commented Jan 6, 2024

Codecov Report

Attention: 18 lines in your changes are missing coverage. Please review.

Comparison is base (e2c2029) 81.7% compared to head (afd580e) 81.7%.
Report is 22 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff            @@
##           master   #34676    +/-   ##
========================================
  Coverage    81.7%    81.7%            
========================================
  Files         825      825            
  Lines      223261   223447   +186     
========================================
+ Hits       182605   182778   +173     
- Misses      40656    40669    +13     

Comment on lines 418 to 468
// 7. the scheduler thread send the executed task to the accumulator thread.
// 8. the accumulator thread examines the executed task's result and accumulate its timing,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is accumulation expensive? Why not just do accumulation in the scheduling thread?

Copy link
Contributor Author

@ryoqun ryoqun Jan 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, it is expensive. any offload-able work should be put elsewhere from the scheduling thread: beae861

like #34676 (comment), i actually observed this.

}
};

// This thread is needed because .accumualte() and drop::<SanitizedTransaction>() both are
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is heavy in SanitizedTransaction::drop, it should just be free-ing some heap allocations, right?

Copy link
Contributor Author

@ryoqun ryoqun Jan 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, just free-ing some heap allocations is relatively heavy for the unified scheduler. Also, SanitizedTransaction is rather complex struct: beae861 i documented how it's severe in that long monologue in the commit

I observed some visible cpu usage just from SanitizedTransaction::drop when casual profiling.

let finished_task_sender = finished_task_sender.clone();

move || loop {
let (task, sender) = select! {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's no selection going on here, since there's only one channel receiving?

Copy link
Contributor Author

@ryoqun ryoqun Jan 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, nice question. select! here is intentional because the handler will soon use prioritized queue, which will be realized just by using two channels together: 1. low priority tasks (not conflicting with others), 2. high priority tasks (conflicting with others).

so, I'll leave this rather odd select! as-is in this pr.

Copy link
Contributor Author

@ryoqun ryoqun Apr 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the handler will soon use prioritized queue, ....

soon :trollface: anza-xyz#627 (comment) (EDIT: anza-xyz#627 (comment))

// minimum at the cost of a single heap allocation per switching for the sake of Box-ing the Self
// type to avoid infinite mem::size_of() due to the recursive type structure. Needless to say, such
// an allocation can be amortized to be negligible.
enum ChainedChannel<P1, P2> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Appreciate the detailed comment here; I think the chained channel idea makes sense. My only real gripe is the visibility to internals within this code, the chained channel is used for synchronization but could easily be screwed up by a change in the code of the handler thread.

It seems possible we can isolate this type and offer an interface for recving that will internally mutate the receiving channel; which will guarantee each receiver can only receive it once (which is currently depending on the handler to overwrite its' receiver).

(pseudo-code with removed generics)

struct ChainedChannelReceiver<...> {
    receiver: Receiver<...>,
    context: Context,
}

impl ChainedChannelReceiver {
    /// Blocks to receive the next Payload.
    /// If a context-switch is recveived, the context and channel are mutated,
    /// and the next payload message will be blocked on.
    pub fn recv(&mut self) -> Result<Payload> {
        loop {
            match self.receiver.recv()? {
                Payload(p) => return Ok(p),
                PayloadAndChannel((p, new_channel)) => {
                    self.receiver = new_channel;
                    self.context = p;
                }
            }
        }
    }
}

// minimum at the cost of a single heap allocation per switching for the sake of Box-ing the Self
// type to avoid infinite mem::size_of() due to the recursive type structure. Needless to say, such
// an allocation can be amortized to be negligible.
enum ChainedChannel<P1, P2> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure my suggestion works if the handler threads will receive from other channels that would actually require a select! though.

@apfitzge apfitzge self-requested a review January 8, 2024 20:54
Comment on lines 953 to 954
// and unified_scheduler both tend to process non-conflicting batches in parallel as normal
// operation.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wording: as part of normal functioning

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done: d781743

@@ -194,34 +199,422 @@ impl TaskHandler for DefaultTaskHandler {
}
}

pub struct ExecutedTask {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove pub

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done: a2eb73e

@ryoqun ryoqun force-pushed the unified-scheduler-min-threading branch from 36632b4 to 9b5e338 Compare January 9, 2024 15:58
@ryoqun
Copy link
Contributor Author

ryoqun commented Jan 9, 2024

@apfitzge thanks for timely review! i think I've addressed all your review comments.

Comment on lines 516 to 519
// That's because it's the most notable bottleneck of throughput. Unified scheduler's
// overall throughput is largely dependant on its ultra-low latency characteristic,
// which is the most important design goal of the scheduler in order to reduce the
// transaction confirmation latency for end users.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be the bottleneck of just scheduling, but the main bottleneck of replay will still be the processing of transactions (load, execute, commit), right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The scheduler's ultra-sensitivity to latency seems to be a self-imposed design choice, and I am trying to keep us from introducing additional threads if we can avoid it. We already have a huge over-use of threads in the validator; so trying to avoid introduction of new ones where possible, unless it has demonstrated significant improvement.

I think the sensitivity is most important when all or nearly all transactions conflict. Let's just assume there's a single handler and all transactions conflict for simplicity. A simple sequence diagram:

sequenceDiagram
    Scheduler->>+Handler: Task(Tx0)
    Handler->>+Scheduler: ExecutedTask(Tx0)
    Scheduler->>+Handler: Task(Tx1)
Loading

Correct me if I'm wrong, but it seems the sensitiviy is there because between the time of ExecutedTask(Tx0) and Task(Tx1), the handler thread is idle.
Even with the additional thread which does the accumulation and dropping, there is still going to be some amount of idle time here.

Just as a note, I also ran into this in the block-production scheduler in core, but opted to allow the queueing of txs for specific handlers. Even trying to keep just 2 txs always enqueued gives significant leeway in the latency, since as long as the scheduler can queue another in the time it takes to process a transaction there is nearly no idling time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be the bottleneck of just scheduling, but the main bottleneck of replay will still be the processing of transactions (load, execute, commit), right?

Currently, yes. given ~100 handler threads, scheduler thread can be the bottleneck. and I'm developing this scheduler with that future in mind.

Correct me if I'm wrong, but it seems the sensitiviy is there because between the time of ExecutedTask(Tx0) and Task(Tx1), the handler thread is idle.
Even with the additional thread which does the accumulation and dropping, there is still going to be some amount of idle time here.

yeah. these are correct.

Just as a note, I also ran into this in the block-production scheduler in core, but opted to allow the queueing of txs for specific handlers. Even trying to keep just 2 txs always enqueued gives significant leeway in the latency, since as long as the scheduler can queue another in the time it takes to process a transaction there is nearly no idling time.

hmm, this seems odd. busy looping at the other hand of mpmc channels usually gives round-trip less-than-500ns latency, which i'm planning to employ for unified scheduler. i think those threads were yielding to the kernel (ie parking)?

as you pointed out, i can't eliminate the idle time. but then, 1us (500ns for channel + 500ns for scheduling) per a single conflicting tx is acceptable latency.

actually, i wanted to show this as some bench, but i'm running out of time.. i spend too much on the other concern... lol #34676 (comment)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think those threads were yielding to the kernel (ie parking)?

possible.

as you pointed out, i can't eliminate the idle time. but then, 1us (500ns for channel + 500ns for scheduling) per a single conflicting tx is acceptable latency.

what's the added latency from dropping the txs? or in other words, what would be the affect on overall throughput if we did not have the separate "drop" thread? My main concern here was just that we're spawning this new thread as part of the scheduler, and if that's necessary we should be able to back that decision up with some numbers on how much it helps (or how much we are hurt by not doing that).

as you pointed out, i can't eliminate the idle time.

yeah I think without allowing conflicts to actually queue up, this cannot be resolved entirely, just mitigated.

Copy link
Contributor Author

@ryoqun ryoqun Jan 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for detailing your concern.

i've finally wrote benches: 6120684 (won't push to this pr)....:

test bench_tx_throughput_drop_in_scheduler ... bench:   8,556,627 ns/iter (+/- 3,168,724)
test bench_tx_throughput_drop_in_accumulator ... bench:   7,334,701 ns/iter (+/- 2,600,740)
test bench_tx_throughput_drop_in_scheduler_conflicting ... bench:  11,403,197 ns/iter (+/- 3,062,017)
test bench_tx_throughput_drop_in_accumulator_conflicting ... bench:  10,794,434 ns/iter (+/- 6,808,692)

so, processing 10,000 txes takes always longer when we drop txes in the scheduler thread, regardless conflicting or not.

note that you have to ensure to rebuild when repoto these results.. cargo usually skips rebuild with feature unification... (i.e use cargo clean --release -p solana-unified-scheduler-pool).

EDIT: oh, i forget to execute_timings.accumulate() in the scheduler thread. so, these numbers could go worse for dropping in the scheduler thread..

Copy link
Contributor

@apfitzge apfitzge Jan 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The raw numbers looks like dropping in scheduler takes longer, which isn't surprising since its doing more work; but they are well within the statistical noise.

bench_tx_throughput_drop_in_scheduler_conflicting raw shows 609us higher than bench_tx_throughput_drop_in_accumulator_conflicting, but the error bounds are 3ms and 6ms.

This seems to imply that there's not a statistically significant effect from having the separate accumulator thread. Am I missing something in these results?

edit: I think we are not really seeing eye-to-eye on this additional thread, and might just be we have different perspectives. Feel free to pull another core contributor in for their opinion on this aspect of the PR. I don't recall seeing anything else that would block this PR for me, but will take another look once the thread-issue is settled.

Copy link
Contributor Author

@ryoqun ryoqun Jan 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I appreciate the feedback. Seems casual benchmark wasn't enough.

I reverted the accumulator thread (2f592f5) for now to land this pr.

Now that I've done somewhat extended synthesized bench-marking on this pr. That said, I still think the accumulator thread performs better in the end.

However, because this pr doesn't contain any actual scheduling impl, i concluded that presenting some convincing numbers is rather difficult. ;) So, I'll try to ship the scheduling impl first, instead of finishing up the scheduler pool (was original plan). Even later, I'll try to land the accumulator thread along with other missing things in the scheduler pool with more proper justification / more realistic benchmark numbers.

In other words, landing the accumulator thread wasn't essential for now. So, I was too early to ship it.

Am I missing something in these results?

yea. blame me for not explaining enough. didn't expect the accumulator thread to be so hot...

  1. the accumulator thread will contain other offloaded tasks.

Namely, it's used for dumping all detailed timings for individual transactions to influxdb at background fashion (the code is in the base pr). Also, I'm planning to move rpc related blockstore writes here, as this isn't essential for bank hash. These aren't happening in this pr, so the accumulator thread itself was just ahead of it.

  1. dropping transactions can be really heavy

in the intention of a quick sanity check bench, i wasn't bothered to do so. but transactions can be really large with many instructions with each having instruction data blobs, pubkeys...

when blocks contain these transactions (it does usually), it clearly throttles the scheduler thread when dropped inside that.

  1. medians inside overlapping deviation range don't necessarily mean statistical noise.

firstly, these multi-threaded benches tend to contain high variance in runtime duration. so, i did little attention to the deviations. I just shared the numbers after some casual multiple runs, thinking these numbers are relatively stable enough.

maybe, seeing is believing. :) here's comparable criterion bench results, which i wrote with different parameters/settings:

image

image

note that I think these periodic outliers are due to cpu freq scaling. Sans these outlifers, I think the primary convex of pdf is shifted in favor of dropping in accumlator, clearly showing statistically significant improvement. at the same time, these median differencies are close to each other within the deviations (=~ error bounds). Also, this improvement can visibly be confirmed by plotted individual samples.

I think we are not really seeing eye-to-eye on this additional thread, and might just be we have different perspectives.

i think we have a constructive discussion. thanks for patience and expressing your perspective.

Feel free to pull another core contributor in for their opinion on this aspect of the PR.

thanks for suggestion. yeah, we could seek other's opinions, should we be stuck in other topics.

I don't recall seeing anything else that would block this PR for me, but will take another look once the thread-issue is settled.

that's good to hear. :)

unified-scheduler-pool/src/lib.rs Show resolved Hide resolved
}

// P doesn't need to be `: Clone`, yet rustc derive can't handle it.
// see https://github.com/rust-lang/rust/issues/26925
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ryoqun
Copy link
Contributor Author

ryoqun commented Jan 12, 2024

@apfitzge thanks for in-depth code review! it took time at my side, but i think it's ready for another review round from you.

also, i clarified comments a bit from the prev review round: d6093e1

thanks as always.

if possible, i want to land this pr soon before the feature freeze: https://discord.com/channels/428295358100013066/910937142182682656/1194783158676230268

@ryoqun ryoqun force-pushed the unified-scheduler-min-threading branch 2 times, most recently from 305c2c5 to 5b62918 Compare January 18, 2024 07:06
@ryoqun ryoqun force-pushed the unified-scheduler-min-threading branch from 5b62918 to cc5a1f0 Compare January 18, 2024 07:15
@ryoqun
Copy link
Contributor Author

ryoqun commented Jan 18, 2024

@apfitzge thanks for review! I've replied to you: #34676 (comment)

also, i pushed new commit: cc5a1f0 and rebase was needed to pick 0e8f2de

apfitzge
apfitzge previously approved these changes Jan 18, 2024
Copy link
Contributor

@apfitzge apfitzge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a couple of minor comments, feel free to just do them as follow-up PRs.
Appreciate all the back-and-forth on the PR!

}

// A very tiny generic message type to signal about opening and closing of subchannels, which are
// logically segmented series of Payloads (P1) over a signle continuous time-span, potentially
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// logically segmented series of Payloads (P1) over a signle continuous time-span, potentially
// logically segmented series of Payloads (P1) over a single continuous time-span, potentially

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops, nice!: 07de5b4

}
}

impl<const TRIGGER_RACE_CONDITION: bool> InstalledScheduler
for AsyncScheduler<TRIGGER_RACE_CONDITION>
{
fn id(&self) -> SchedulerId {
self.0.id()
todo!();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably better to use unimplemented!(), I don't believe you are intending to do this, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you're right: afd580e

// cycles out of the scheduler thread. Thus, any kinds of unessential overhead sources
// like syscalls, VDSO, and even memory (de)allocation should be avoided at all costs
// by design or by means of offloading at the last resort.
move || loop {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no exit from this loop aside from crashing as far as I can tell.
If replay stage is shutting down, channels get disconnected on the send side and we'd hit Errs in the unwrap - instead of the scheduler loop cleaning shutting down.

Feel free to address in a follow-up PR, as I think you mentioned you were planning already.

Copy link
Contributor Author

@ryoqun ryoqun Jan 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

your analysis is right.

Feel free to address in a follow-up PR, as I think you mentioned you were planning already.

yeah. I'll do. but sadly, solana-validator/solana-ledger-tool's process termination is so broken in multiple ways. so, the cascading panics won't happen actually. that's why ci is green, maybe.

namely, there's some cyclic link which prevents bank_forks from being dropped. i have a fix in #33070. note that this bug quite low-priority, imo. there's more deeper problems: solana-validator's exit subcommand and solana-ledger-tool themselvs aren't trying to do sane shutdown cleaning up.

so, I won't dig in this tempting rabbit hole before i ship the scheduler code at least.

@ryoqun
Copy link
Contributor Author

ryoqun commented Jan 19, 2024

Just a couple of minor comments, feel free to just do them as follow-up PRs. Appreciate all the back-and-forth on the PR!

thanks for timely review as always. according to the discord talk, this pr is blocked (https://discord.com/channels/428295358100013066/910937142182682656/1197716176764153966) and i just addressed nits meanwhile and re-requested another code review.

@apfitzge
Copy link
Contributor

Just a couple of minor comments, feel free to just do them as follow-up PRs. Appreciate all the back-and-forth on the PR!

thanks for timely review as always. according to the discord talk, this pr is blocked (discord.com/channels/428295358100013066/910937142182682656/1197716176764153966) and i just addressed nits meanwhile and re-requested another code review.

Sorry my slowness made us miss the cutoff!

@ryoqun
Copy link
Contributor Author

ryoqun commented Jan 23, 2024

Sorry my slowness made us miss the cutoff!

well, i don't think so. as i said above, your reviews were very timely. missing the cutoff was mainly due to my hacking and benching, resulting in extended back-and-forth..

seems v1.18 branching is imminent (https://discord.com/channels/428295358100013066/910937142182682656/1199023180598218803), so i just humbly request another approval other then a normal comment. :)

Copy link
Contributor

@apfitzge apfitzge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@ryoqun ryoqun merged commit bd10386 into solana-labs:master Jan 24, 2024
46 checks passed
Comment on lines +487 to +524
// Now, this is the main loop for the scheduler thread, which is a special beast.
//
// That's because it could be the most notable bottleneck of throughput in the future
// when there are ~100 handler threads. Unified scheduler's overall throughput is
// largely dependant on its ultra-low latency characteristic, which is the most
// important design goal of the scheduler in order to reduce the transaction
// confirmation latency for end users.
//
// Firstly, the scheduler thread must handle incoming messages from thread(s) owned by
// the replay stage or the banking stage. It also must handle incoming messages from
// the multi-threaded handlers. This heavily-multi-threaded whole processing load must
// be coped just with the single-threaded scheduler, to attain ideal cpu cache
// friendliness and main memory bandwidth saturation with its shared-nothing
// single-threaded account locking implementation. In other words, the per-task
// processing efficiency of the main loop codifies the upper bound of horizontal
// scalability of the unified scheduler.
//
// Moreover, the scheduler is designed to handle tasks without batching at all in the
// pursuit of saturating all of the handler threads with maximally-fine-grained
// concurrency density for throughput as the second design goal. This design goal
// relies on the assumption that there's no considerable penalty arising from the
// unbatched manner of processing.
//
// Note that this assumption isn't true as of writing. The current code path
// underneath execute_batch() isn't optimized for unified scheduler's load pattern (ie.
// batches just with a single transaction) at all. This will be addressed in the
// future.
//
// These two key elements of the design philosophy lead to the rather unforgiving
// implementation burden: Degraded performance would acutely manifest from an even tiny
// amount of individual cpu-bound processing delay in the scheduler thread, like when
// dispatching the next conflicting task after receiving the previous finished one from
// the handler.
//
// Thus, it's fatal for unified scheduler's advertised superiority to squeeze every cpu
// cycles out of the scheduler thread. Thus, any kinds of unessential overhead sources
// like syscalls, VDSO, and even memory (de)allocation should be avoided at all costs
// by design or by means of offloading at the last resort.
Copy link
Contributor Author

@ryoqun ryoqun Mar 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here context: anza-xyz#129 (comment)

result_with_timings: Mutex::new((Ok(()), ExecuteTimings::default())),
}
fn from_inner(mut inner: Self::Inner, context: SchedulingContext) -> Self {
inner.thread_manager.start_session(&context);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this

// we're hard-coding the number of handler thread to 1, meaning this impl is currently
// single-threaded still.
let handler_count = 1;

Self::from_inner(
Copy link
Contributor Author

@ryoqun ryoqun Jun 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this from_inner() is especially problematic. (back ref: anza-xyz#1815)

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants