Track transaction performance through various stage using random mask #34789
Conversation
Force-pushed from baed896 to 45efd53
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@            Coverage Diff            @@
##           master   #34789     +/-   ##
=========================================
- Coverage    81.8%    81.7%    -0.1%
=========================================
  Files         834      835       +1
  Lines      224815   224966     +151
=========================================
+ Hits       183919   184006      +87
- Misses      40896    40960      +64
Force-pushed from 4d4ba9a to 0991665
can you run some benchmarks with and without this tracking. it seems extremely likely to add detrimental overhead
@@ -206,6 +206,26 @@ impl Consumer {
        .slot_metrics_tracker
        .increment_retryable_packets_count(retryable_transaction_indexes.len() as u64);

    // Now we track the performance for the interested transactions which is not in the retryable_transaction_indexes
    // We assume the retryable_transaction_indexes is already sorted.
    for (index, packet) in packets_to_process.iter().enumerate() {
is there somewhere that we can put this where we aren't adding an iteration?
I do not see an easy way to do that -- I would like to avoid smudging SanitizedTransaction. But I think I can make this more efficient -- I do not need the binary search on the retryable indexes; I can do it in one simple loop.
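For reference, a minimal self-contained sketch of that "one simple loop" idea: walk the sorted retryable index list with a cursor instead of binary-searching per packet. `Packet` and its `is_tracked` field are simplified stand-ins for the real banking-stage types and the sampled-for-tracking bit this PR adds, not the PR's actual code.

```rust
// Sketch: skip retryable packets with a single forward pass over a sorted index list.
struct Packet {
    is_tracked: bool, // stand-in for the sampled-for-tracking bit in the packet Meta
}

fn report_tracked(packets: &[Packet], retryable_indexes: &[usize]) {
    // `retryable_indexes` is assumed sorted ascending, as the PR comment states.
    let mut retryable = retryable_indexes.iter().peekable();
    for (index, packet) in packets.iter().enumerate() {
        // Advance the cursor past indexes we have already walked over.
        while retryable.next_if(|&&r| r < index).is_some() {}
        if retryable.peek() == Some(&&index) {
            continue; // packet will be retried; do not report it as processed
        }
        if packet.is_tracked {
            // report banking-stage processing time for this sampled packet
            println!("tracked packet at index {index} finished processing");
        }
    }
}

fn main() {
    let packets = vec![
        Packet { is_tracked: true },
        Packet { is_tracked: false },
        Packet { is_tracked: true },
    ];
    report_tracked(&packets, &[1]); // index 1 is retryable, so only 0 and 2 report
}
```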
are the actual iterations not like one stack frame deeper in most cases?
This is done many levels down using SanitizedTransaction, which does not have information about the start_time.
This probably needs to be done elsewhere; this is never called with the new scheduler (for non-votes), so we'd never get any metrics if that's enabled. By the time this change gets in, it will be the default.
@apfitzge what is the new scheduler's name?
Talked to @apfitzge offline; will address both issues raised by Trent and Andrew.
I do not see any material differences with/without the change using bench-tps:

Rpc-client with fix:
lijun_solana_com@lijun-dev8:~/solana$ ./cargo run --release --bin solana-bench-tps -- -u http://35.233.177.221:8899 --identity ~/.config/solana/id.json --tx_count 1000 --thread-batch-sleep-ms 0 -t 20 --duration 60 -n 35.233.177.221:8001 --read-client-keys ~/gce-keypairs.yaml --use-rpc-client
[2024-01-30T09:42:05.249223482Z INFO solana_bench_tps::bench] Average TPS: 7171.7437
[2024-01-30T09:43:53.539368674Z INFO solana_bench_tps::bench] Average TPS: 8209.673
[2024-01-30T18:05:42.712776385Z INFO solana_bench_tps::bench] Average TPS: 6204.4365
[2024-01-30T18:30:48.060532674Z INFO solana_bench_tps::bench] Average TPS: 4780.393
[2024-01-31T00:34:06.877848441Z INFO solana_bench_tps::bench] Average TPS: 6120.311

Master without change:
[2024-01-31T02:12:00.852692886Z INFO solana_bench_tps::bench] Average TPS: 8523.614
[2024-01-31T02:21:19.220559104Z INFO solana_bench_tps::bench] Average TPS: 5904.669
[2024-01-31T19:02:38.746203621Z INFO solana_bench_tps::bench] Average TPS: 4400.843
[2024-01-31T19:04:17.896044976Z INFO solana_bench_tps::bench] Average TPS: 6338.43
[2024-01-31T19:06:34.395401402Z INFO solana_bench_tps::bench] Average TPS: 6257.952
Force-pushed from 148feaf to 2207cb2
i don't think benchtps is how we want to be testing this. it's too far away and is biased by everything after the scheduler, which we don't care about. i think we should be using the benchmarks in perf, sigverify and banking instead. i'm especially concerned about whether we're negatively impacting our capacity in front of sigverify load shedding
Force-pushed from 2207cb2 to a8868ab
debug!(
    "Banking stage processing took {duration:?} for transaction {:?}",
    packet.transaction().get_signatures().first()
);
This is really misleading. It didn't take that amount of time to process this transaction. It took that time to process the batch of transactions.
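For illustration, a sketch (std types only) of the attribution the comment is asking for: time the batch and say so in the log, rather than implying the duration belongs to one transaction. The surrounding batch-execution call is a placeholder, not the PR's exact code.

```rust
use std::time::Instant;

// Minimal sketch: the duration covers the whole batch, so the log says "batch".
fn process_batch(batch: &[&str]) {
    let start = Instant::now();
    for _tx in batch {
        // the real batch execution/commit call would run here
    }
    let duration = start.elapsed();
    println!(
        "Banking stage took {duration:?} for a batch of {} transactions (first signature: {:?})",
        batch.len(),
        batch.first()
    );
}

fn main() {
    process_batch(&["sig1", "sig2", "sig3"]);
}
```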
@@ -244,6 +244,9 @@ pub(crate) struct ProcessPacketsTimings {
    // Time spent running the cost model in processing transactions before executing
    // transactions
    pub cost_model_us: u64,

    // banking stage processing time histogram for sampled packets
    pub process_sampled_packets_us_hist: histogram::Histogram,
This will almost certainly have between 0-2 counts per block on mnb, meaning it will probably be so noisy as to be useless.
I think we care much more about the time from sigverify to the banking stage picking up the packet from the channel, i.e. "time-to-scheduler"
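A self-contained sketch of that "time-to-scheduler" measurement, using std channels: sigverify stamps a sampled packet when it sends it, and the receiving side measures the gap on pickup. The real code would use the packet Meta flag added in this PR and crossbeam channels; everything named here is a simplified stand-in.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

struct StampedPacket {
    payload: Vec<u8>,
    sent_at: Instant, // set by "sigverify" for sampled packets only
}

fn main() {
    let (tx, rx) = mpsc::channel::<StampedPacket>();

    // "sigverify" side: stamp, then send after a simulated backlog.
    thread::spawn(move || {
        let packet = StampedPacket { payload: vec![0u8; 32], sent_at: Instant::now() };
        thread::sleep(Duration::from_millis(5));
        tx.send(packet).unwrap();
    });

    // "scheduler" side: measure how long the packet sat between stamp and pickup.
    let packet = rx.recv().unwrap();
    let time_to_scheduler = packet.sent_at.elapsed();
    println!("time-to-scheduler: {time_to_scheduler:?} for {} bytes", packet.payload.len());
}
```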
The goal of the sampling mechanism was to sample the system without tracking everything. Even a couple of data points per slot, collected over a long time, can still provide insight into where time is spent across the various stages. Time-to-scheduler is a good stat to have; I will defer it to future PRs to keep this change set small.
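To make that concrete, a sketch of how a few sampled points per slot accumulate into a useful distribution via the new histogram field. It assumes the `histogram` crate's 0.6.x API (the crate the `process_sampled_packets_us_hist` field above is declared with); the sample values are made up.

```rust
use histogram::Histogram;

fn main() {
    let mut process_sampled_packets_us_hist = Histogram::new();

    // Pretend these are the few sampled packets observed across many slots.
    for sampled_us in [850u64, 1_200, 930, 15_000, 1_100] {
        let _ = process_sampled_packets_us_hist.increment(sampled_us);
    }

    // Report the distribution rather than the individual (noisy) points.
    println!(
        "sampled banking-stage time: p50={:?}us p90={:?}us n={}",
        process_sampled_packets_us_hist.percentile(50.0),
        process_sampled_packets_us_hist.percentile(90.0),
        process_sampled_packets_us_hist.entries()
    );
}
```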
trying txn mask matching output txn to figure out why txn is not exactly matched
Use 62 and 61 portion
track fetch performance using random txn mask
track sigverify performance using random txn mask
track banking stage performance using random txn mask
adding missing cargo lock file
add debug messages
Revert "add debug messages" (This reverts commit 96aead5.)
fixed some clippy issues
check-crate issue
Fix a clippy issue
Fix a clippy issue
debug why txns in banking stage shows fewer performance tracking points
debug why txns in banking stage shows fewer performance tracking points
debug why txns in banking stage shows fewer performance tracking points
debug why txns in banking stage shows fewer performance tracking points
get higher PPS for testing purpose
more debug messages on why txn is skipped
display if tracer packet in log
add debug before calling processing_function
debug at the initial of banking stage
track if a txn is forwarded
dependency order
missing cargo file
clean up debug messages
Do not use TRACER_PACKET, use its own bit
rename some functions
addressed some comments from Trent
Update core/src/banking_stage/immutable_deserialized_packet.rs (Co-authored-by: Trent Nelson <[email protected]>)
addressed some comments from Trent
Do not use binary_search, do simple compare in one loop
Force-pushed from 697e141 to d566aa0
This repository is no longer in use. Please re-open this pull request in the agave repo: https://github.com/anza-xyz/agave
Problem
Enable the system to track transaction processing performance through the various stages, based on probability (sampling).
Summary of Changes
Based on the design in https://docs.google.com/document/d/1ig1rC0dk-ddi33JIqG9EZ4ZSq9xAJpT9fQTPaZFi_vw/edit.
We use a randomly generated 12-bit integer as a mask to match against the transaction's signature. If it matches, we mark the packet for performance tracking in the packet Meta's flags; downstream stages check that flag cheaply instead of repeating the mask comparison. For these matched packets we report processing time for the fetch, sigverify, and banking stages.
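A minimal, self-contained sketch of that matching scheme follows. The exact signature bytes read and the Meta flag name are placeholders, not the PR's actual choices; it uses the `rand` crate for the random mask.

```rust
use rand::Rng;

const MASK_BITS: u32 = 12;

// Draw the process-wide random 12-bit tracking mask once.
fn random_track_mask() -> u16 {
    rand::thread_rng().gen_range(0..(1u16 << MASK_BITS))
}

// Extract 12 bits from the tail of a 64-byte signature (placeholder choice of bytes).
fn signature_bits(signature: &[u8; 64]) -> u16 {
    let lo = signature[63] as u16;
    let hi = (signature[62] as u16) << 8;
    (hi | lo) & ((1u16 << MASK_BITS) - 1)
}

fn main() {
    let track_mask = random_track_mask();
    let mut signature = [0u8; 64];
    rand::thread_rng().fill(&mut signature[..]);

    // With 12 bits, roughly 1 in 4096 transactions is sampled. The resulting bit
    // would be stored in the packet Meta so later stages can check it cheaply.
    let perf_track = signature_bits(&signature) == track_mask;
    println!("mask={track_mask:#05x} sampled={perf_track}");
}
```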
Fixes #