Performance and Tracing update 2024-11-18 (#506)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,72 @@ | ||
--- | ||
title: Performance & Tracing Update | ||
slug: 2024-11-18-performance-and-tracing | ||
authors: mgmeier | ||
tags: [performance-tracing] | ||
hide_table_of_contents: false | ||
--- | ||
|
||
## High level summary | ||
|
||
* Benchmarking: Further Governance action / voting benchmarks on Node `10.0`. | ||
* Development: New prototype for database-backed persistence layer in our analysis tool `locli`. | ||
* Workbench: More fine-grained genesis caching; export cluster topology for Leios simulation. | ||
* Tracing: Final round of metrics alignment complete; prepared for `typed-protocols-0.3` bump; new tracing system rollout starting with Node `10.2`. | ||
|
||
|
||
## Low level overview | ||
|
||
|
||
### Benchmarking | ||
|
||
We've been working on improving the voting workload for benchmarks along two axes. Firstly, we reduce the (slight) overhead that | ||
decentralized vote submission induces. Secondly, we introduce a scaling parameter: the number of votes submitted per transaction, and hence | ||
the number of proposals to be considered simultaneously for tallying and ratification. Along the way, we improved the timing of submissions, as | ||
poor timing had caused benchmarks to abort mid-run every now and then: in those cases, a newly created UTxO entry hadn't yet settled across the cluster when it was | ||
supposed to be consumed. | ||
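
To make the scaling parameter concrete, here is a minimal Python sketch of grouping votes into transactions. The function name `batch_votes`, the parameter name `votes_per_tx` and the proposal labels are hypothetical and chosen for illustration; this is not the actual workload generator.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def batch_votes(votes: Sequence[T], votes_per_tx: int) -> Iterator[list[T]]:
    """Yield consecutive groups of at most `votes_per_tx` votes, one group per transaction."""
    if votes_per_tx < 1:
        raise ValueError("votes_per_tx must be at least 1")
    for start in range(0, len(votes), votes_per_tx):
        yield list(votes[start:start + votes_per_tx])


# Hypothetical example: 7 votes on distinct proposals, 3 votes per transaction.
votes = [f"proposal-{i}" for i in range(7)]
transactions = list(batch_votes(votes, votes_per_tx=3))
```

Raising `votes_per_tx` lowers the number of transactions needed while increasing how many proposals each one touches at once, which is the axis the scaling analysis explores.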
|
||
Scaling of the voting workload is currently under analysis. | ||
|
||
|
||
### Development | ||
|
||
Our analysis and reporting tool, `locli` ("LogObject CLI"), has a few drawbacks as far as system resource usage goes: it requires a huge | ||
amount of RAM, and initialization (i.e. loading and parsing trace output) is quite slow. Moreover, there is no intermediate, potentially | ||
exposable or queryable, representation of data besides the trace messages themselves. | ||
|
||
We're working on a prototype that introduces a database persistence layer as that intermediate representation. Not only does that open | ||
up raw benchmarking data to other means of querying or processing outside `locli`; initializing the tool from the database has also been shown | ||
to require much less RAM and to shorten the initialization phase. Furthermore, the on-disk representation is much more efficient that | ||
way - which is no small benefit when raw benchmarking data for a single run can occupy north of 64GiB. | ||
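
As a rough illustration of what a queryable intermediate representation offers, the following Python sketch loads a few trace lines into a database and aggregates them with plain SQL. The post does not say which database engine or schema the prototype uses; SQLite, the `trace` table, and the fields `at`, `host` and `kind` are all assumptions made for this example.

```python
import json
import sqlite3

# Hypothetical trace lines; real trace output has a different, richer shape.
raw_lines = [
    '{"at": "2024-11-18T10:00:00Z", "host": "node-0", "kind": "BlockForged"}',
    '{"at": "2024-11-18T10:00:01Z", "host": "node-1", "kind": "BlockAdopted"}',
    '{"at": "2024-11-18T10:00:02Z", "host": "node-2", "kind": "BlockAdopted"}',
]


def load_traces(conn: sqlite3.Connection, lines: list[str]) -> None:
    """Parse each JSON line once and persist it, so later runs can skip parsing."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS trace (at TEXT, host TEXT, kind TEXT, payload TEXT)"
    )
    rows = []
    for line in lines:
        msg = json.loads(line)
        rows.append((msg["at"], msg["host"], msg["kind"], line))
    conn.executemany("INSERT INTO trace VALUES (?, ?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # a file path would be used for real persistence
load_traces(conn, raw_lines)

# Any SQL-capable tool could now run queries like this one, independently of the loader.
counts = dict(conn.execute("SELECT kind, COUNT(*) FROM trace GROUP BY kind"))
```

The point of the sketch is the separation of concerns: parsing happens once at load time, and analysis becomes a query against stored rows rather than a pass over raw trace files.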
|
||
The prototype has yet to be fully integrated into the analysis pipeline for validation; however, initial observations are promising. | ||
|
||
|
||
### Workbench | ||
|
||
For our benchmarks, we rely on staked geneses, as the cluster needs to control all stake, and as such, block production, to yield meaningful performance | ||
metrics. As creating a staked genesis of that extent is an expensive operation, we use a caching mechanism for those. Small changes in the benchmarking | ||
profile, such as protocol version or parameters, Plutus cost models or execution budgets, would usually trigger the creation of a new cache entry. We've | ||
now factored out from cache entry resolution all those variables that do not impact staking itself. We then created a mechanism to patch those | ||
changes into genesis files after cache retrieval, when preparing them for a benchmarking run. This adds flexibility for creating profiles, and reduces the | ||
time to deploy a run to the cluster. | ||
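
The idea can be sketched in a few lines of Python: derive the cache key only from staking-relevant fields, then patch everything else in after retrieval. The field names, `STAKING_FIELDS`, `cache_key` and `patch_genesis` are hypothetical; the workbench's actual profile and genesis formats are not shown in this post.

```python
import copy
import hashlib
import json

# Assumption: which profile fields affect staking. Chosen for illustration only.
STAKING_FIELDS = ("n_pools", "n_delegators", "utxo_size")


def cache_key(profile: dict) -> str:
    """Hash only the staking-relevant fields, so other changes reuse the same entry."""
    relevant = {name: profile[name] for name in STAKING_FIELDS}
    blob = json.dumps(relevant, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()


def patch_genesis(cached_genesis: dict, profile: dict) -> dict:
    """Return a copy of the cached genesis with the non-staking fields patched in."""
    genesis = copy.deepcopy(cached_genesis)
    for name, value in profile.items():
        if name not in STAKING_FIELDS:
            genesis[name] = value
    return genesis


profile_a = {"n_pools": 52, "n_delegators": 1000000, "utxo_size": 4000000, "protocol_version": 9}
profile_b = dict(profile_a, protocol_version=10)

# Both profiles resolve to the same cache entry; only the patch step differs.
same_entry = cache_key(profile_a) == cache_key(profile_b)
patched = patch_genesis({"staked": True}, profile_b)
```

With this split, bumping the protocol version costs a cheap patch instead of a full staked-genesis rebuild.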
|
||
We also delivered a comprehensive description of our cluster to the Leios innovation team. This includes the definition of our artificially constrained | ||
topology, as well as a latency matrix for node connections in that topology, assigning a weight to all edges in the graph. The Leios team intends | ||
to use that material to implement a large-scale simulation of the algorithm, and thus gain representative timings for diffusion and propagation. | ||
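
To show what a latency-weighted topology makes possible, here is a small Python sketch that computes the lowest cumulative latency from one node to every other node using Dijkstra's algorithm. The node names and latency values are invented, and this is not the Leios team's simulation; it only illustrates one kind of question such a graph can answer.

```python
import heapq

# Hypothetical latencies in milliseconds between directly connected nodes.
latency_ms = {
    "node-0": {"node-1": 20, "node-2": 150},
    "node-1": {"node-0": 20, "node-2": 60},
    "node-2": {"node-0": 150, "node-1": 60},
}


def lowest_latency(graph: dict, source: str) -> dict:
    """Dijkstra's algorithm: lowest total edge weight from `source` to each reachable node."""
    dist = {source: 0}
    queue = [(0, source)]
    while queue:
        d, node = heapq.heappop(queue)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry; a shorter path was already found
        for neighbour, weight in graph[node].items():
            candidate = d + weight
            if candidate < dist.get(neighbour, float("inf")):
                dist[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour))
    return dist


reach = lowest_latency(latency_ms, "node-0")
```

In this toy graph the relayed path via `node-1` reaches `node-2` faster than the direct edge, which is the kind of effect a constrained topology is meant to surface.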
|
||
|
||
### Tracing | ||
|
||
The alignment of metric names between the legacy and the new tracing system is now complete - which should minimize the effort of migrating existing dashboards for the community. The only differences that remain are motivated by increasing compliance with existing standards such as OpenMetrics. Furthermore, a few metrics still | ||
missing in the new system have now been ported over, such as `node.start.time` or `served.block.latest`. | ||
|
||
We're all set for the expected bump to `typed-protocols-0.3`: both forwarder packages `trace-forward` and `ekg-forward` for the new tracing | ||
system have been adapted to the new API and are passing all tests. | ||
|
||
Last but not least, we've settled on a rollout plan for the new tracing system. The new system will be set as the **default** with the upcoming Node | ||
release `10.2`. This is achieved by a change of configuration only - there is no need for different Node builds. The `cardano-node` binary | ||
will contain both tracing systems for a considerable grace period: 3 to 6 months after release. This should give the community ample time to | ||
adjust to necessary changes in downstream services or dashboards that consume trace or metrics output. | ||
|
||
We'll provide a comprehensive hands-on migration guide summarizing those changes for the user. |