Performance and Tracing update 2024-11-18 (#506)

IntersectMBO · Nov 19, 2024 · 78fa21b · 78fa21b
1 parent ba87793
commit 78fa21b
Showing 1 changed file with 72 additions and 0 deletions.
diff --git a/blog/2024-11-18-performance-and-tracing.md b/blog/2024-11-18-performance-and-tracing.md
@@ -0,0 +1,72 @@
+---
+title: Performance & Tracing Update
+slug: 2024-11-18-performance-and-tracing
+authors: mgmeier
+tags: [performance-tracing]
+hide_table_of_contents: false
+---
+
+## High level summary
+
+* Benchmarking: Further Governance action / voting benchmarks on Node `10.0`.
+* Development: New protoype for database-backed persistence layer in our analysis tool `locli`.
+* Workbench: More fine-grained genesis caching; export cluster topology for Leios simulation.
+* Tracing: Final round of metrics alignment complete; prepared for `typed-protocols-0.3` bump; new tracing system rollout starting with Node `10.2`.
+
+
+## Low level overview
+
+
+### Benchmarking
+
+We've been working on improving the voting workload for benchmarks along two axes: Firstly, reduce the (slight) overhead that
+decentralized vote submission induces. Secondly, introduce a scaling parameter - namely the number of votes submitted per transaction, and hence
+the number of proposals to be considered simultaneously for tallying and ratification. On the way, we improved upon timing of submissions, as
+this has caused benchmarks to abort mid-run every now and then: in those cases, a newly created UTxO entry just hadn't settled across the cluster when it was
+supposed to be reused for consumption.  
+
+Scaling of the voting workload is currently under analysis.
+
+
+### Development
+
+Our analysis and reporting tool, `locli` ("LogObject CLI") has a few drawbacks as far as system resource usage goes; it requires a huge
+amount of RAM, and initialization (i.e. loading and parsing trace output) is quite slow. Moreover, there is no intermediate, potentially
+exposable or queryable, representation of data besides the trace messages themselves.  
+
+ We're working on a prototype that introduces a database persistence layer as that intermediate representation. Not only does that open
+ up raw benchmarking data to other means of querying or processing outside `locli`. Initializing the tool from the database has also shown
+ to require much less RAM, and to improve duration of the initialization phase. Furthermore, on-disk representation is much more efficient that
+ way - which is no small benefit when raw benchmarking data for a single run can occupy north of 64GiB.
+
+ The prototype has yet to be fully integrated into the analysis pipeline for validation, however, initial observations are promising. 
+
+
+### Workbench
+
+For our benchmarks, we rely on staked geneses, as the cluster needs control all stake, and such, block production, to yield meaningful performance
+metrics. As creating a staked genesis of that extent is an expensive operation, we use a caching mechanism for those. Small changes in the benchmarking
+profile, such as protocol version or parameters, Plutus cost models or execution budgets would usually trigger the creation of a new cache entry. We've
+now factored out from cache entry resolution all those variables that do not impact staking itself. We then created a mechnanism to patch those
+changes into genesis files after cache retrieval, when preparing them for a benchmarking run. This adds flexibility for creating profiles, and reduces the
+time to deploy a run to the cluster.  
+
+We also delivered a comprehensive description of our cluster to the Leios innovation team. This includes the definition of our artificially constrained
+topology, as well as a latency matrix for node connections in that topology, assigning a weight to all edges in the graph. The Leios team intends
+to use that material to implement a large-scale simulation of the algorithm, and thus gain representative timings for diffusion and propagation.
+
+
+### Tracing
+
+The alignment of metrics names between legacy and new tracing system is now complete - which should minimize the migration effort of existing dashboards for the community. The only differences that remain are motivated by increasing compliance with existing standards like e.g. OpenMetrics. Furthermore, a few metrics still
+missing in the new system have now been ported over, such as `node.start.time` or `served.block.latest`.  
+
+We're all set for the expected bump to `typed-protocols-0.3`: both forwarder packages `trace-forward` and `ekg-forward` for the new tracing
+system have been adapted to the new API and are passing all tests.
+
+Last not least, we've settled on a rollout plan for the new tracing system. The new system will set to be the **default** with the upcoming Node
+release `10.2`. This is achieved by a change of configuration only - there is no need for different Node builds. The `cardano-node` binary
+will contain both tracing systems for a considerable grace period: 3 - 6 months after release. This should give the community ample time to
+adjust for necessary changes in downstream services or dashboards that consume trace or metrics output.  
+
+We'll provide a comprehensive hands-on migration guide summarizing those changes for the user.