Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvs: add transaction stats #6556

Merged
merged 4 commits into from
Jan 21, 2025
Merged

Conversation

chu11
Copy link
Member

@chu11 chu11 commented Jan 16, 2025

Problem: In the future we would like to put a cap on the maximum number
of operations in a transaction. However, we do not have any stat tracking
of operations in a transaction.

Add transaction stats for commits and fences in the kvs module.


This is really a "prep" for #6125. I decided to split out the stats for commits & fences, since they are quite different. For example, if every process in an job did a fenced KVS transaction with a single op, this scales up with the size of cluster. So we want to count that differently than each op individually.

Note that I could have done "stats per namespace", but that wasn't really the information that we want. We mostly want to know what is the average and what is the max. So I did it "globally" vs per namespace.

@chu11
Copy link
Member Author

chu11 commented Jan 17, 2025

oops, looks like our CI doesn't have bc so everything failed :-) Re-worked the tests to use jq instead (actually is much cleaner now).

Edit: agh ... try again. CI didn't like my use of single quotes.

@chu11 chu11 force-pushed the kvs_txn_stats branch 5 times, most recently from 4506ffa to d8deacd Compare January 18, 2025 00:34
@chu11
Copy link
Member Author

chu11 commented Jan 18, 2025

oops, i totally forgot about stats clearing. Tweaked and added a test.

Copy link
Member

@garlick garlick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM once the test failures are worked out.

I did just have a couple minor suggestions.

Also a passing comment that we should fix the stats key names that have spaces and # in them at some point.

Comment on lines 2404 to 2418
if (!(txncstats = json_pack ("{ s:i s:i s:f s:f s:i }",
"count", tstat_count (&ctx->txn_commit_stats),
"min", (int)tstat_min (&ctx->txn_commit_stats),
"mean", tstat_mean (&ctx->txn_commit_stats),
"stddev", tstat_stddev (&ctx->txn_commit_stats),
"max",
(int)tstat_max (&ctx->txn_commit_stats))))
goto nomem;

if (!(txnfstats = json_pack ("{ s:i s:i s:f s:f s:i }",
"count", tstat_count (&ctx->txn_fence_stats),
"min", (int)tstat_min (&ctx->txn_fence_stats),
"mean", tstat_mean (&ctx->txn_fence_stats),
"stddev", tstat_stddev (&ctx->txn_fence_stats),
"max", (int)tstat_max (&ctx->txn_fence_stats))))
goto nomem;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the same as the tstats object except max is packed as an intetger.

Would adding a add_tstat_to_object() function be useful here to get all three in the same format, reduce code duplication, and avoid the need for a temporary variable per object?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like a good idea. However, I do dislike that the min and max for op-count would be a double if we created a common function for this. Although it's not the end of the world if that were the case.

For the cache stats, I suppose min and max should also always be integers (or big Ints) too. However, that would be changing the output format. Hmmm, lemme ponder this detail for a bit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing the output format here doesn't seem so bad.

"cache", cstats,
"namespace", nsstats,
"transactions",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: make this key more descriptive like transaction-opcount?

chu11 added 4 commits January 21, 2025 13:27
Problem: The min/max cache object size stat is returned as floating
point, even though such a statistic cannot have a decimal place by
definition (in contrast to lets say "mean" or "stdev").

Return the min/max cache object size as an integer instead of a
floating point.
Problem: In the future we would like to put a cap on the maximum number
of operations in a transaction.  However, we do not have any stat tracking
of operations in a transaction.

Add transaction stats for commits and fences in the kvs module.
Problem: The testing of stats clearing in the KVS is done
a little bit before the end of a set of tests.  This can be
inconvenient for future tests that do not wish stats to be cleared.

Move the clearing of stats tests towards the end of tests in
t1001-kvs-internals.t.
Problem: There is no coverage for the new KVS transaction stats.

Add coverage in t1001-kvs-internals.t.
@chu11
Copy link
Member Author

chu11 commented Jan 21, 2025

ok, re-pushed per comments above. The big change is that the cache object stats now use ints for the output of min and max.

That allowed me to create a convenience function for the cache object stats & the new transaction stats to generation objects for tstat.

I elected to go with the convenience function instead of a new tstats function b/c of A) the casting of the min/max to ints and B) the scaling option to scale the stats (used by cache object stats). It just seemed like too much special stuff just for the kvs.

Comment on lines 2467 to 2471
json_decref (tstats);
json_decref (cstats);
json_decref (txncstats);
json_decref (txnfstats);
json_decref (nsstats);
Copy link
Member

@garlick garlick Jan 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the convenience function added a key to an existing object rather than return an object, it would save on the temp variables.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had thought about that, but that's only for the first case w/ the object stats here.

    if (!(cstats = json_pack ("{ s:f s:O s:i s:i s:i }",                                                                                    
                              "obj size total (MiB)", (double)size/1048576,                                                                 
                              "obj size (KiB)", tstats,                                                                                     
                              "#obj dirty", dirty,                                                                                          
                              "#obj incomplete", incomplete,                                                                                
                              "#faults", ctx->faults)))                                                                                     
        goto nomem;   

Most of the "object' construction is done with a rpc response at the end:

    if (flux_respond_pack (h,                                                                                                               
                           msg,                                                                                                             
                           "{ s:O s:O s:{s:O s:O} s:i }",                                                                                   
                           "cache", cstats,                                                                                                 
                           "namespace", nsstats,                                                                                            
                           "transaction-opcount",                                                                                           
                             "commit", txncstats,                                                                                           
                             "fence", txnfstats,                                                                                            
                           "pending_requests", zhashx_size (ctx->requests)) < 0)                                                            
        flux_log_error (h, "%s: flux_respond_pack", __FUNCTION__);   

we could create a "transaction-ops" object with json_object() and pass it around. It seemed like a "meh" win to me ... feel strongly about it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope :-)

@chu11
Copy link
Member Author

chu11 commented Jan 21, 2025

thanks, setting MWP

@mergify mergify bot merged commit 2088011 into flux-framework:master Jan 21, 2025
35 checks passed
Copy link

codecov bot commented Jan 21, 2025

Codecov Report

Attention: Patch coverage is 78.57143% with 6 lines in your changes missing coverage. Please review.

Project coverage is 79.46%. Comparing base (d32d543) to head (dffe9e4).
Report is 5 commits behind head on master.

Files with missing lines Patch % Lines
src/modules/kvs/kvs.c 78.57% 6 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           master    #6556       +/-   ##
===========================================
+ Coverage   41.41%   79.46%   +38.04%     
===========================================
  Files         478      531       +53     
  Lines       79878    88309     +8431     
===========================================
+ Hits        33082    70171    +37089     
+ Misses      46796    18138    -28658     
Files with missing lines Coverage Δ
src/modules/kvs/kvs.c 72.43% <78.57%> (+31.91%) ⬆️

... and 475 files with indirect coverage changes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants