Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PyDarshan job summary cleanup #910

Closed
13 tasks done
shanedsnyder opened this issue Feb 28, 2023 · 4 comments
Closed
13 tasks done

PyDarshan job summary cleanup #910

shanedsnyder opened this issue Feb 28, 2023 · 4 comments

Comments

@shanedsnyder
Copy link
Contributor

shanedsnyder commented Feb 28, 2023

From an in-person review of the current Darshan job summary reports at Argonne, it would be nice to try to address as much of this as we can before doing a release.

First some high-level things up for discussion:

  • drop blue I/O performance estimate text box, and instead add a per-module table detailing key details of the module (including total files/datasets(H5D)/variables(PNETCDF_VAR) accessed, total bytes read/written, and I/O performance estimate)
    • this way, each module starts off with a high-level summary table that describes key details of I/O, and gets rid of awkward text object we are using now
  • disable DXT heatmap plot generation by default, print warning if DXT is available but HEATMAP isn't, and add CLI option to force DXT usage
    • for some logs, it's probably just too expensive to generate DXT heatmaps by default, so this protects against that
    • this would require us to either push a little on CLI stuff in WIP, ENH: rich-click CLI #809 or use something simple temporarily?
  • after dust settles, potentially reorganize plots/tables in per-module sections?
    • nothing concrete to suggest just yet (Tyler -0.5 on this)

Smaller things:

  • Captions are hard to read (light colored text, also italicized)
    image
  • Add "% runtime" y-axis label to right-side of I/O cost plot
    image
  • Offset autolabeling text at 45 degrees in access size histogram plot and POSIX access pattern plot as is done in I/O op count graph
  • Don't autolabel 0 values to further help with cluttering of plots
  • If not too complicated, can we add more buffer to top of these plots to avoid autolabels overlapping with plot boundaries?
  • We should use a ylim of 0 for access histogram and op count plots to ensure no negative values are plotted.
    image
  • Erroneous "type" row in File Count Summary table
    image
  • drop scientific notation entirely (partially addressed in MAINT: integer r/w vals for data access #812?), and rely on humanize to generate sensible human readable byte strings?
  • File access by category table font really small
    image
  • Pointed out in ENH: PyDarshan job summary print warning and bail on logs with no data #907, add a forgotten space to our warning string when a log has no data
@tylerjereddy
Copy link
Collaborator

There is some duplication with i.e., gh-804 here as well.

@shanedsnyder
Copy link
Contributor Author

I embarrassingly can't figure this one out:

Erroneous "type" row in File Count Summary table

Seems like it ought to be a one-liner to move the "type" index up to a column like other values, but I can't quite seem to pull it off.

Most of the other issues I've already addressed in #907

@tylerjereddy
Copy link
Collaborator

It wasn't an error--it was an intentional index label. You can remove it, but then you'll need to shim the test suite as well (which is pretty good to do even for aesthetic changes anyway). This is pretty straightforward, but it wasn't a bug (it was intentional) so I'd prefer not using this approach:

--- a/darshan-util/pydarshan/darshan/lib/accum.py
+++ b/darshan-util/pydarshan/darshan/lib/accum.py
@@ -71,7 +71,6 @@ def log_file_count_summary_table(derived_metrics,
                              "read/write files":3}
     df = pd.DataFrame.from_dict(darshan_file_category, orient="index")
     df.rename(columns={0:"index"}, inplace=True)
-    df.index.rename('type', inplace=True)
     df["number of files"] = np.zeros(4, dtype=int)
     df["avg. size"] = np.zeros(4, dtype=str)
     df["max size"] = np.zeros(4, dtype=str)
diff --git a/darshan-util/pydarshan/darshan/tests/test_lib_accum.py b/darshan-util/pydarshan/darshan/tests/test_lib_accum.py
index 080264d1..3d23fb76 100644
--- a/darshan-util/pydarshan/darshan/tests/test_lib_accum.py
+++ b/darshan-util/pydarshan/darshan/tests/test_lib_accum.py
@@ -203,7 +203,6 @@ def test_file_count_summary_table(log_name,
                          "read-only files",
                          "write-only files",
                          "read/write files"]
-    expected_df.index.rename('type', inplace=True)
 
     log_path = get_log_path(log_name)
     with darshan.DarshanReport(log_path, read_all=True) as report:

image

I'd prefer if you didn't mutate the data structures for the sole purpose of removing a row in the report, and instead do something like this (a small test may be warranted) so that end users get what they want from to_html but no need to inconvenience developers who like index labels:

--- a/darshan-util/pydarshan/darshan/lib/accum.py
+++ b/darshan-util/pydarshan/darshan/lib/accum.py
@@ -100,5 +100,6 @@ def log_file_count_summary_table(derived_metrics,
     df.drop(columns="index", inplace=True)
     ret = plot_common_access_table.DarshanReportTable(df,
                                                       col_space=200,
-                                                      justify="center")
+                                                      justify="center",
+                                                      index_names=False)
     return ret

@shanedsnyder
Copy link
Contributor Author

Last bits of this closed via #927.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants