PyDarshan: Killed when creating Darshan report for very large darshan logs #779

Open
hammad45 opened this issue Jul 22, 2022 · 12 comments · May be fixed by #784
Comments

@hammad45

I am using PyDarshan to read and parse Darshan DXT logs. The library works fine for small log files, but when I try to generate a report for very large log files, the process gets killed. I am using the following code to generate the report:

report = darshan.DarshanReport(self.args.darshan, read_all=True)

self.args.darshan contains arguments such as the Darshan filename, start time, end time, etc.

After running the code, I get a "Killed" error message. I have also traced through the report.py code to see where the problem is: execution stops in the mod_read_all_dxt_records function, probably because the file is too large to process, and that is when the process is killed.
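For reference, the failing step can be isolated with something like the following minimal sketch (it assumes a plain .darshan file path is passed rather than the full self.args object, and that the log contains DXT_POSIX records):

```python
# Minimal sketch to isolate the failing step (assumes `logfile` is a single
# .darshan file path and that the log contains DXT_POSIX records).
import darshan

logfile = "run_read_cscratch_lustre.45812167.darshan"

# defer loading records so the expensive DXT step can be triggered on its own
report = darshan.DarshanReport(logfile, read_all=False)

# this is the call where execution stops and the process gets killed
report.mod_read_all_dxt_records("DXT_POSIX")
```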

I have also attached the log files for which I am getting this issue. The files can be found here. Please let me know if anything else is required.

@shanedsnyder
Contributor

Hi @hammad45

We've received similar reports from other users and have plans to revamp our methods for extracting DXT data and other record data. Unfortunately, I don't think we have any suggestions in the short term. To improve this, we'll likely need to make some changes to our underlying C log parsing libraries as well as to PyDarshan, so it's not something we can reasonably tune right now.

Thanks for the report and for sharing your log files with us so we have some examples to try to improve. The goal is to have this sorted out for our next release, but that's probably at least a couple of months out. I'll leave this issue open and try to update as we make progress -- it's possible we could have an experimental feature branch you could try out ahead of an official release.

@tylerjereddy
Collaborator

> Thanks for the report and for sharing your log files with us

@shanedsnyder where'd you get the log files? I can't seem to access them on my end.


@jeanbez
Contributor

jeanbez commented Jul 25, 2022

@shanedsnyder once you have an experimental branch, please let us know! @hammad45 is working with us to update the DXT Explorer to use PyDarshan, and we ran into this issue with some of the logs that worked fine with the default parser.

@tylerjereddy
Collaborator

@jeanbez can you see the logs on the Google Drive link? I'm wondering if it is just getting blocked on my network or something. Anyway, I'd like access to one.

@shanedsnyder
Contributor

Thanks for looking @tylerjereddy, I actually don't see the logs either.

@hammad45
Author

@tylerjereddy Apologies, the logs were on my Google Drive but were accidentally deleted. I'll upload them again. Thank you.

@hammad45
Author

@tylerjereddy Here is the link to the log files.

@tylerjereddy
Collaborator

@hammad45 I can't reproduce any crashes, though I can see how they'd happen with less memory, and the report objects are generated within about a minute for me. The only thing I'd say is that the memory footprint is indeed a bit high--in the worst case some of the logs use about 1/3 of the 128 GB of memory on my Linux box, so you'd want to use a high memory node/resource if possible.

I'll take a look at the code paths to see if something can be streamlined a bit on the Python side, but this seems somewhat tractable on, e.g., a supercomputer node, though likely not on a laptop.

| Log file | Time to produce report object |
| --- | --- |
| run_read_cscratch_lustre.45812167.darshan | 1 minute, 8 seconds |
| run_read_cscratch_base.45799327.darshan | 1 minute, 8 seconds |
| dbwy_read_driver_id45426645_8-12-35962-12972753481900684290_1628788319.darshan | 33 seconds |

That said, if you want to generate an actual HTML summary report with, e.g.,

python -m darshan summary run_read_cscratch_lustre.45812167.darshan

it looks like that would be quite problematic as the memory footprint will exceed 128 GB, probably because of issues related to gh-692.
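
For anyone who wants to check the footprint on their own machine, a rough sketch (not necessarily the exact methodology used for the numbers above) is:

```python
# Rough sketch: time report construction and check peak RSS afterwards.
import resource
import time

import darshan

logfile = "run_read_cscratch_lustre.45812167.darshan"

t0 = time.perf_counter()
report = darshan.DarshanReport(logfile, read_all=True)
elapsed = time.perf_counter() - t0

# on Linux, ru_maxrss is reported in kilobytes
peak_gib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / (1024 ** 2)
print(f"report object built in {elapsed:.1f} s, peak RSS ~{peak_gib:.1f} GiB")
```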

@jeanbez
Contributor

jeanbez commented Jul 26, 2022

@tylerjereddy, since we are using PyDarshan to build some interactive visualizations (by replacing the regular command-line parser), users would normally run it on laptops. In our tests, we used one with 16 GB of memory, which could explain the failures compared to your test.

@tylerjereddy
Collaborator

Understood, I'll try to help if I can.

tylerjereddy added a commit to tylerjereddy/darshan that referenced this issue Jul 27, 2022
Fixes darshan-hpc#779

* at the moment on `main`, DXT record data is effectively
stored as a list of dictionaries of lists of dictionaries
that look like this:

```
DXT_list -> [rec0, rec1, ..., recN]
recN -> {"id":, ...,
          "rank":, ...,
          "write_segments": ...,
          ...}
recN["write_segments"] -> [seg0, seg1, ..., segN]
segN -> {"offset": int,
         "length": int,
         "start_time": float,
         "end_time": float}
```

- the list of segments is extremely memory inefficient, with
the smallest file in the matching issue exceeding 20 GB of
physical memory in `mod_read_all_dxt_records`:

```
Line #    Mem usage    Increment  Occurrences   Line Contents
   852                                                 # fetch records
   853   92.484 MiB   18.820 MiB           1           rec = backend.log_get_dxt_record(self.log, mod, dtype=dtype)
   854 20295.188 MiB    0.773 MiB        1025           while rec != None:
   855 20295.188 MiB    0.000 MiB        1024               self.records[mod].append(rec)
   856 20295.188 MiB    0.000 MiB        1024               self.data['modules'][mod]['num_records'] += 1
   857
   858                                                     # fetch next
   859 20295.188 MiB 20201.930 MiB        1024               rec = backend.log_get_dxt_record(self.log, mod, reads=reads, writes=writes, dtype=dtype)
```

- if we switch to NumPy arrays the memory footprint drops a lot
(see below),
and the performance informally seems similar (36 seconds
vs. 33 seconds on `main` to produce a `report` object
with smallest file in matching issue):

```
Line #    Mem usage    Increment  Occurrences   Line Contents
   859 3222.547 MiB 3146.344 MiB        1024               rec = backend.log_get_dxt_record(self.log, mod, reads=reads, writes=writes, dtype=dtype)
```

- this branch currently uses NumPy record arrays,
because I thought they'd be a better fit for a data
structure with 2 int columns and 2 float columns;
however, there is a big performance hit over
regular NumPy arrays (almost 6 minutes vs. 33
seconds for the smallest file in the matching issue);
so, if we could live without the extra dtype
structuring of a recarray, maybe that would be best
(we could also try to use a pandas dataframe, which
is another natural fit for dtype columns..)
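
As a rough, hypothetical illustration of the layout difference described in the commit message above (not the code from the linked PR), storing segments as one structured NumPy array instead of a list of per-segment dicts collapses the per-object overhead:

```python
# Hypothetical comparison: DXT segments as a list of dicts vs. one structured
# NumPy array with 2 int and 2 float columns.
import sys
import numpy as np

n_segments = 1_000_000

# current layout on `main`: one dict per segment
seg_dicts = [
    {"offset": i, "length": 4096, "start_time": float(i), "end_time": float(i) + 0.001}
    for i in range(n_segments)
]
# rough lower bound: container plus per-dict overhead (ignores the value objects)
dict_bytes = sys.getsizeof(seg_dicts) + sum(sys.getsizeof(d) for d in seg_dicts)

# proposed layout: a single contiguous structured array
seg_dtype = np.dtype(
    [("offset", np.int64), ("length", np.int64),
     ("start_time", np.float64), ("end_time", np.float64)]
)
seg_array = np.zeros(n_segments, dtype=seg_dtype)
array_bytes = seg_array.nbytes  # 32 bytes per segment

print(f"list of dicts   : ~{dict_bytes / 1e6:.0f} MB (underestimate)")
print(f"structured array: ~{array_bytes / 1e6:.0f} MB")
```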
tylerjereddy linked a pull request Jul 27, 2022 that will close this issue
@tylerjereddy
Collaborator

See gh-784 for a draft approach with improved memory efficiency. There is a detailed discussion there about performance tradeoffs, so I may need to walk back the decision to use record arrays.

@tylerjereddy
Collaborator

Keep in mind that basically all of the approaches I describe there involve changing how we store DXT data, so you may need to adjust for that. Anyway, I think the prospects for improvement on the Python side are pretty solid based on the draft code there, especially if we can revert back to a NumPy array instead of the slower recarrays.

tylerjereddy added a commit to tylerjereddy/darshan that referenced this issue Aug 23, 2022
tylerjereddy added a commit to tylerjereddy/darshan that referenced this issue Sep 19, 2022