PyDarshan: Killed when creating Darshan report for very large darshan logs #779
Comments
Hi @hammad45. We've received similar reports from other users and have plans to revamp our methods for extracting DXT data and other record data. Unfortunately, I don't think we have any suggestions in the shorter term: to improve this, we'll likely need to make changes to our underlying C log-parsing libraries as well as to PyDarshan, so it's not something we can reasonably tune right now. Thanks for the report and for sharing your log files with us so we have some examples to improve against. The goal is to have this sorted out for our next release, but that's probably at least a couple of months out. I'll leave this issue open and post updates as we make progress -- it's possible we could have an experimental feature branch you could try out ahead of an official release.
@shanedsnyder where did you get the log files? I can't seem to access them on my end.
@shanedsnyder once you have an experimental branch, please let us know! @hammad45 is working with us to update the DXT Explorer to use PyDarshan, and we ran into this issue with some logs that worked fine with the default parser.
@jeanbez can you see the logs at the Google Drive link? I'm wondering if it's just getting blocked on my network or something. Either way, I'd like access to one of them.
Thanks for looking, @tylerjereddy; I actually don't see the logs either.
@tylerjereddy Apologies, the logs were on my Google Drive but accidentally got deleted. I'll upload them again. Thank you.
@tylerjereddy Here is the link to the log files.
@hammad45 I can't reproduce any crashes, though I can see how they'd happen with less memory. I'll take a look at the code paths to see if something can be streamlined a bit on the Python side, but building the report object seems somewhat tractable on, say, a supercomputer node, though likely not on a laptop.
That said, if you want to generate an actual HTML summary report, that looks like it would be quite problematic: the memory footprint will exceed 128 GB, probably because of issues related to gh-692.
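For reference, the kind of peak-memory measurement being discussed here can be reproduced locally with a short script. This is only an illustrative sketch: it assumes the third-party `memory_profiler` package is installed, and the log path is a placeholder.

```python
# Illustrative sketch: measure peak resident memory while building a report.
# Assumes `pip install memory_profiler`; "example_dxt.darshan" is a placeholder path.
from memory_profiler import memory_usage

import darshan


def build_report(log_path):
    # read_all=True eagerly loads all record data, including DXT segments
    return darshan.DarshanReport(log_path, read_all=True)


samples = memory_usage((build_report, ("example_dxt.darshan",), {}))
print(f"peak memory while building the report: {max(samples):.1f} MiB")
```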
@tylerjereddy, since we are using PyDarshan to build some interactive visualizations (replacing the regular command-line parser), users would normally run it on laptops. In our tests we used one with 16 GB of memory, which could explain the failures compared to your test.
Understood, I'll try to help if I can.
Fixes darshan-hpc#779

* At the moment on `main`, DXT record data is effectively stored as a list of dictionaries of lists of dictionaries that look like this:

```
DXT_list -> [rec0, rec1, ..., recN]
recN -> {"id": ..., "rank": ..., "write_segments": ..., ...}
recN["write_segments"] -> [seg0, seg1, ..., segN]
segN -> {"offset": int, "length": int, "start_time": float, "end_time": float}
```

* The list of segments is extremely memory inefficient, with the smallest file in the matching issue exceeding 20 GB of physical memory in `mod_read_all_dxt_records`:

```
Line #      Mem usage     Increment  Occurrences   Line Contents
   852                                             # fetch records
   853     92.484 MiB    18.820 MiB            1   rec = backend.log_get_dxt_record(self.log, mod, dtype=dtype)
   854  20295.188 MiB     0.773 MiB         1025   while rec != None:
   855  20295.188 MiB     0.000 MiB         1024       self.records[mod].append(rec)
   856  20295.188 MiB     0.000 MiB         1024       self.data['modules'][mod]['num_records'] += 1
   857
   858                                                 # fetch next
   859  20295.188 MiB 20201.930 MiB         1024       rec = backend.log_get_dxt_record(self.log, mod, reads=reads, writes=writes, dtype=dtype)
```

* If we switch to NumPy arrays, the memory footprint drops a lot (see below), and the performance informally seems similar (36 seconds vs. 33 seconds on `main` to produce a `report` object with the smallest file in the matching issue):

```
Line #      Mem usage     Increment  Occurrences   Line Contents
   859   3222.547 MiB  3146.344 MiB         1024       rec = backend.log_get_dxt_record(self.log, mod, reads=reads, writes=writes, dtype=dtype)
```

* This branch currently uses NumPy record arrays, because I thought they'd be a better fit for a data structure with 2 int columns and 2 float columns; however, there is a big performance hit over regular NumPy arrays (almost 6 minutes vs. 33 seconds for the smallest file in the matching issue). So, if we could live without the extra dtype structuring of a recarray, maybe that would be best (we could also try a pandas DataFrame, which is another natural fit for dtype columns).
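For illustration, here is a minimal sketch (not the actual PyDarshan implementation) of the per-segment storage change described in the comment above: the same four segment fields held either as a list of dictionaries or as a single NumPy structured array. Only the field names and dtypes come from the layout quoted above; the data values are made up.

```python
# Minimal sketch of the layout change: list-of-dicts vs. one NumPy structured array.
import numpy as np

seg_dtype = np.dtype([("offset", np.int64),
                      ("length", np.int64),
                      ("start_time", np.float64),
                      ("end_time", np.float64)])

# current layout on `main`: one Python dict (plus boxed values) per segment
segments_as_dicts = [
    {"offset": 0,    "length": 4096, "start_time": 0.01, "end_time": 0.02},
    {"offset": 4096, "length": 4096, "start_time": 0.02, "end_time": 0.03},
]

# candidate layout: one contiguous buffer, 32 bytes per segment
segments_as_array = np.array(
    [(d["offset"], d["length"], d["start_time"], d["end_time"]) for d in segments_as_dicts],
    dtype=seg_dtype,
)

# column access no longer touches per-segment Python objects
print(segments_as_array["length"].sum())
```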
See gh-784 for a draft approach with improved memory efficiency. There is a detailed discussion there about performance tradeoffs, so I may need to walk back the decision to use record arrays.
Keep in mind that basically all of the approaches I describe there involve changing how we store DXT data, so you may need to adjust for that. Anyway, I think the prospects for improvement on the Python side are pretty solid based on the draft code there, especially if we can revert to a plain NumPy array instead of the slower recarrays.
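To make the "you may need to adjust for that" point concrete, here is a hypothetical before/after sketch of how downstream code (for example, a tool summing up DXT write sizes) might change if `write_segments` moves from a list of dictionaries to a structured NumPy array. The dictionary keys match the layout quoted earlier; `rec` is a placeholder for a single DXT record.

```python
import numpy as np


def total_write_bytes_dicts(rec):
    # current layout on `main`: rec["write_segments"] is a list of per-segment dicts
    return sum(seg["length"] for seg in rec["write_segments"])


def total_write_bytes_array(rec):
    # candidate layout: rec["write_segments"] is a structured NumPy array
    return int(rec["write_segments"]["length"].sum())
```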
I am using PyDarshan to read and parse Darshan DXT logs. The library works fine for small log files, but when I try to generate a Darshan report for very large log files, the process gets killed while generating the report. I am using the following code to generate the report:

`report = darshan.DarshanReport(self.args.darshan, read_all=True)`

`self.args.darshan` contains arguments such as the Darshan filename, start time, end time, etc. After running the code, I get a `killed` error message. I have also diagnosed the `report.py` code to see where the problem is. The execution stops at the `mod_read_all_dxt_records` function, probably because the file is too large to process, and that is when I get the `killed` error message. I have also attached the log files for which I am getting this issue. The files can be found here. Please let me know if anything else is required.
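Not an official fix, but one possible stopgap while the DXT path is being reworked: if an analysis does not strictly need the DXT segment data, the expensive read can be avoided by opening the log lazily and pulling in only the modules that are needed. This is a hedged sketch; the log path is a placeholder, and it assumes the `read_all=False` option and the `mod_read_all_records` method behave as in the PyDarshan releases current at the time of this issue.

```python
import darshan

# Open the log without eagerly reading every record (DXT segments are not loaded).
report = darshan.DarshanReport("example_dxt.darshan", read_all=False)  # placeholder path

# Read only the cheaper, counter-based module(s) the analysis actually needs.
report.mod_read_all_records("POSIX")
```

Of course, for tools like DXT Explorer the DXT segments are the whole point, so this only postpones the problem rather than solving it.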