PyDarshan: Killed when creating Darshan report for very large darshan logs #779

Open
hammad45 opened this issue Jul 22, 2022 · 12 comments · May be fixed by #784
Comments

@hammad45

I am using PyDarshan to read and parse Darshan DXT logs. The library works fine for small log files, but when I try to generate a report for very large log files, the process gets killed. I am using the following code to generate the report:

report = darshan.DarshanReport(self.args.darshan, read_all=True)

self.args.darshan contains arguments such as the Darshan filename, start time, end time, etc.

After running the code, I get a "Killed" error message. I have also traced through the report.py code to see where the problem is: execution stops in the mod_read_all_dxt_records function, probably because the file is too large to process, and that is when the process is killed.
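For reference, the failing step can be isolated with something like the following minimal sketch (it assumes a plain .darshan file path is passed rather than the full self.args object, and that the log contains DXT_POSIX records):

```python
# Minimal sketch to isolate the failing step (assumes `logfile` is a single
# .darshan file path and that the log contains DXT_POSIX records).
import darshan

logfile = "run_read_cscratch_lustre.45812167.darshan"

# defer loading records so the expensive DXT step can be triggered on its own
report = darshan.DarshanReport(logfile, read_all=False)

# this is the call where execution stops and the process gets killed
report.mod_read_all_dxt_records("DXT_POSIX")
```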

I have also attached the log files for which I am getting this issue. The files can be found here. Please let me know if anything else is required.

@shanedsnyder
Contributor

Hi @hammad45

We've received similar reports from other users and have plans to revamp our methods for extracting DXT data and other record data. Unfortunately, I don't think we have any suggestions in the short term. To improve this, we'll likely need to make some changes to our underlying C log parsing libraries as well as to PyDarshan, so it's not something we can reasonably tune right now.

Thanks for the report and for sharing your log files with us so we have some examples to try to improve. The goal is to have this sorted out for our next release, but that's probably at least a couple of months out. I'll leave this issue open and try to update as we make progress -- it's possible we could have an experimental feature branch you could try out ahead of an official release.

@tylerjereddy
Collaborator

> Thanks for the report and for sharing your log files with us

@shanedsnyder where'd you get the log files? I can't seem to access them on my end.


@jeanbez
Contributor

jeanbez commented Jul 25, 2022

@shanedsnyder once you have an experimental branch, please let us know! @hammad45 is working with us to update the DXT Explorer to use PyDarshan, and we ran into this issue with some of the logs that worked fine with the default parser.

@tylerjereddy
Collaborator

@jeanbez can you see the logs on the Google Drive link? I'm wondering if it is just getting blocked on my network or something. Anyway, I'd like access to one.

@shanedsnyder
Contributor

Thanks for looking @tylerjereddy, I actually don't see the logs either.

@hammad45
Author

@tylerjereddy Apologies, the logs were on my Google Drive but were accidentally deleted. I'll upload them again. Thank you.

@hammad45
Author

@tylerjereddy Here is the link to the log files.

@tylerjereddy
Collaborator

@hammad45 I can't reproduce any crashes, though I can see how they'd happen with less memory, and the report objects are generated within about a minute for me. The only thing I'd say is that the memory footprint is indeed a bit high--in the worst case some of the logs use about 1/3 of the 128 GB of memory on my Linux box, so you'd want to use a high memory node/resource if possible.

I'll take a look at the code paths to see if something can be streamlined a bit on the Python side, but this seems somewhat tractable on, e.g., a supercomputer node, though likely not on a laptop.

| Log file | Time to produce report object |
| --- | --- |
| run_read_cscratch_lustre.45812167.darshan | 1 minute, 8 seconds |
| run_read_cscratch_base.45799327.darshan | 1 minute, 8 seconds |
| dbwy_read_driver_id45426645_8-12-35962-12972753481900684290_1628788319.darshan | 33 seconds |

That said, if you want to generate an actual HTML summary report with, e.g.,

python -m darshan summary run_read_cscratch_lustre.45812167.darshan

it looks like that would be quite problematic as the memory footprint will exceed 128 GB, probably because of issues related to gh-692.
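
For anyone who wants to check the footprint on their own machine, a rough sketch (not necessarily the exact methodology used for the numbers above) is:

```python
# Rough sketch: time report construction and check peak RSS afterwards.
import resource
import time

import darshan

logfile = "run_read_cscratch_lustre.45812167.darshan"

t0 = time.perf_counter()
report = darshan.DarshanReport(logfile, read_all=True)
elapsed = time.perf_counter() - t0

# on Linux, ru_maxrss is reported in kilobytes
peak_gib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / (1024 ** 2)
print(f"report object built in {elapsed:.1f} s, peak RSS ~{peak_gib:.1f} GiB")
```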

@jeanbez
Contributor

jeanbez commented Jul 26, 2022

@tylerjereddy, since we are using PyDarshan to build some interactive visualizations (by replacing the regular command-line parser), users would normally run it on laptops. In our tests, we used one with 16 GB of memory, which could explain the failures compared to your test.

@tylerjereddy
Collaborator

Understood, I'll try to help if I can.

tylerjereddy added a commit to tylerjereddy/darshan that referenced this issue Jul 27, 2022
Fixes darshan-hpc#779

* at the moment on `main`, DXT record data is effectively
stored as a list of dictionaries of lists of dictionaries
that look like this:

```
DXT_list -> [rec0, rec1, ..., recN]
recN -> {"id":, ...,
          "rank":, ...,
          "write_segments": ...,
          ...}
recN["write_segments"] -> [seg0, seg1, ..., segN]
segN -> {"offset": int,
         "length": int,
         "start_time": float,
         "end_time": float}
```

- the list of segments is extremely memory inefficient, with
the smallest file in the matching issue exceeding 20 GB of
physical memory in `mod_read_all_dxt_records`:

```
Line #    Mem usage    Increment  Occurrences   Line Contents
   852                                                 # fetch records
   853   92.484 MiB   18.820 MiB           1           rec = backend.log_get_dxt_record(self.log, mod, dtype=dtype)
   854 20295.188 MiB    0.773 MiB        1025           while rec != None:
   855 20295.188 MiB    0.000 MiB        1024               self.records[mod].append(rec)
   856 20295.188 MiB    0.000 MiB        1024               self.data['modules'][mod]['num_records'] += 1
   857
   858                                                     # fetch next
   859 20295.188 MiB 20201.930 MiB        1024               rec = backend.log_get_dxt_record(self.log, mod, reads=reads, writes=writes, dtype=dtype)
```

- if we switch to NumPy arrays the memory footprint drops a lot
(see below),
and the performance informally seems similar (36 seconds
vs. 33 seconds on `main` to produce a `report` object
with smallest file in matching issue):

```
Line #    Mem usage    Increment  Occurrences   Line Contents
   859 3222.547 MiB 3146.344 MiB        1024               rec = backend.log_get_dxt_record(self.log, mod, reads=reads, writes=writes, dtype=dtype)
```

- this branch currently uses NumPy record arrays,
because I thought they'd be a better fit for a data
structure with 2 int columns and 2 float columns;
however, there is a big performance hit over
regular NumPy arrays (almost 6 minutes vs. 33
seconds for the smallest file in the matching issue);
so, if we could live without the extra dtype
structuring of a recarray, maybe that would be best
(we could also try to use a pandas dataframe, which
is another natural fit for dtype columns..)
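
As a rough, hypothetical illustration of the layout difference described in the commit message above (not the code from the linked PR), storing segments as one structured NumPy array instead of a list of per-segment dicts collapses the per-object overhead:

```python
# Hypothetical comparison: DXT segments as a list of dicts vs. one structured
# NumPy array with 2 int and 2 float columns.
import sys
import numpy as np

n_segments = 1_000_000

# current layout on `main`: one dict per segment
seg_dicts = [
    {"offset": i, "length": 4096, "start_time": float(i), "end_time": float(i) + 0.001}
    for i in range(n_segments)
]
# rough lower bound: container plus per-dict overhead (ignores the value objects)
dict_bytes = sys.getsizeof(seg_dicts) + sum(sys.getsizeof(d) for d in seg_dicts)

# proposed layout: a single contiguous structured array
seg_dtype = np.dtype(
    [("offset", np.int64), ("length", np.int64),
     ("start_time", np.float64), ("end_time", np.float64)]
)
seg_array = np.zeros(n_segments, dtype=seg_dtype)
array_bytes = seg_array.nbytes  # 32 bytes per segment

print(f"list of dicts   : ~{dict_bytes / 1e6:.0f} MB (underestimate)")
print(f"structured array: ~{array_bytes / 1e6:.0f} MB")
```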
tylerjereddy linked a pull request Jul 27, 2022 that will close this issue
@tylerjereddy
Collaborator

See gh-784 for a draft approach with improved memory efficiency. There is a detailed discussion there about performance tradeoffs, so I may need to walk back the decision to use record arrays.

@tylerjereddy
Collaborator

Keep in mind that basically all of the approaches I describe there involve changing how we store DXT data, so you may need to adjust for that. Anyway, I think the prospects for improvement on the Python side are pretty solid based on the draft code there, especially if we can revert back to a NumPy array instead of the slower recarrays.

tylerjereddy added a commit to tylerjereddy/darshan that referenced this issue Aug 23, 2022
tylerjereddy added a commit to tylerjereddy/darshan that referenced this issue Sep 19, 2022