Payu runs input checksums on every run when submitting with -n N #526

Open
Whyborn opened this issue Oct 2, 2024 · 3 comments

Contributor

Whyborn commented Oct 2, 2024

Payu re-submissions in a -n N run job trigger re-generating the input manifest. For small jobs, this becomes a significant portion of run time (maybe this is only relevant for staged_cable jobs?). I don't think there's any reason to recompute the input manifest for subsequent runs.

Whyborn added the feature label Oct 2, 2024
Collaborator

aidanheerdegen commented Oct 2, 2024

> Payu re-submissions in a -n N run job trigger re-generating the input manifest.

payu only checks the binhash hasn't changed. This should be a fast check. How many input files are there?

> For small jobs, this becomes a significant portion of run time (maybe this is only relevant for staged_cable jobs?). I don't think there's any reason to recompute the input manifest for subsequent runs.

The point of the manifests is to record everything that goes into a run. Are you adding files to the manifest that aren't actually used? Typically directories were specified in the input section in config.yaml because it was easy and compact, but it's also kinda lazy and not specific. Consequently we've moved to explicitly specifying each input file:

https://github.com/ACCESS-NRI/access-om2-configs/blob/release-025deg_jra55_ryf/config.yaml#L40-L52

This has the benefit of being much more specific about what the model needs to run, also any changes to specific input files are more "atomic" and are reflected directly in the config.yaml. Also it means we're calculating hashes only for the files that are used in the simulation.

There are exceptions though, e.g. JRA-55 RYF forcing data has a heap of files, so we use a directory

https://github.com/ACCESS-NRI/access-om2-configs/blob/release-025deg_jra55_ryf/config.yaml#L33

and even more for the IAF version

https://github.com/ACCESS-NRI/access-om2-configs/blob/release-025deg_jra55_iaf/config.yaml#L33-L43

Contributor Author

Whyborn commented Oct 8, 2024

At least one of the configurations we want to support has ~1000 input files (Met forcing files which are for some reason split into single year chunks). It might well be that the original dataset is not split like this, but the user who pulled it originally did it to make it easier to write the I/O handler.

I like moving to explicitly specifying the input files (does it support glob strings)?

@aidanheerdegen
Collaborator

> At least one of the configurations we want to support has ~1000 input files (Met forcing files which are for some reason split into single year chunks). It might well be that the original dataset is not split like this, but the user who pulled it originally did it to make it easier to write the I/O handler.

You have the option of using more CPUs. It is an embarrassingly parallel problem, so it will scale with nCPUs.
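To illustrate why hashing many input files parallelises well, here is a minimal sketch in which each file is hashed independently by a pool of workers. It is not payu's or yamanifest's code: `md5_of_file` and `hash_files` are hypothetical names, and a full-file MD5 stands in for whatever hash the manifest actually uses. Threads are used here because file reads and `hashlib` updates on large buffers release the GIL; a process pool would be a reasonable alternative.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_files(paths, max_workers=None):
    """Hash every file independently; returns {path: hex digest}.

    No file depends on any other, so the work can be spread over
    as many workers as there are CPUs available.
    """
    paths = [str(p) for p in paths]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs paths with digests.
        return dict(zip(paths, pool.map(md5_of_file, paths)))
```

Calling `hash_files(list_of_paths, max_workers=8)` returns a dictionary keyed by path. How much faster this is in practice depends on the filesystem as much as on the CPU count.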

We could create a version of binhash that reads in less of the header:

https://github.com/ACCESS-NRI/yamanifest/blob/7f9aaaddc2d31ebe1cd1b9d92ab2df349e35ba82/yamanifest/hashing.py#L86-L87

This would also need some manual testing beforehand to check if it is worth the bother, and I doubt it would make much difference (I think file ops like opening and closing have a big overhead).
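The manual test could look something like the sketch below. It assumes a binhash-like function that hashes a fixed-size prefix of the file together with its size and modification time; that design, the `prefix_hash` and `time_hashing` names, and the `max_bytes` parameter are all assumptions made for illustration and are not taken from yamanifest.

```python
import hashlib
import os
import time


def prefix_hash(path, max_bytes=1024 * 1024):
    """Hash at most max_bytes from the start of a file, plus size and mtime.

    Hypothetical stand-in for a cheaper change-detection hash.
    """
    st = os.stat(path)
    digest = hashlib.md5()
    with open(path, "rb") as f:
        digest.update(f.read(max_bytes))
    # Mix in metadata so edits beyond the prefix are still likely detected.
    digest.update(str(st.st_size).encode())
    digest.update(str(st.st_mtime_ns).encode())
    return digest.hexdigest()


def time_hashing(paths, max_bytes):
    """Return the wall-clock seconds taken to hash all paths serially."""
    start = time.perf_counter()
    for path in paths:
        prefix_hash(path, max_bytes)
    return time.perf_counter() - start
```

Comparing `time_hashing(paths, 1024)` against `time_hashing(paths, 100 * 1024 * 1024)` on the real input set would show whether the prefix size matters, or whether open/close overhead dominates as suspected. Filesystem caching can distort repeated timings, so the first run of each is the most telling.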

I like moving to explicitly specifying the input files (does it support glob strings)?

Not currently.

Originally it was just directories, but this logic branch was added to support adding specific filepaths

https://github.com/payu-org/payu/blob/master/payu/models/model.py#L277-L285

(Note that it is slightly weird, building a mock iterator so that it can reuse the main code loop below)

I don't think it would be difficult to invert the logic, test for a directory and otherwise assume a glob and populate a list of files rather than a single file.
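A rough sketch of that inverted logic follows. It is not the code in `model.py`: `expand_input` is a hypothetical helper, and listing only the top level of a directory is an assumption made to keep the example short.

```python
import glob
import os


def expand_input(entry):
    """Return the sorted list of files for one input entry.

    A directory contributes the regular files directly inside it.
    Anything else is treated as a glob pattern; a plain file path is
    simply a pattern that matches itself.
    """
    if os.path.isdir(entry):
        candidates = [os.path.join(entry, name) for name in os.listdir(entry)]
    else:
        candidates = glob.glob(entry)
    # Drop sub-directories and anything else that is not a regular file.
    return sorted(p for p in candidates if os.path.isfile(p))
```

With this shape the caller always receives a list, so a single file, a directory and a pattern such as `forcing/met_*.nc` can all go through the same loop. One caveat is that a literal filename containing glob characters such as `[` would need escaping with `glob.escape`.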

If you think that is useful functionality, it is probably best to create a specific issue for it and link back to this one.

In the meantime you could emulate this functionality with symbolic links: create some directories that group your inputs in some way and make symbolic links in the sub-dirs. That way you can select out just the inputs you need.
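One way to set up such a directory of links is sketched below. The `link_inputs` helper and the example paths are hypothetical, and the sketch assumes the platform allows creating symbolic links.

```python
import glob
import os


def link_inputs(pattern, link_dir):
    """Symlink every file matching pattern into link_dir; return the links."""
    os.makedirs(link_dir, exist_ok=True)
    links = []
    for target in sorted(glob.glob(pattern)):
        link = os.path.join(link_dir, os.path.basename(target))
        # Leave existing links alone so the helper can be re-run safely.
        if not os.path.lexists(link):
            os.symlink(os.path.abspath(target), link)
        links.append(link)
    return links
```

For example, `link_inputs("/path/to/met/met_199*.nc", "inputs/met_1990s")` would group one decade of forcing files, and the resulting directory could then be listed in the input section of config.yaml in place of the full dataset directory.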
