-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
overwallclock fix #2041
overwallclock fix #2041
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #2041 +/- ##
==========================================
+ Coverage 48.93% 49.80% +0.87%
==========================================
Files 72 72
Lines 17543 17569 +26
Branches 3416 3418 +2
==========================================
+ Hits 8584 8751 +167
+ Misses 8154 8007 -147
- Partials 805 811 +6
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without tests I think it'd need some more thorough review. I started reading parts of the change (really liked moving the logic of dates for jobs to the Job
class, seems to clean the code a bit 🥇), but reading it quickly I found at least one place that looks like it could raise a TypeError
(see comment).
Hello @kinow I did a manual test ( was preparing a yaml file to pass to you) with wrappers and it fails because the wallclock is not stored in the wrappers db.. preparing a fix Then I'll revise the feedback you mentioned |
Also, I think it doesn't only affect this, but also the is_wrapper value. I believe that experiments with wrappers wouldn't be able to do: module load autosubmit/4.1.11 |
I don't have time to write the pytests ( in fact, I'm already in over hours ) Neither the changelog ( apologies for that, this issue was unexpected) @kinow , I propose two options:
To test it
CONFIG:
# Current version of autosubmit.
AUTOSUBMIT_VERSION: "4.1.12"
# Maximum number of jobs permitted in the waiting status.
MAXWAITINGJOBS: 20
# Total number of jobs in the workflow.
TOTALJOBS: 20
SAFETYSLEEPTIME: 0
MAIL:
NOTIFICATIONS: False
TO:
STORAGE:
TYPE: pkl
COPY_REMOTE_LOGS: true
EXPERIMENT:
DATELIST: 20221101
MEMBERS: fc0
CHUNKSIZEUNIT: month
CHUNKSIZE: 2
NUMCHUNKS: 3
CHUNKINI: ''
CALENDAR: standard
PROJECT:
# Type of the project.
PROJECT_TYPE: none
# Folder to hold the project sources.
PROJECT_DESTINATION: git_project
GIT:
PROJECT_ORIGIN: "blabla"
PROJECT_BRANCH: "blabla"
PROJECT_COMMIT: ''
PROJECT_SUBMODULES: ''
FETCH_SINGLE_BRANCH: true
PLATFORMS:
MARENOSTRUM5:
TYPE: slurm
HOST: glogin1.bsc.es
PROJECT: bsc32
USER: "FILL"
QUEUE: gp_debug
SCRATCH_DIR: /gpfs/scratch
ADD_PROJECT_TO_HOST: false
MAX_WALLCLOCK: 02:00
TEMP_DIR: ''
MAX_PROCESSORS: 99999
PROCESSORS_PER_NODE: 112
MARENOSTRUM5-login:
TYPE: ps
HOST: glogin2.bsc.es
PROJECT: bsc32
USER: "FILL"
SCRATCH_DIR: /gpfs/scratch
ADD_PROJECT_TO_HOST: false
MAX_WALLCLOCK: 02:00
TEMP_DIR: ''
MAX_PROCESSORS: 99999
JOBS:
mn5-ps:
SCRIPT: |
sleep 180
echo "Hello World with id=Success" # this is not in the log. # Complete file doesn't exists.
RUNNING: once
wallclock: 00:01 # should be 78 seconds
PLATFORM: MARENOSTRUM5-login
mn5-slurm:
SCRIPT: |
sleep 180
echo "Hello World with id=Success" # this is not in the log.
RUNNING: once
wallclock: 00:01 # should be 78 seconds == self._wallclock_in_seconds == 78s
PLATFORM: MARENOSTRUM5
local:
SCRIPT: |
sleep 180
echo "Hello World with id=Success" # this is not in the log.
RUNNING: once
wallclock: 00:01 # should be 78 seconds == self._wallclock_in_seconds == 78s
PLATFORM: LOCAL
mn5wrapper:
SCRIPT: |
sleep 180
echo "Hello World with id=Success" # this is not in the log.
DEPENDENCIES:
mn5wrapper-1:
RUNNING: chunk
wallclock: 00:01 # should be 78(seconds)*·3(chunks) == self._wallclock_in_seconds == 234s
PLATFORM: MARENOSTRUM5
wrappers:
mn5wrapper:
JOBS_IN_WRAPPER: mn5wrapper
TYPE: vertical
ABOUT THE WRAPPERS: There was an additional issue: whenever I stopped and rerun with an auto-submit run -> auto-submit run. The wallclock value was not stored, so the self._wallclock_in_seconds was reset to 0. So, to test that part, you need to stop the experiment once the wrapper job is running and then run again. There was another issue related to the packed value not being reset on a reload, so the preview (-cw) was not working well after a wrapper was run once. |
Btw I think that the wrapper issue could be related to #2027 |
About this, I had to add a new field to the wrapper database. This database is re-created on each run, so the only thing not compatible is the run -> run -v. I've also left comments in the last commits |
Hello @kinow Did you review this one? #2041 (comment) . I did a question there although it might be obsolete now |
Not yet, @dbeltrankyl . With this one the failure happened in the testing suite only because of slowness in MN5's GPFS, and the experiment stopped. The main issue was the remote task continued running (sometimes to completion depending on the remote wallclock, I think?). So this issue seemed to have lower priority than the one in autosubmit-config-parser for @franra9 's apps workflows. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had left a few files pending review the last time, and stopped after sending the last feedback.
This time I finished reading the whole code. Left some comments/questions, but the branch also seems to have a Git conflict and the coverage report (last comment of the bot, at least) is pointing to some parts of the code not tested, @dbeltrankyl .
Normally that's OK, but if it's not too hard to test those lines too, then it'd reduce the risk for merging it prior to the d-suite 💪
Cheers
07ddb52
to
f9deab8
Compare
Rebased, feedback commented. Waiting for coverage and write test |
Hello @kinow I've added some tests. I had to rework the function that checks the status of the wrapper after a load because it was not right ( discovered after doing the test to test the wallclock part) I also added a function to properly test the overclock function. There is uncovered stuff related to the calls to that function because it is in a spot hard to easily test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dbeltrankyl I think there are some conflicts in Git here.. can you have a look whenever you have time, please?
Done |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
14 out of 20 files reviewed. Will try to finish it tomorrow. Submitting the review of what I have now. Thanks @dbeltrankyl for adding new configs, adjusting the code, adding docs, and for the many new tests you added over past days!
Thanks Bruno! Feedback addressed I believe |
Ok, resolved all discussions here, synced the code, and finished replying all my pending notifications (slack/email/github/mattermost). Now will eat something, and then focus on the pending code to review here, before a final test (with a simple experiment + ps platform and create+inspect to confirm before I got wallclock*60, with the branch I get the wallclock seconds as expected). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few questions, nothing major, @dbeltrankyl . Dani, I checked if the option to rebase here was available as Irene's PR for notifications was just merged, and it's complaining that this PR cannot be rebased.
I'm afraid you may have one more rebase/conflicts to fix 😬
I can check this again, and then do the final test with create
+ inspect
, but this looks pretty much ready for the testing suite + d-suite tests 🎉 👏 Thanks Dani!!!
Thanks @kinow Feedback resolved |
Left one question about an unused import, but didn't find anything else to comment in the code 🎉 going to run some autosubmit create + inspect now, to verify the issue is indeed fixed, and check coverage. Thanks Dani! |
Using master first. Created a dummy experiment, changed to project type, set some dummy scripts, a ps platform pointing to localhost, and then used wallclock of 02:00 in LOCAL_SETUP. Created and inspected, all good, then run shows the timeout command in aslogs run log. 2025-01-23 09:16:38,619 Submitting job with the command: timeout 432000 nohup bash /tmp/bdepaula/a00e/LOG_a00e/a00e_LOCAL_SETUP.cmd > /tmp/bdepaula/a00e/LOG_a00e/a00e_LOCAL_SETUP.cmd.out.0 2> /tmp/bdepaula/a00e/LOG_a00e/a00e_LOCAL_SETUP.cmd.err.0 & echo $!
2025-01-23 09:16:38,621 Job: a00e_LOCAL_SETUP submitted with job_id: 12637 So two hours became 432000 seconds, instead of 7200. Using this branch. I tried [WARNING] Structures folder not found /home/bdepaula/autosubmit/metadata/structures
Looking for edgeless jobs...
Starting to generate cmd scripts
[WARNING] There is a .lock file and not -f, generating only all unsubmitted cmd scripts
[WARNING] Generating the auxiliary job_list used for the -CW flag.
[WARNING] Aux Job_list was generated successfully
Calculating possible ready jobs for FAKE-LOCAL
[CRITICAL] Unexpected error: 'Job' object has no attribute '_wallclock_in_seconds'.
Please report it to Autosubmit Developers through Git
More info at https://autosubmit.readthedocs.io/en/master/troubleshooting/error-codes.html I think users would have to Did an This time I got quite close to what I expected, but exactly what's expected: 2025-01-23 09:29:08,526 Submitting job with the command: timeout 9360 nohup bash /tmp/bdepaula/a00e/LOG_a00e/a00e_LOCAL_SETUP.cmd > /tmp/bdepaula/a00e/LOG_a00e/a00e_LOCAL_SETUP.cmd.out.0 2> /tmp/bdepaula/a00e/LOG_a00e/a00e_LOCAL_SETUP.cmd.err.0 & echo $!
2025-01-23 09:29:08,528 Job: a00e_LOCAL_SETUP submitted with job_id: 13817 I expected 7200, but in reality we add 30% more in Autosubmit. So 7200 * 1.3, 9360. 👌 |
if not hasattr(self, "_wallclock_in_seconds"): # Added in 4.1.12
self._wallclock_in_seconds = None
self.wallclock = self.wallclock # also sets the wallclock in seconds
Added this, the self.wallclock = self.wallclock is to set the wallclock_in_seconds to the expected value and the first none is because if not the assign complains that the wallclock_in_seconds doesn't exists. |
The only "issue" is that my pycharm complains that it is a useless self-assignment, but it is not |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lines not covered do not appear to be critical. Code looks good, tested and worked well. New code looks good and appears to fix the issue when the variable does not exist when loaded from job_list/disk. Great job, Dani!
Fixes:
Issues detected:
Work done:
Todo: