Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

overwallclock fix #2041

Merged
merged 36 commits into from
Jan 23, 2025
Merged

overwallclock fix #2041

merged 36 commits into from
Jan 23, 2025

Conversation

dbeltrankyl
Copy link
Contributor

Fixes:

  • This aims to fix the issue with the 4.1.12-dev not being able to cancel jobs that end by wallclock.

Issues detected:

  • (is_overwallclock) Depending on what was set in the wallclock ( 02:00 (hours) or 00:20 (minutes) it was wrongly multiplying the value * 60
  • timeout was always being multiplying by 60

Work done:

  • Added a few TODO to be done in another branch so this is not too big and is not related to the main issue
  • Removed some duplicated functions and merged them to the job class.
  • Changed the self.is_overwallclock call
  • Changed the parse_time call to not give the format anymore as it is not needed and leads to bugs
  • Added self.wallclock_in_seconds property
  • Added few warnings
  • Added that in case that the wallclock *1.3 > max_wallclock, set the wallclock to the max_wallclock

Todo:

  • [ ] Pytests

@dbeltrankyl dbeltrankyl requested a review from kinow December 18, 2024 12:24
@dbeltrankyl dbeltrankyl self-assigned this Dec 18, 2024
@codecov-commenter
Copy link

codecov-commenter commented Dec 18, 2024

Codecov Report

Attention: Patch coverage is 93.06931% with 7 lines in your changes missing coverage. Please review.

Project coverage is 49.80%. Comparing base (81f4079) to head (075fc7f).
Report is 1 commits behind head on master.

Files with missing lines Patch % Lines
autosubmit/autosubmit.py 84.21% 1 Missing and 2 partials ⚠️
autosubmit/platforms/paramiko_platform.py 88.88% 2 Missing ⚠️
autosubmit/job/job.py 97.56% 0 Missing and 1 partial ⚠️
autosubmit/job/job_packages.py 50.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2041      +/-   ##
==========================================
+ Coverage   48.93%   49.80%   +0.87%     
==========================================
  Files          72       72              
  Lines       17543    17569      +26     
  Branches     3416     3418       +2     
==========================================
+ Hits         8584     8751     +167     
+ Misses       8154     8007     -147     
- Partials      805      811       +6     
Flag Coverage Δ
fast-tests 49.80% <93.06%> (+0.87%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@dbeltrankyl dbeltrankyl mentioned this pull request Dec 18, 2024
2 tasks
Copy link
Member

@kinow kinow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without tests I think it'd need some more thorough review. I started reading parts of the change (really liked moving the logic of dates for jobs to the Job class, seems to clean the code a bit 🥇), but reading it quickly I found at least one place that looks like it could raise a TypeError (see comment).

autosubmit/platforms/psplatform.py Show resolved Hide resolved
autosubmit/job/job.py Show resolved Hide resolved
@dbeltrankyl
Copy link
Contributor Author

Hello @kinow

I did a manual test ( was preparing a yaml file to pass to you) with wrappers and it fails because the wallclock is not stored in the wrappers db.. preparing a fix

Then I'll revise the feedback you mentioned

@dbeltrankyl
Copy link
Contributor Author

Also, I think it doesn't only affect this, but also the is_wrapper value.

I believe that experiments with wrappers wouldn't be able to do:

module load autosubmit/4.1.11
run $expid
module load autosubmit/4.1.12
run $expid -v

@dbeltrankyl
Copy link
Contributor Author

I don't have time to write the pytests ( in fact, I'm already in over hours )

Neither the changelog ( apologies for that, this issue was unexpected)

@kinow , I propose two options:

  • Deploy the beta version with this without pytests ( after a manual check )

To test it

  • minimal.yml with project_type to none

  • test.yml

CONFIG:
  # Current version of autosubmit.
  AUTOSUBMIT_VERSION: "4.1.12"
  # Maximum number of jobs permitted in the waiting status.
  MAXWAITINGJOBS: 20
  # Total number of jobs in the workflow.
  TOTALJOBS: 20
  SAFETYSLEEPTIME: 0
MAIL:
  NOTIFICATIONS: False
  TO:
STORAGE:
  TYPE: pkl
  COPY_REMOTE_LOGS: true

EXPERIMENT:
  DATELIST: 20221101
  MEMBERS: fc0
  CHUNKSIZEUNIT: month
  CHUNKSIZE: 2
  NUMCHUNKS: 3
  CHUNKINI: ''
  CALENDAR: standard
PROJECT:
  # Type of the project.
  PROJECT_TYPE: none
  # Folder to hold the project sources.
  PROJECT_DESTINATION: git_project
GIT:
  PROJECT_ORIGIN: "blabla"
  PROJECT_BRANCH: "blabla"
  PROJECT_COMMIT: ''
  PROJECT_SUBMODULES: ''
  FETCH_SINGLE_BRANCH: true
PLATFORMS:
  MARENOSTRUM5:
    TYPE: slurm
    HOST: glogin1.bsc.es
    PROJECT: bsc32
    USER: "FILL"
    QUEUE: gp_debug
    SCRATCH_DIR: /gpfs/scratch
    ADD_PROJECT_TO_HOST: false
    MAX_WALLCLOCK: 02:00
    TEMP_DIR: ''
    MAX_PROCESSORS: 99999
    PROCESSORS_PER_NODE: 112
  MARENOSTRUM5-login:
    TYPE: ps
    HOST: glogin2.bsc.es
    PROJECT: bsc32
    USER: "FILL"
    SCRATCH_DIR: /gpfs/scratch
    ADD_PROJECT_TO_HOST: false
    MAX_WALLCLOCK: 02:00
    TEMP_DIR: ''
    MAX_PROCESSORS: 99999
  
JOBS:
  mn5-ps:
    SCRIPT: |
      sleep 180
      echo "Hello World with id=Success" # this is not in the log. # Complete file doesn't exists.
    RUNNING: once
    wallclock: 00:01 # should be 78 seconds
    PLATFORM: MARENOSTRUM5-login

  mn5-slurm:
    SCRIPT: |
      sleep 180
      echo "Hello World with id=Success" # this is not in the log.
    RUNNING: once
    wallclock: 00:01 # should be 78 seconds  == self._wallclock_in_seconds == 78s
    PLATFORM: MARENOSTRUM5

  local:
    SCRIPT: |
      sleep 180
      echo "Hello World with id=Success" # this is not in the log.
    RUNNING: once
    wallclock: 00:01 # should be 78 seconds  == self._wallclock_in_seconds == 78s
    PLATFORM: LOCAL

  mn5wrapper:
    SCRIPT: |
      sleep 180 
      echo "Hello World with id=Success" # this is not in the log.
    DEPENDENCIES:
      mn5wrapper-1:
    RUNNING: chunk
    wallclock: 00:01 # should be 78(seconds)*·3(chunks) == self._wallclock_in_seconds == 234s
    PLATFORM: MARENOSTRUM5

wrappers:
  mn5wrapper:
    JOBS_IN_WRAPPER: mn5wrapper
    TYPE: vertical
  • Another person writes the pytest for this and then merges it

ABOUT THE WRAPPERS:

There was an additional issue: whenever I stopped and rerun with an auto-submit run -> auto-submit run. The wallclock value was not stored, so the self._wallclock_in_seconds was reset to 0.

So, to test that part, you need to stop the experiment once the wrapper job is running and then run again.

There was another issue related to the packed value not being reset on a reload, so the preview (-cw) was not working well after a wrapper was run once.

@dbeltrankyl
Copy link
Contributor Author

Btw I think that the wrapper issue could be related to #2027

@dbeltrankyl
Copy link
Contributor Author

Also, I think it doesn't only affect this, but also the is_wrapper value.

I believe that experiments with wrappers wouldn't be able to do:

module load autosubmit/4.1.11 run $expid module load autosubmit/4.1.12 run $expid -v

About this, I had to add a new field to the wrapper database.

This database is re-created on each run, so the only thing not compatible is the run -> run -v.

I've also left comments in the last commits

@dbeltrankyl
Copy link
Contributor Author

dbeltrankyl commented Jan 7, 2025

Hello @kinow

Did you review this one? #2041 (comment) . I did a question there although it might be obsolete now

@kinow
Copy link
Member

kinow commented Jan 7, 2025

Not yet, @dbeltrankyl . With this one the failure happened in the testing suite only because of slowness in MN5's GPFS, and the experiment stopped. The main issue was the remote task continued running (sometimes to completion depending on the remote wallclock, I think?). So this issue seemed to have lower priority than the one in autosubmit-config-parser for @franra9 's apps workflows.

Copy link
Member

@kinow kinow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had left a few files pending review the last time, and stopped after sending the last feedback.

This time I finished reading the whole code. Left some comments/questions, but the branch also seems to have a Git conflict and the coverage report (last comment of the bot, at least) is pointing to some parts of the code not tested, @dbeltrankyl .

Normally that's OK, but if it's not too hard to test those lines too, then it'd reduce the risk for merging it prior to the d-suite 💪

Cheers

autosubmit/autosubmit.py Outdated Show resolved Hide resolved
autosubmit/autosubmit.py Outdated Show resolved Hide resolved
autosubmit/autosubmit.py Outdated Show resolved Hide resolved
autosubmit/job/job.py Outdated Show resolved Hide resolved
autosubmit/job/job.py Outdated Show resolved Hide resolved
autosubmit/job/job.py Outdated Show resolved Hide resolved
autosubmit/job/job.py Show resolved Hide resolved
autosubmit/job/job.py Show resolved Hide resolved
autosubmit/job/job.py Outdated Show resolved Hide resolved
@dbeltrankyl dbeltrankyl force-pushed the 4.1.12-overwallclock-fix branch from 07ddb52 to f9deab8 Compare January 15, 2025 10:50
@dbeltrankyl
Copy link
Contributor Author

Rebased, feedback commented. Waiting for coverage and write test

@dbeltrankyl
Copy link
Contributor Author

Hello @kinow

I've added some tests.

I had to rework the function that checks the status of the wrapper after a load because it was not right ( discovered after doing the test to test the wallclock part)

I also added a function to properly test the overclock function. There is uncovered stuff related to the calls to that function because it is in a spot hard to easily test

Copy link
Member

@kinow kinow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dbeltrankyl I think there are some conflicts in Git here.. can you have a look whenever you have time, please?

@dbeltrankyl
Copy link
Contributor Author

Done

Copy link
Member

@kinow kinow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

14 out of 20 files reviewed. Will try to finish it tomorrow. Submitting the review of what I have now. Thanks @dbeltrankyl for adding new configs, adjusting the code, adding docs, and for the many new tests you added over past days!

test/unit/test_paramiko_platform.py Outdated Show resolved Hide resolved
test/unit/test_overwallclock.py Outdated Show resolved Hide resolved
autosubmit/job/job_package_persistence.py Show resolved Hide resolved
autosubmit/job/job_package_persistence.py Outdated Show resolved Hide resolved
autosubmit/job/job_package_persistence.py Outdated Show resolved Hide resolved
autosubmit/platforms/paramiko_platform.py Show resolved Hide resolved
autosubmit/platforms/paramiko_platform.py Outdated Show resolved Hide resolved
@dbeltrankyl
Copy link
Contributor Author

Thanks Bruno!

Feedback addressed I believe

@kinow
Copy link
Member

kinow commented Jan 22, 2025

Ok, resolved all discussions here, synced the code, and finished replying all my pending notifications (slack/email/github/mattermost). Now will eat something, and then focus on the pending code to review here, before a final test (with a simple experiment + ps platform and create+inspect to confirm before I got wallclock*60, with the branch I get the wallclock seconds as expected).

Copy link
Member

@kinow kinow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few questions, nothing major, @dbeltrankyl . Dani, I checked if the option to rebase here was available as Irene's PR for notifications was just merged, and it's complaining that this PR cannot be rebased.

I'm afraid you may have one more rebase/conflicts to fix 😬

I can check this again, and then do the final test with create + inspect, but this looks pretty much ready for the testing suite + d-suite tests 🎉 👏 Thanks Dani!!!

autosubmit/platforms/paramiko_platform.py Outdated Show resolved Hide resolved
autosubmit/job/job.py Outdated Show resolved Hide resolved
test/unit/test_overwallclock.py Outdated Show resolved Hide resolved
test/unit/test_overwallclock.py Show resolved Hide resolved
@dbeltrankyl
Copy link
Contributor Author

Thanks @kinow

Feedback resolved

@kinow
Copy link
Member

kinow commented Jan 23, 2025

Left one question about an unused import, but didn't find anything else to comment in the code 🎉 going to run some autosubmit create + inspect now, to verify the issue is indeed fixed, and check coverage. Thanks Dani!

@kinow
Copy link
Member

kinow commented Jan 23, 2025

Using master first.

Created a dummy experiment, changed to project type, set some dummy scripts, a ps platform pointing to localhost, and then used wallclock of 02:00 in LOCAL_SETUP.

Created and inspected, all good, then run shows the timeout command in aslogs run log.

2025-01-23 09:16:38,619 Submitting job with the command: timeout 432000 nohup bash /tmp/bdepaula/a00e/LOG_a00e/a00e_LOCAL_SETUP.cmd > /tmp/bdepaula/a00e/LOG_a00e/a00e_LOCAL_SETUP.cmd.out.0 2> /tmp/bdepaula/a00e/LOG_a00e/a00e_LOCAL_SETUP.cmd.err.0 & echo $!
2025-01-23 09:16:38,621 Job: a00e_LOCAL_SETUP submitted with job_id: 12637

So two hours became 432000 seconds, instead of 7200.

Using this branch.

I tried create & inspect without the -v, and got

[WARNING] Structures folder not found /home/bdepaula/autosubmit/metadata/structures
Looking for edgeless jobs...
Starting to generate cmd scripts
[WARNING] There is a .lock file and not -f, generating only all unsubmitted cmd scripts
[WARNING] Generating the auxiliary job_list used for the -CW flag.
[WARNING] Aux Job_list was generated successfully
Calculating possible ready jobs for FAKE-LOCAL
 [CRITICAL] Unexpected error: 'Job' object has no attribute '_wallclock_in_seconds'.
 Please report it to Autosubmit Developers through Git
More info at https://autosubmit.readthedocs.io/en/master/troubleshooting/error-codes.html

I think users would have to autosubmit create -f. I think my feedback from yesterday didn't quite work -- your code had a protection against this in ... adjust parameters I think? I think we will have to have at least a self._wallclock_in_seconds = None there...

Did an autosubmit create -f, inspect, and then run.

This time I got quite close to what I expected, but exactly what's expected:

2025-01-23 09:29:08,526 Submitting job with the command: timeout 9360 nohup bash /tmp/bdepaula/a00e/LOG_a00e/a00e_LOCAL_SETUP.cmd > /tmp/bdepaula/a00e/LOG_a00e/a00e_LOCAL_SETUP.cmd.out.0 2> /tmp/bdepaula/a00e/LOG_a00e/a00e_LOCAL_SETUP.cmd.err.0 & echo $!
2025-01-23 09:29:08,528 Job: a00e_LOCAL_SETUP submitted with job_id: 13817

I expected 7200, but in reality we add 30% more in Autosubmit. So 7200 * 1.3, 9360. 👌

@dbeltrankyl
Copy link
Contributor Author

I think users would have to autosubmit create -f. I think my feedback from yesterday didn't quite work -- your code had a protection against this in ... adjust parameters I think? I think we will have to have at least a self._wallclock_in_seconds = None there...

 
        if not hasattr(self, "_wallclock_in_seconds"):  # Added in 4.1.12
            self._wallclock_in_seconds = None
            self.wallclock = self.wallclock # also sets the wallclock in seconds

Added this, the self.wallclock = self.wallclock is to set the wallclock_in_seconds to the expected value and the first none is because if not the assign complains that the wallclock_in_seconds doesn't exists.

@dbeltrankyl
Copy link
Contributor Author

I think users would have to autosubmit create -f. I think my feedback from yesterday didn't quite work -- your code had a protection against this in ... adjust parameters I think? I think we will have to have at least a self._wallclock_in_seconds = None there...

 
        if not hasattr(self, "_wallclock_in_seconds"):  # Added in 4.1.12
            self._wallclock_in_seconds = None
            self.wallclock = self.wallclock # also sets the wallclock in seconds

Added this, the self.wallclock = self.wallclock is to set the wallclock_in_seconds to the expected value and the first none is because if not the assign complains that the wallclock_in_seconds doesn't exists.

The only "issue" is that my pycharm complains that it is a useless self-assignment, but it is not

Copy link
Member

@kinow kinow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lines not covered do not appear to be critical. Code looks good, tested and worked well. New code looks good and appears to fix the issue when the variable does not exist when loaded from job_list/disk. Great job, Dani!

@dbeltrankyl dbeltrankyl merged commit 0f029bd into master Jan 23, 2025
8 checks passed
@kinow kinow deleted the 4.1.12-overwallclock-fix branch January 23, 2025 14:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants