
SLURM issues with latest version #197

Closed
snsansom opened this issue Jan 14, 2025 · 15 comments

@snsansom
Contributor

snsansom commented Jan 14, 2025

Hi Adam,

On the BMRC cluster, users in my group and other groups at the KIR are getting errors like the one below with the latest version.

To fix this we are checking out the last known good version with git checkout e033b36.

Can you please explain what is happening with this repo?

Why is there all this recent development activity directly on master?

What is the goal of the current work?

Has it been discussed with other cgat-core users?

Could work instead be done on a branch and tested, including by others, before being committed to master?

Very much appreciate your ongoing work to maintain and improve this code which we rely on every day.

All best,
S.

```
--- Logging error ---
Traceback (most recent call last):
  File "/apps/eb/2022b/skylake/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/logging/__init__.py", line 440, in format
    return self._format(record)
  File "/apps/eb/2022b/skylake/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/logging/__init__.py", line 436, in _format
    return self._fmt % values
KeyError: 'app_name'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/apps/eb/2022b/skylake/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/logging/__init__.py", line 1100, in emit
    msg = self.format(record)
  File "/apps/eb/2022b/skylake/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/logging/__init__.py", line 943, in format
    return fmt.format(record)
  File "/gpfs3/well/sansom/users/odq545/devel/cgat-core/cgatcore/experiment.py", line 647, in format
    s = logging.Formatter.format(self, record)
  File "/apps/eb/2022b/skylake/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/logging/__init__.py", line 681, in format
    s = self.formatMessage(record)
  File "/apps/eb/2022b/skylake/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/logging/__init__.py", line 650, in formatMessage
    return self._style.format(record)
  File "/apps/eb/2022b/skylake/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/logging/__init__.py", line 442, in format
    raise ValueError('Formatting field not found in record: %s' % e)
ValueError: Formatting field not found in record: 'app_name'
Call stack:
  File "/users/sansom/odq545/devel/venvs/python-3.10.8-GCCcore-12.2.0-skylake-cgat/bin/cellhub", line 33, in <module>
    sys.exit(load_entry_point('cellhub', 'console_scripts', 'cellhub')())
  File "/gpfs3/well/sansom/users/odq545/devel/cellhub/cellhub/entry.py", line 104, in main
    module.main(sys.argv)
  File "/gpfs3/well/sansom/users/odq545/devel/cellhub/cellhub/pipeline_cellbender.py", line 353, in main
    P.main(argv)
  File "/gpfs3/well/sansom/users/odq545/devel/cgat-core/cgatcore/pipeline/control.py", line 1446, in main
    run_workflow(args)
  File "/gpfs3/well/sansom/users/odq545/devel/cgat-core/cgatcore/pipeline/control.py", line 1300, in run_workflow
    ruffus.pipeline_run(
  File "/users/sansom/odq545/devel/venvs/python-3.10.8-GCCcore-12.2.0-skylake-cgat/lib/python3.10/site-packages/ruffus/task.py", line 5252, in pipeline_run
    job_result = next(ii)
  File "/users/sansom/odq545/devel/venvs/python-3.10.8-GCCcore-12.2.0-skylake-cgat/lib/python3.10/site-packages/ruffus/task.py", line 712, in run_pooled_job_without_exceptions
    return_value = job_wrapper(params, user_defined_work_func,
  File "/users/sansom/odq545/devel/venvs/python-3.10.8-GCCcore-12.2.0-skylake-cgat/lib/python3.10/site-packages/ruffus/task.py", line 545, in job_wrapper_io_files
    ret_val = user_defined_work_func(*params)
  File "/gpfs3/well/sansom/users/odq545/devel/cellhub/cellhub/pipeline_cellbender.py", line 255, in mtx
    P.run(statements, **t.resources)
  File "/gpfs3/well/sansom/users/odq545/devel/cgat-core/cgatcore/pipeline/execution.py", line 1530, in run
    benchmark_data = e.run(statement_list)
  File "/gpfs3/well/sansom/users/odq545/devel/cgat-core/cgatcore/pipeline/executors.py", line 108, in run
    self.logger.info(f"Running statement on Slurm: {statement}")
Message: 'Running statement on Slurm: python /gpfs3/well/sansom/users/odq545/devel/cellhub/python/cellbender_export_mtx.py
```

@Acribbs
Contributor

Acribbs commented Jan 14, 2025

At the moment I am the only one maintaining cgatcore and adding functionality to it; everyone else seems to have given up on it and is only interested in using it, not maintaining it.

cgat-core is far behind the state of the art and only really functions for older HPC-type clusters or local workflows; for wider utility it needs cloud functionality, better support for Docker, etc. I get asked about this a lot by external groups and companies that use ruffus but cannot use cgatcore because of its outdated functionality, and as a result many have stopped using ruffus in favour of tools that do offer this functionality.

The latest code changes aim to significantly enhance the functionality of cgatcore, particularly in its handling of HPC environments. Until about a month ago, the tests did not cover HPC features, leaving a critical gap in validation. To address this, I have invested considerable effort in mocking components of the code, ensuring that tests for Slurm and SGE executors are now available. However, mocking alone has its limitations, as not all aspects of the system can be effectively simulated. As a result, the full functionality of the code can only be thoroughly tested on actual HPC systems, such as those provided by CCB or the AWS cloud, since I no longer utilise BMRC.

Currently everything is developed on a branch and only merged to master after branching; testing is additionally performed on both the branch and master, and no direct merging has been done: see https://github.com/cgat-developers/cgat-core/actions. In fact we have branch protection rules, so code cannot be committed to master unless it has passed the tests first.

I also ask for additional testing from other users periodically when big code changes are undertaken, but currently everyone seems to ignore me apart from Nick: see #182.

The biggest problem I have found with cgatcore at the moment is that the testing framework is wholly inadequate; only a fraction of the code is covered. I spent a lot of time implementing testing, but things are obviously going to slip through the net occasionally.

The issue you report above looks like it's to do with these lines (720-723):

`logformat = '# %(asctime)s %(levelname)s %(message)s'`

The logging format was trying to access an app_name field that wasn't set in the log record. Changing this to `logformat = '# %(asctime)s %(message)s'` should solve the issue. When making the change, if you could include unit tests for this, it will mean we can catch it next time.
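
For what it's worth, something like the following would be enough to catch this kind of regression (a minimal sketch; the test name is illustrative, and the format string is pasted in here rather than imported from cgat-core):

```python
import logging

def test_logformat_handles_plain_records():
    # Format string under test; in a real test this would be imported from cgat-core
    # rather than duplicated here (an assumption on my part).
    logformat = '# %(asctime)s %(levelname)s %(message)s'
    formatter = logging.Formatter(logformat)

    # A plain record, as produced by logging.getLogger(__name__).info(...) in executors.py,
    # i.e. without any custom fields such as 'app_name' injected by a filter.
    record = logging.LogRecord(
        name="cgatcore.pipeline.executors", level=logging.INFO,
        pathname=__file__, lineno=0,
        msg="Running statement on Slurm: echo test", args=(), exc_info=None)

    # If the format string referenced a missing field (e.g. %(app_name)s), format()
    # would raise ValueError("Formatting field not found in record: ...").
    assert "Running statement on Slurm" in formatter.format(record)
```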

@snsansom
Contributor Author

snsansom commented Jan 14, 2025

Many thanks for kindly explaining, Adam - this all makes sense. I'm not monitoring GitHub notifications, so I missed the request(s) to test, but otherwise I or someone from my group would have helped. Any chance of pinging us with a regular email in future for potentially breaking changes?

With respect to this issue, the app_name problem is to do with logger inheritance: the control.py module sets up a special logger with (a) a handler specifying a format that expects 'app_name' in the record and (b) a filter that adds 'app_name' to the record. Somehow, by instantiating the logger in the executors class with the arg `__name__`, the format is being inherited but not the filter that updates the records. This issue can be fixed by doing e.g. `getLogger("executor.slurm")` instead.
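
To illustrate the mechanism with a minimal standalone sketch (the logger names are just for illustration, not the actual cgat-core setup): a filter attached to a logger only runs for records logged directly through that logger, whereas handlers (and their formatters) are also reached by records propagating up from child loggers, so the format is "inherited" but the field-injecting filter is not.

```python
import logging

parent = logging.getLogger("cgatcore")  # stands in for the logger configured in control.py
parent.setLevel(logging.INFO)

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter('# %(asctime)s %(app_name)s %(message)s'))
parent.addHandler(handler)

class AddAppName(logging.Filter):
    def filter(self, record):
        record.app_name = "pipeline"  # injects the field the formatter expects
        return True

parent.addFilter(AddAppName())  # filter attached to the *logger*, not the handler

parent.info("fine")  # filter runs, 'app_name' is present, the record formats cleanly

logging.getLogger("cgatcore.executors").info("boom")
# This record propagates up to the parent's handler, but the parent logger's filter
# never runs for it, so formatting fails with "--- Logging error ---" /
# KeyError: 'app_name', exactly as in the traceback above. Attaching the filter to
# the handler instead (handler.addFilter(...)) would cover all child loggers.
```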

However, this was masking another issue with the SLURM executor, shown below. I suspect Nick missed this because of the previous issue where tasks were running on the head node with a local executor.

So it seems there is some work left to do to get execution working again on SLURM queues. If I get some time I will dig deeper and try to fix it.

```
Exception #1
'builtins.RuntimeError(Failed to get job status: sacct: fatal: Bad job/step specified: Submitted
)' raised in ...
Task = def pipeline_singleR.singleR(...):
Job = [.../mtx/matrix.mtx.gz -> singleR.dir/HumanPrimaryCellAtlasData/UV174_001_PBMC.sentinel]

Traceback (most recent call last):
  File "/users/sansom/odq545/devel/venvs/python-3.10.8-GCCcore-12.2.0-skylake-cgat/lib/python3.10/site-packages/ruffus/task.py", line 712, in run_pooled_job_without_exceptions
    return_value = job_wrapper(params, user_defined_work_func,
  File "/users/sansom/odq545/devel/venvs/python-3.10.8-GCCcore-12.2.0-skylake-cgat/lib/python3.10/site-packages/ruffus/task.py", line 545, in job_wrapper_io_files
    ret_val = user_defined_work_func(*params)
  File "/gpfs3/well/sansom/users/odq545/devel/cellhub/cellhub/pipeline_singleR.py", line 138, in singleR
    P.run(statement, **t.resources)
  File "/gpfs3/well/sansom/users/odq545/devel/cgat-core/cgatcore/pipeline/execution.py", line 1530, in run
    benchmark_data = e.run(statement_list)
  File "/gpfs3/well/sansom/users/odq545/devel/cgat-core/cgatcore/pipeline/executors.py", line 125, in run
    self.monitor_job_completion(job_id)
  File "/gpfs3/well/sansom/users/odq545/devel/cgat-core/cgatcore/pipeline/executors.py", line 151, in monitor_job_completion
    raise RuntimeError(f"Failed to get job status: {process.stderr}")
RuntimeError: Failed to get job status: sacct: fatal: Bad job/step specified: Submitted
```

@snsansom changed the title from "Issue with latest version" to "SLURM issues with latest version" on Jan 14, 2025
@snsansom
Contributor Author

snsansom commented Jan 15, 2025

Hi @Acribbs,

So I had a look now at the new executors (https://github.com/cgat-developers/cgat-core/blob/master/cgatcore/pipeline/executors.py#L96-L180) vs the previous code (https://github.com/cgat-developers/cgat-core/blob/master/cgatcore/pipeline/cluster.py#L356-L483).

It seems that the idea is to switch away from using the DRMAA API? Also there was useful code that Andreas wrote for collecting benchmarking data in the cluster.py SlurmCluster class. We have downstream tools that use that information to profile our pipelines.

For now, given that a significant proportion of the existing user base runs cgat-core on SLURM clusters (for example, at the KIR we have at least 12 active users across at least 6 research groups), I'm not sure it's great for us to be in a situation where (a) cgat-core no longer works on SLURM clusters and (b) we move away from DRMAA and the task benchmarking without a wider discussion?

One option could be for us to roll the AC-kubernetes merge back for now? My group would be very happy to help with testing that branch as SLURM compatibility is improved. I might also be able to arrange access to the BMRC cluster for you for testing purposes?

All best, S.

@snsansom
Contributor Author

I have also changed my notification settings, so I should now get email notifications for mentions on this repo. Thanks, S.

@nickilott

nickilott commented Jan 15, 2025

Hi both,

Just wanted to chip in here - I'm not sure what happened, Adam, but when I tested AC-kubernetes things seemed to work fine; now I am getting the same error as Steve in a clean venv with the same pipeline. Adding an apology as an edit :)

@nickilott

Given that the latest changes are now breaking for those of us using SLURM, @Acribbs, could we roll the changes back and make a plan going forward?

@Acribbs
Contributor

Acribbs commented Jan 15, 2025

Indeed, the aim of the new classes is to remove dependencies such as the DRMAA API and to make the code much easier to maintain and easier to debug, as it uses the same commands cluster admins would use.

The DRMAACluster class (and the conda dependency) is still present, but the plan was to remove it once the code is more stable.

These are very strange SLURM issues, and I suspect they are due to your cluster configuration or, more likely, to you using an old version of SLURM. The first one I fixed by removing the levelname, which avoids accessing the app_name (our SLURM will then not be affected). The second issue you raise is due to your SLURM software initiating a job and returning a message (likely "Submitted job_ID 726253"); when cgatcore picks this up it parses it and records the job as 'Submitted' rather than the job ID. I have attempted to fix this with #198.
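
For clarity, a rough sketch of the more defensive parsing I mean (not the actual executors.py code; the function name and the use of --parsable are assumptions): extract the numeric job id from whatever sbatch prints rather than taking the first whitespace-delimited token, which is the literal word "Submitted" on clusters where sbatch prints "Submitted batch job <id>".

```python
import re
import subprocess

def submit_slurm_job(script_path):
    """Hypothetical sketch: submit a job script with sbatch and return its numeric id."""
    # --parsable makes sbatch print just "<jobid>" (or "<jobid>;<cluster>"),
    # side-stepping the "Submitted batch job <id>" banner entirely.
    process = subprocess.run(["sbatch", "--parsable", script_path],
                             capture_output=True, text=True)
    if process.returncode != 0:
        raise RuntimeError(f"sbatch failed: {process.stderr}")

    # Defensive fallback: pull the first run of digits out of stdout, so a banner
    # like "Submitted batch job 726253" still yields "726253" and never "Submitted".
    match = re.search(r"\d+", process.stdout)
    if match is None:
        raise RuntimeError(f"could not parse job id from sbatch output: {process.stdout!r}")
    return match.group(0)
```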

Please could you take a look and test this, as reverting the code might break the HEAD state and be difficult to implement.

I suspected that we might see some teething issues with the code; that's why I haven't made a PyPI or conda release yet. I'm intrigued as to why everyone is working off the GitHub code for cgatcore - I usually instruct people not to use development code for pipeline work and to always use stable PyPI or conda releases.

@nickilott

Thanks Adam,

I will test this branch. Your point about not using the PyPI release of cgatcore is an interesting one - I have no answer, and now I don't know why :)

@snsansom
Contributor Author

snsansom commented Jan 15, 2025

Hi Adam, I'm interested in your comments on DRMAA. My understanding was that a very serious advantage is that it can wait for responses from the queue manager rather than having to repeatedly query for updates, and IIRC this also means that if you have lots of small jobs the pipeline can execute faster. Some of our pipelines submit hundreds or even thousands of jobs in parallel, and I am interested in what the impact of switching away from DRMAA would be in that scenario on a heavily subscribed queue manager such as the BMRC's. It was a battle and took many months to get cgat-core accepted and working smoothly on the BMRC, and I am wary of switching to a different mode of job management that could cause issues!

Any thoughts re the benchmarking data that was being collected previously? Will this be possible to implement similarly without using DRMAA?
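
For reference, a rough sketch of what I mean (a placeholder job, not cgat-core code): with the drmaa bindings the pipeline can block on the scheduler's response rather than polling sacct, and the JobInfo returned by wait() carries the exit status plus a resource-usage dictionary from which benchmarking data could be derived.

```python
import drmaa

with drmaa.Session() as session:
    jt = session.createJobTemplate()
    jt.remoteCommand = "/bin/sleep"  # placeholder command standing in for a pipeline task
    jt.args = ["60"]
    job_id = session.runJob(jt)

    # Blocks until the scheduler reports completion - no repeated sacct/squeue queries.
    retval = session.wait(job_id, drmaa.Session.TIMEOUT_WAIT_FOREVER)
    print(job_id, "exit status:", retval.exitStatus)
    print("resource usage:", retval.resourceUsage)  # contents vary by DRMAA implementation

    session.deleteJobTemplate(jt)
```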

Any chance we can keep both approaches and allow the user to configure which they want to use? I think this is what snakemake does?

@snsansom
Contributor Author

snsansom commented Jan 15, 2025

We will test the PR (many thanks for addressing the immediate concerns), but for us performance, compatibility with the BMRC and the ability to benchmark our pipeline tasks are also essential, so we will need some time to evaluate whether we can switch away from DRMAA going forward, and this will need to involve some discussion with the BMRC team.

@snsansom
Contributor Author

Also just making a note for myself of the comment here: https://snakemake.readthedocs.io/en/v6.3.0/tutorial/additional_features.html "If available, DRMAA is preferable over the generic cluster modes because it provides better control and error handling.".

We will also need to run some tests to compare job cancellation and error handling between the DRMAA approach and the sbatch/sacct wrapper implementation.

@nickilott

My immediate thought is that I like the idea of having the option of either approach.

@Acribbs
Contributor

Acribbs commented Jan 15, 2025

This is one of the reasons why I didn't remove the DRMAACluster class (and kept the whole of cluster.py), just in case we would need it. It's not difficult to support both, but the SLURM and SGE classes would need to be refactored. I would likely create a DRMAAExecutor class that they could inherit; that way we keep the structure of the Executor class (so it's easier to maintain) and get the benefits of both. Happy to do this.

I take your point about DRMAA having lower overhead with large numbers of jobs, although I think this wouldn't be an issue with fewer than 100 or so jobs. We did have issues with latency many years ago, but this was mainly because of low-gigabit network switches; most modern clusters should now have faster networking and so avoid this.

The main thing at the moment would be to test the configuration as it is now; the DRMAA class can then be added back in as an option: default to DRMAA if it is available, and work without it if not.
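
As a sketch of the shape I have in mind (the class and function names are illustrative, not the existing cgat-core API): keep the Executor-style interface, add a DRMAA-backed executor alongside the sbatch/sacct one, and choose between them at configuration time, defaulting to DRMAA when the bindings and libdrmaa are usable.

```python
class BaseExecutor:
    """Common interface; illustrative only."""
    def run(self, statement_list):
        raise NotImplementedError


class SlurmExecutor(BaseExecutor):
    """Submits via sbatch and monitors via sacct (the current approach)."""
    def run(self, statement_list):
        ...


class DRMAAExecutor(BaseExecutor):
    """Submits and waits via a DRMAA session, collecting resource usage for benchmarking."""
    def run(self, statement_list):
        ...


def make_executor(prefer_drmaa=True):
    """Default to DRMAA if it is usable on this cluster, otherwise fall back to sbatch/sacct."""
    if prefer_drmaa:
        try:
            import drmaa  # fails if the bindings or libdrmaa are not available
            return DRMAAExecutor()
        except (ImportError, RuntimeError, OSError):
            pass
    return SlurmExecutor()
```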

@snsansom
Contributor Author

Hi Adam, thanks - what you are suggesting, making a default DRMAAExecutor class, sounds perfect. Will let you know how the testing goes for #198 shortly.

@snsansom snsansom mentioned this issue Jan 16, 2025
@Acribbs
Contributor

Acribbs commented Jan 18, 2025

Reverted AC-kubernetes because of the issues in #198, made pull request #199 to reset the changes, and made the branch backup-branch-kubernetes to keep the history of the commits.

@Acribbs Acribbs closed this as completed Jan 18, 2025