Requesting memory issue #1584

francicco opened this issue Jan 7, 2025 · 3 comments

@francicco

Hi,

I'm getting errors regarding the memory I'm using/requesting with cactus.

I submitted a job with the following options:

#SBATCH --nodes=1
#SBATCH --mem=500G
#SBATCH --ntasks=50

and I'm executing cactus with these options:
--maxMemory 490G --binariesMode local --stats --logFile cactus.log --defaultCores 50 --defaultMemory 490G

... and I'm getting this error:

Traceback (most recent call last):
  File "/user/work/tk19812/software/cactus-bin-v2.9.3/venv-cactus-v2.9.3/bin/cactus", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/user/work/tk19812/software/cactus-bin-v2.9.3/venv-cactus-v2.9.3/lib/python3.12/site-packages/cactus/progressive/cactus_progressive.py", line 455, in main
    hal_id = toil.start(Job.wrapJobFn(progressive_workflow, options, config_node, mc_tree, og_map, input_seq_id_map))
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/user/work/tk19812/software/cactus-bin-v2.9.3/venv-cactus-v2.9.3/lib/python3.12/site-packages/toil/common.py", line 930, in start
    return self._runMainLoop(rootJobDescription)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/user/work/tk19812/software/cactus-bin-v2.9.3/venv-cactus-v2.9.3/lib/python3.12/site-packages/toil/common.py", line 1417, in _runMainLoop
    jobCache=self._jobCache).run()
                             ^^^^^
  File "/user/work/tk19812/software/cactus-bin-v2.9.3/venv-cactus-v2.9.3/lib/python3.12/site-packages/toil/leader.py", line 262, in run
    self.innerLoop()
  File "/user/work/tk19812/software/cactus-bin-v2.9.3/venv-cactus-v2.9.3/lib/python3.12/site-packages/toil/leader.py", line 765, in innerLoop
    self._processReadyJobs()
  File "/user/work/tk19812/software/cactus-bin-v2.9.3/venv-cactus-v2.9.3/lib/python3.12/site-packages/toil/leader.py", line 658, in _processReadyJobs
    self._processReadyJob(message.job_id, message.result_status)
  File "/user/work/tk19812/software/cactus-bin-v2.9.3/venv-cactus-v2.9.3/lib/python3.12/site-packages/toil/leader.py", line 574, in _processReadyJob
    self._runJobSuccessors(job_id)
  File "/user/work/tk19812/software/cactus-bin-v2.9.3/venv-cactus-v2.9.3/lib/python3.12/site-packages/toil/leader.py", line 461, in _runJobSuccessors
    self.issueJobs(successors)
  File "/user/work/tk19812/software/cactus-bin-v2.9.3/venv-cactus-v2.9.3/lib/python3.12/site-packages/toil/leader.py", line 941, in issueJobs
    self.issueJob(job)
  File "/user/work/tk19812/software/cactus-bin-v2.9.3/venv-cactus-v2.9.3/lib/python3.12/site-packages/toil/leader.py", line 918, in issueJob
    jobBatchSystemID = self.batchSystem.issueBatchJob(' '.join(workerCommand), jobNode, job_environment=job_environment)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/user/work/tk19812/software/cactus-bin-v2.9.3/venv-cactus-v2.9.3/lib/python3.12/site-packages/toil/batchSystems/singleMachine.py", line 759, in issueBatchJob
    self.check_resource_request(scaled_desc)
  File "/user/work/tk19812/software/cactus-bin-v2.9.3/venv-cactus-v2.9.3/lib/python3.12/site-packages/toil/batchSystems/singleMachine.py", line 509, in check_resource_request
    raise e
  File "/user/work/tk19812/software/cactus-bin-v2.9.3/venv-cactus-v2.9.3/lib/python3.12/site-packages/toil/batchSystems/singleMachine.py", line 505, in check_resource_request
    super().check_resource_request(requirer)
  File "/user/work/tk19812/software/cactus-bin-v2.9.3/venv-cactus-v2.9.3/lib/python3.12/site-packages/toil/batchSystems/abstractBatchSystem.py", line 371, in check_resource_request
    raise e
  File "/user/work/tk19812/software/cactus-bin-v2.9.3/venv-cactus-v2.9.3/lib/python3.12/site-packages/toil/batchSystems/abstractBatchSystem.py", line 364, in check_resource_request
    raise InsufficientSystemResources(requirer, resource, available)
toil.batchSystems.abstractBatchSystem.InsufficientSystemResources: The job 'RedMaskJob' kind-RedMaskJob/instance-ykf7av00 v1 is requesting 490104422400 bytes of memory, more than the maximum of 490000000000 bytes of memory that SingleMachineBatchSystem was configured with, or enforced by --maxMemory. Scale is set to 1.

What am I doing wrong? What is the difference between --maxMemory and --defaultMemory? Will I get the same kind of error for --defaultCores and --maxCores?

On a related note: would it be possible to run cactus on multiple nodes? If so, how should I set these options?

Thanks a lot
F

@glennhickey (Collaborator)

This is a very strange bug in Cactus and/or Toil where the same value (490G here) seems to be parsed into slightly fewer bytes for --maxMemory than for --defaultMemory. The work-around is fairly simple: just don't specify --defaultMemory (you generally don't need this option, and you certainly never want to set it this high).
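
For example, the original command with --defaultMemory simply dropped would look roughly like this (a sketch only; the jobstore, seqFile and output HAL names are placeholders for whatever the real run used):

cactus ./jobstore ./seqFile.txt out.hal \
    --maxMemory 490G --binariesMode local --stats \
    --logFile cactus.log --defaultCores 50
# --defaultMemory intentionally omitted: each job keeps its own memory estimate,
# and --maxMemory still caps every request at 490G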

To run on multiple nodes:

https://github.com/ComparativeGenomicsToolkit/cactus/blob/master/doc/progressive.md#running-on-a-cluster
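
In short (a rough sketch; the linked doc is the authoritative reference, and the jobstore/seqFile/output names below are placeholders): you start cactus once from a node that is allowed to submit Slurm jobs and pass --batchSystem slurm, and Toil then submits each internal job as its own Slurm job instead of running everything inside a single allocation:

cactus ./jobstore ./seqFile.txt out.hal \
    --batchSystem slurm --batchLogsDir ./batch-logs
# each Toil job is submitted via sbatch, so resources are requested per job
# rather than from one pre-allocated node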

@francicco (Author)

Thanks a lot Glenn! Is --maxMemory considered to be per node?

@francicco (Author)

Hi @glennhickey,

I'm having more resource-related problems. Attached you can find the cactus log; here is a piece of it:

[2025-01-20T11:16:03+0000] [MainThread] [W] [toil.leader] Job failed with exit value 1: 'run_lastz' kind-run_lastz/instance-a5cbmrht v24
Exit reason: FAILED
[2025-01-20T11:16:03+0000] [MainThread] [W] [toil.leader] The job seems to have left a log file, indicating failure: 'run_lastz' kind-run_lastz/instance-a5cbmrht v25
[2025-01-20T11:16:03+0000] [MainThread] [W] [toil.leader] Log from job "kind-run_lastz/instance-a5cbmrht" follows:
=========>
	[2025-01-20T10:16:52+0000] [MainThread] [I] [toil.worker] ---TOIL WORKER OUTPUT LOG---
	[2025-01-20T10:16:52+0000] [MainThread] [I] [toil] Running Toil version 7.0.0-d569ea5711eb310ffd5703803f7250ebf7c19576 on host bp1-compute101.data.bp.acrc.priv.
	[2025-01-20T10:16:52+0000] [MainThread] [I] [toil.worker] Working on job 'run_lastz' kind-run_lastz/instance-a5cbmrht v24
	[2025-01-20T10:16:55+0000] [MainThread] [I] [toil.worker] Loaded body Job('run_lastz' kind-run_lastz/instance-a5cbmrht v24) from description 'run_lastz' kind-run_lastz/instance-a5cbmrht v24
	[2025-01-20T10:16:56+0000] [MainThread] [I] [toil.statsAndLogging] For distance 0.024033 for genomes files/for-job/kind-make_chunked_alignments/instance-gh2_kphx/cleanup/file-b65fb2f6df1440fa8744814ccb7501f6/2.fa, files/for-job/kind-make_chunked_alignments/instance-gh2_kphx/cleanup/file-e3ab33914c3340d69fb80c2472820e64/1.fa using --step=2 --ambiguous=iupac,100,100 --ydrop=3000 --notransition lastz parameters
	[2025-01-20T10:16:56+0000] [MainThread] [I] [cactus.shared.common] Running the command ['lastz', 'Mmim_2.fa[multiple][nameparse=darkspace]', 'Mmex_1.fa[nameparse=darkspace]', '--format=paf:minimap2', '--step=2', '--ambiguous=iupac,100,100', '--ydrop=3000', '--notransition']
	[2025-01-20T10:16:56+0000] [MainThread] [I] [toil-rt] 2025-01-20 10:16:56.318379: Running the command: "lastz Mmim_2.fa[multiple][nameparse=darkspace] Mmex_1.fa[nameparse=darkspace] --format=paf:minimap2 --step=2 --ambiguous=iupac,100,100 --ydrop=3000 --notransition"
	[2025-01-20T11:16:01+0000] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files:
	[2025-01-20T11:16:01+0000] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-make_chunked_alignments/instance-gh2_kphx/cleanup/file-b65fb2f6df1440fa8744814ccb7501f6/2.fa' to path '/tmp/toilwf-1c4fd9c0d2315096957d7f0b6d3caf53/96b7/job/tmpgxx8djcl/Mmim_2.fa'
	[2025-01-20T11:16:01+0000] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-make_chunked_alignments/instance-gh2_kphx/cleanup/file-e3ab33914c3340d69fb80c2472820e64/1.fa' to path '/tmp/toilwf-1c4fd9c0d2315096957d7f0b6d3caf53/96b7/job/tmpgxx8djcl/Mmex_1.fa'
	[2025-01-20T11:16:01+0000] [MainThread] [C] [toil.worker] Worker crashed with traceback:
	Traceback (most recent call last):
	  File "/user/work/tk19812/software/cactus-bin-v2.9.3/venv-cactus-v2.9.3/lib/python3.12/site-packages/toil/worker.py", line 438, in workerScript
	    job._runner(jobGraph=None, jobStore=job_store, fileStore=fileStore, defer=defer)
	  File "/user/work/tk19812/software/cactus-bin-v2.9.3/venv-cactus-v2.9.3/lib/python3.12/site-packages/toil/job.py", line 2984, in _runner
	    returnValues = self._run(jobGraph=None, fileStore=fileStore)
	                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
	  File "/user/work/tk19812/software/cactus-bin-v2.9.3/venv-cactus-v2.9.3/lib/python3.12/site-packages/toil/job.py", line 2895, in _run
	    return self.run(fileStore)
	           ^^^^^^^^^^^^^^^^^^^
	  File "/user/work/tk19812/software/cactus-bin-v2.9.3/venv-cactus-v2.9.3/lib/python3.12/site-packages/toil/job.py", line 3158, in run
	    rValue = userFunction(*((self,) + tuple(self._args)), **self._kwargs)
	             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
	  File "/user/work/tk19812/software/cactus-bin-v2.9.3/venv-cactus-v2.9.3/lib/python3.12/site-packages/cactus/paf/local_alignment.py", line 67, in run_lastz
	    kegalign_messages = cactus_call(parameters=lastz_cmd, outfile=alignment_file, work_dir=work_dir, returnStdErr=gpu>0, gpus=gpu,
	                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
	  File "/user/work/tk19812/software/cactus-bin-v2.9.3/venv-cactus-v2.9.3/lib/python3.12/site-packages/cactus/shared/common.py", line 818, in cactus_call
	    output, stderr = process.communicate(stdin_string.encode() if first_run and stdin_string else None, timeout=10)
	                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
	  File "/software/local/languages/miniforge3/envs/bioconda/lib/python3.12/subprocess.py", line 1209, in communicate
	    stdout, stderr = self._communicate(input, endtime, timeout)
	                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
	  File "/software/local/languages/miniforge3/envs/bioconda/lib/python3.12/subprocess.py", line 2115, in _communicate
	    ready = selector.select(timeout)
	            ^^^^^^^^^^^^^^^^^^^^^^^^
	  File "/software/local/languages/miniforge3/envs/bioconda/lib/python3.12/selectors.py", line 415, in select
	    fd_event_list = self._selector.poll(timeout)
	                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
	KeyboardInterrupt
	
	[2025-01-20T11:16:01+0000] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host bp1-compute101.data.bp.acrc.priv
<=========
[2025-01-20T11:16:03+0000] [MainThread] [W] [toil.job] Due to failure we are reducing the remaining try count of job 'run_lastz' kind-run_lastz/instance-a5cbmrht v25 with ID kind-run_lastz/instance-a5cbmrht to 1
[2025-01-20T11:16:03+0000] [MainThread] [I] [toil.leader] Issued job 'run_lastz' kind-run_lastz/instance-a5cbmrht v26 with job batch system ID: 10 and disk: 2.0 Gi, memory: 2.0 Gi, cores: 1, accelerators: [], preemptible: True
[2025-01-20T11:16:03+0000] [MainThread] [W] [toil.leader] Job failed with exit value 1: 'run_lastz' kind-run_lastz/instance-bui1fhih v21
Exit reason: FAILED

From what I can understand, there seem to be insufficient cores and disk space for the specified parameters:
• Requested cores: 100, but only 40 are available.
• Requested disk space: 500 GB, but only ~63 GB is available.

I requested:
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=40
#SBATCH --mem=500G

and I'm running cactus with the following options:
--maxMemory 100G --batchSystem slurm --defaultPreemptable --maxCores 100 --consCores 20 --doubleMem true --batchLogsDir /tmp

What flags should I use to adjust the run?
Cheers
F

AntCactus.log
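
Purely for illustration, one assumption-laden sketch of capping the requests at what the allocation actually provides (assuming the run really does stay on a single 40-core node and that a work directory with more free space than /tmp exists; none of these values come from the thread, and the paths are placeholders):

cactus ./jobstore ./seqFile.txt out.hal \
    --batchSystem slurm --defaultPreemptable \
    --maxCores 40 --maxMemory 450G --doubleMem true \
    --maxDisk 400G --workDir /path/with/enough/space \
    --consCores 20 --batchLogsDir ./batch-logs
# --maxCores is capped at the 40 cores actually granted, and --maxDisk/--workDir
# point Toil at storage known to have room (placeholder path above)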
