
Some Samples Encounter ForkProcess Empty Issues with No Output Using run_pangenome_aware_deepvariant #926

EEEdyeah opened this issue Jan 19, 2025 · 4 comments


@EEEdyeah

Have you checked the FAQ? https://github.com/google/deepvariant/blob/r1.8/docs/FAQ.md:
yes
Describe the issue:
I was running run_pangenome_aware_deepvariant on vg Giraffe-mapped BAM files. However, some of the samples ran into a Process ForkProcess issue: the run didn't throw an error, didn't terminate properly, and produced no output files.
Setup

  • Operating system: slurm

  • DeepVariant version: 1.8.0

  • Installation method (Docker, built from source, etc.): singularity pull

  • Type of data: (sequencing instrument, reference genome, anything special that is unlike the case studies?)
    Illumina human 30x WGS, vg Giraffe-mapped HPRC

Steps to reproduce:

  • Command:
    singularity exec -B /path/:/path/ /path/deepvariant_pangenome_aware_deepvariant-1.8.0.sif /opt/deepvariant/bin/run_pangenome_aware_deepvariant \
    --model_type=WGS \
    --ref=/path/HPRC.GRCh38.reordered.fa \
    --reads=/path/$sample_name.surject.GRCh38.sorted.dedup.lefted.realigned.bam \
    --num_shards=4 \
    --sample_name_reads=$sample_name \
    --output_vcf /path/$sample_name.deepvariant.vcf.gz \
    --output_gvcf /path/$sample_name.deepvariant.gvcf.gz \
    --pangenome /path/HPRC_graph.gbz \
    --sample_name_pangenome HPRC \
    --regions chr6:28000000-35000000 \
    --disable_small_model \
    --intermediate_results_dir /path/dpvariant

  • Error trace: (if applicable)
    The logs indicate the program was running normally until encountering the following issues:
    2025-01-18 22:43:10.537301: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1';
    2025-01-18 22:43:10.537341: W tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:265] failed call to cuInit: UNKNOWN ERROR (303)
    I0118 22:43:10.537735 47448200671232 call_variants.py:918] call_variants: env = {'BASH_FUNC_module()': '() { eval /usr/bin/modulecmd bash $*\n}', 'SH
    I0118 22:43:10.659484 47448200671232 call_variants.py:785] Total 1 writing processes started.
    I0118 22:43:10.661774 47448200671232 call_variants.py:796] Use saved model: True
    I0118 22:43:10.665955 47448200671232 dv_utils.py:325] From /path/dpvariant/make_examples_pangenome_aware_dv.t
    I0118 22:43:21.476414 47448200671232 dv_utils.py:325] From /opt/models/pangenome_aware_deepvariant/wgs/example_info.json: Shape of input examples: [200,
    I0118 22:43:21.476675 47448200671232 call_variants.py:814] example_shape: [200, 221, 7]
    Process ForkProcess-1:
    Traceback (most recent call last):
    File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
    File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
    File "/tmp/Bazel.runfiles_yqt9b630/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 551, in post_processing
    item = output_queue.get(timeout=180)
    File "/usr/lib/python3.10/multiprocessing/queues.py", line 114, in get
    raise Empty
    _queue.Empty
    I0118 22:46:46.215257 47448200671232 call_variants.py:891] Predicted 1024 examples in 1 batches [19.962 sec per 100].
    I0118 23:42:47.613373 47448200671232 call_variants.py:967] Complete: call_variants.

Does the quick start test work on your system?
Yes, the quick start test works, and most of the samples finish normally.

Any additional context:
Initially, I thought the issue was caused by the small model, so I added the --disable_small_model parameter. This allowed some samples to run successfully, but the same issue persisted for others.

@kishwarshafin
Collaborator

Hi @EEEdyeah, can you please run it on the entire chr6 to see if the issue persists?

@EEEdyeah
Author

@kishwarshafin Hi, I am trying that now and it is still running. In the meantime, I found that when I reran the same command (chr6:28000000-35000000), some of the previously failed samples ran successfully. So the same command can produce different results, which makes me question the stability of the runs that did succeed.

@kishwarshafin
Collaborator

@EEEdyeah are you running on a system that pauses processes? It looks like in your run, call_variants was paused and the queue did not receive anything for 180 seconds, which is why it got killed. Can you try setting num cpus to 0 from the command line and see if it still gets killed?
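
For context, one way to check whether the scheduler suspended or starved the job is to look at its accounting record. A minimal sketch, assuming a SLURM cluster; the job ID is a placeholder and the field selection is just illustrative:

    # Placeholder job ID; replace with the ID of the run that produced no output.
    JOBID=1234567

    # State history for the job and its steps: SUSPENDED, REQUEUED or PREEMPTED
    # would indicate the scheduler paused or restarted the process.
    sacct -j "$JOBID" --format=JobID,State,Elapsed,MaxRSS,AllocCPUS,NodeList

    # For a job that is still running, show its current state and allocated node.
    scontrol show job "$JOBID"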

@EEEdyeah
Author

@kishwarshafin Sorry for the late reply. I’m not entirely sure what caused the issue, but I think I’ve found a solution. Running each job on a separate node seems to prevent the error from occurring.
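
For anyone hitting the same thing, here is a minimal sketch of that workaround on SLURM; the script name, resource limits and paths are placeholders, not taken from the original report. Requesting the node exclusively keeps other jobs from competing for CPU on it, which matches the "one job per node" setup described above:

    #!/bin/bash
    #SBATCH --job-name=dv_pangenome
    #SBATCH --nodes=1
    #SBATCH --exclusive          # reserve the whole node for this one sample
    #SBATCH --cpus-per-task=4    # matches --num_shards=4 in the command above
    #SBATCH --time=24:00:00

    sample_name=$1

    singularity exec -B /path/:/path/ /path/deepvariant_pangenome_aware_deepvariant-1.8.0.sif \
      /opt/deepvariant/bin/run_pangenome_aware_deepvariant \
      --model_type=WGS \
      --ref=/path/HPRC.GRCh38.reordered.fa \
      --reads=/path/$sample_name.surject.GRCh38.sorted.dedup.lefted.realigned.bam \
      --num_shards=4 \
      --sample_name_reads=$sample_name \
      --output_vcf /path/$sample_name.deepvariant.vcf.gz \
      --output_gvcf /path/$sample_name.deepvariant.gvcf.gz \
      --pangenome /path/HPRC_graph.gbz \
      --sample_name_pangenome HPRC \
      --regions chr6:28000000-35000000 \
      --disable_small_model \
      --intermediate_results_dir /path/dpvariant

Submitting one such script per sample (e.g. sbatch run_dv.sbatch SAMPLE01, with run_dv.sbatch as a hypothetical script name) gives each run its own node.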
