Troubleshooting
To stop a running program, use Ctrl + C.
To exit the current shell (for example, the Singularity environment), use Ctrl + D or type exit.
To log out of HPC, type exit.
To cancel a job, first see all your job ids (replace <netid> with your own):
squeue -u <netid>
Find the <jobid>, then cancel that job using:
scancel <jobid>
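If you want to cancel all of your jobs at once, scancel also accepts a user flag (standard SLURM, so it should work on HPC as well):
scancel -u <netid>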
Sometimes you might get kicked off HPC by accident and find everything gone when you log back in. In most cases your interactive CPU/GPU jobs will be killed and you have to start everything again. However, you can try your luck and jump back to your CPU/GPU node:
See all your job ids (replace <netid> with your own):
squeue -u <netid>
Find the <node-name> under NODELIST, then log back in using:
ssh <node-name>
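If you only want the node names, squeue can print just that column (standard SLURM format options: -h suppresses the header, %N prints the node list):
squeue -u <netid> -h -o "%N"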
If you encounter the following error:
FATAL: while loading overlay images: failed to open overlay image ...
it is most probably because you are trying to open the Singularity environment while another process is writing to the overlay. There are two possible situations:
1. Another of your jobs still has the overlay open in read-write mode. You have to kill that job. Type (replace <netid> with your own):
squeue -u <netid>
Find the <jobid>, then kill that job using:
scancel <jobid>
You do have to request the CPU/GPU again, but that is the only way to solve this issue.
2. No job of yours is holding the overlay, and the overlay itself is broken. Unfortunately you have to delete the overlay and start again.
If you set up Singularity with my bash script, you can simply reset the Singularity environment by rerunning the script:
./chslauncher.sh
and selecting reinstall singularity env. However, you do have to reinstall the Python packages you installed before.
The command for opening the Singularity environment is:
singularity exec --nv --bind $data_dir --overlay $overlay:rw $singularity_file /bin/bash
Here :rw stands for read-write mode; you can change it to :ro for read-only mode. A Singularity overlay can only be opened by multiple processes in read-only mode. The recommended approach is to open the overlay in read-write mode on a CPU/GPU node only when you want to install Python packages, and open it in read-only mode in all other cases.
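For example, a read-only session with the same paths as above would look like:
# open the same container read-only, so other processes can share the overlay
singularity exec --nv --bind $data_dir --overlay $overlay:ro $singularity_file /bin/bash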
If you see a warning like:
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
when trying to log in to HPC, don't be alarmed. It is caused by HPC having multiple login nodes. Check the official guide for SSH configuration if you are interested.
To avoid this, simply open the SSH config file at ~/.ssh/config on your local machine with your favorite text editor (you might need to create the file if it doesn't exist), and add the following lines:
Host *.hpc.nyu.edu
    StrictHostKeyChecking no
    UserKnownHostsFile /dev/null
    LogLevel ERROR
Now you can log in to HPC without seeing the warning.
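Alternatively, if you prefer not to disable host-key checking for these hosts, standard OpenSSH lets you remove just the stale entry from known_hosts (replace the hostname with the login node you actually use):
ssh-keygen -R <login-node-hostname>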
- Check disk storage. If you encounter a "Disk quota exceeded" error, don't panic; you can check your disk usage with:
myquota
- Check the size of each folder:
du -h --max-depth 1
- Check the number of files in each folder:
# for each direct subfolder of the current directory, count the files it contains
for d in $(find $(pwd) -maxdepth 1 -mindepth 1 -type d | sort -u); do n_files=$(find $d | wc -l); echo $d $n_files; done
Delete files in $HOME (most probably Python packages), and you will be fine.
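As a rough pointer, caches like the following are common culprits, though the exact paths depend on your setup, so check with du first:
rm -rf ~/.cache/pip      # pip's download cache
conda clean --all        # conda's package cache, if you use conda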
If you successfully logged in to HPC through VS Code before but now you cannot, it is most probably because of "Disk quota exceeded". Log in with a terminal and check your disk quota using myquota.
If you find that .vscode-server contains a lot of files, delete it and you will be fine:
rm -rf .vscode-server
This mainly happens because you installed too many extensions in VS Code. After deleting .vscode-server, all extensions on HPC VS Code will be removed (it won't affect VS Code on your local machine).
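To confirm that .vscode-server really is what fills your quota before deleting it, check its size (assuming it lives in your home directory, which is the default):
du -sh ~/.vscode-server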
When encountering a "CUDA out of memory" error, you might want to monitor the GPU memory usage. You can check it with:
nvidia-smi
or watch it in real time using:
watch -n 1 nvidia-smi
I personally recommend logging in to the same GPU node with two terminals: one for running your code, the other for monitoring the GPU memory usage. This is easy to do in both a plain terminal and the VS Code integrated terminal. Check the section above for how to jump to the desired node:
ssh <node-name>
You can also monitor your own processes on the node with:
top -u $USER
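If you prefer plain numbers over the full nvidia-smi table, nvidia-smi also has a query mode (standard flags; -l 1 refreshes every second):
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1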
If you encounter the error SSL certificate problem: unable to get local issuer certificate, you can try the following command in the terminal to address it:
export SSL_CERT_FILE=/etc/ssl/certs/ca-certificates.crt
You can add this line to your ~/.bashrc file to make it permanent.
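For example, one way to append it and reload your shell configuration:
echo 'export SSL_CERT_FILE=/etc/ssl/certs/ca-certificates.crt' >> ~/.bashrc
source ~/.bashrc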
If you used run_setup.sh in the lazy launcher script, you should never encounter this problem. However, if you do, please raise an issue to let me know.
If commands like myquota, squeue, or scancel cannot be executed, you are probably inside the Singularity environment. These commands can only be used outside the Singularity environment.
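A quick way to check where you are (Singularity sets this variable inside a container; if it prints nothing, you are outside):
echo $SINGULARITY_CONTAINER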
If you activated your conda environment inside Singularity using source /ext3/env.sh and typed which conda but nothing shows up, then something probably went wrong during the installation.
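A quick sanity check inside the container (assuming the launcher's default layout, where conda is installed under /ext3):
source /ext3/env.sh
which conda       # should print a path under /ext3
conda --version   # should print the installed conda version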
If you had the chance to see the error message during installation and found ERROR: cannot verify repo.continuum.io's certificate, then you can try the following:
Rerun the script ./chslauncher.sh and select "reinstall conda inside the singularity" to reinstall conda.
If the problem still exists, run the script with:
./chslauncher.sh --no-check-certificate
and select "reinstall conda inside the singularity" to reinstall conda.
Note that the --no-check-certificate flag is not recommended for security reasons, but it is the only way to bypass the certificate verification in Singularity so far. I have double-checked that the Miniconda download link is the same as in the official guide, so it should be safe to use.