Trouble Shooting

RicercarG edited this page Jun 13, 2024 · 6 revisions

How Can I quit?

Quit a running python (or other) program in terminal

Use Ctrl + C

Exit the singularity environment

Use Ctrl + D or type exit

Exit a CPU/GPU node that you are in

Type exit

Cancel a slurm job or CPU/GPU node

See all your job ids using (replace <netid> with your own)

squeue -u <netid>

Find the <jobid>, and then cancel that job using

scancel <jobid>
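If you have several jobs to clean up, the two steps above can be wrapped in small shell functions. This is only a sketch: the function names my_job_ids and cancel_my_jobs are made up for illustration, and it assumes your cluster username equals $USER unless you pass a <netid> as the first argument. The squeue options -h (no header) and -o '%i' (print only the job ID) are standard Slurm options.

```shell
# Hypothetical helpers, not part of Slurm or the launcher script.

# Print only the job IDs of a user's jobs (defaults to $USER).
# -h drops the header line; -o '%i' prints just the job ID.
my_job_ids() {
    squeue -u "${1:-$USER}" -h -o '%i'
}

# Cancel every job listed by my_job_ids, echoing each ID first.
cancel_my_jobs() {
    my_job_ids "$@" | while read -r jobid; do
        echo "cancelling $jobid"
        scancel "$jobid"
    done
}
```

Run cancel_my_jobs outside the singularity environment. If you do not need to see the individual IDs, scancel -u <netid> cancels all of your jobs in one go.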

How Can I jump back when kicked off by accident?

Sometimes you might get kicked off HPC by accident and find everything gone when you log back in. In most cases, your interactive CPU/GPU jobs will have been killed, and you have to start everything again. However, you can try your luck and jump back to your CPU/GPU node as follows:

See all your job ids using (replace <netid> with your own)

squeue -u <netid>

Find the <node-name> under NODELIST, and then log back in using

ssh <node-name>
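The lookup above can be shortened with a small helper. This is a sketch under assumptions: first_running_node is a made-up name, it assumes your username equals $USER unless you pass a <netid>, and it assumes each of your jobs runs on a single node (for multi-node jobs, NODELIST is a compressed expression that ssh cannot use directly). The squeue options -t RUNNING and -o '%N' are standard Slurm options.

```shell
# Hypothetical helper: print the node of your first running job.
# -h drops the header, -t RUNNING keeps only running jobs,
# -o '%N' prints only the NODELIST column.
first_running_node() {
    squeue -u "${1:-$USER}" -h -t RUNNING -o '%N' | head -n 1
}

# Usage (from a login node, outside singularity):
#   ssh "$(first_running_node)"
```

If the function prints nothing, you have no running job left and need to request a node again.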

Could not open singularity environment

If you encounter the following error

FATAL:   while loading overlay images: failed to open overlay image ...

It's most probably because you are trying to open the singularity environment while another process is writing the overlay.

There are two possible situations:

1. If the overlay is opened in a CPU/GPU node:

You have to kill that job. Type (replace <netid> with your own)

squeue -u <netid>

Find the <jobid>, and then kill that job using

scancel <jobid>

You will have to request the CPU/GPU node again, but that's the only way to solve this issue.

2. If the overlay is opened in a login node:

Unfortunately you have to delete the overlay and start again.

If you set up singularity with my bash script, you can simply reset the singularity environment by rerunning the script

./chslauncher.sh

and select reinstall singularity env. However, you do have to reinstall the python packages you installed before.

Some Extra tips:

The command for opening the singularity environment is: singularity exec --nv --bind $data_dir --overlay $overlay:rw $singularity_file /bin/bash. Here :rw stands for Read and Write mode, and you can change it to :ro for Read only mode.

A singularity overlay can be opened by multiple processes at the same time only in Read only mode.

The recommended way is to open the overlay in Read and Write mode on a CPU/GPU node only when you want to install python packages, and to open it in Read only mode in all other cases.
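To make Read only the default and Read and Write an explicit choice, the command can be wrapped in a function. This is a minimal sketch: open_env is a made-up name, and it assumes $data_dir, $overlay and $singularity_file are already set in your shell as in the command above.

```shell
# Hypothetical wrapper around the singularity command shown above.
# Usage: open_env        -> opens the overlay read-only (safe default)
#        open_env rw     -> opens the overlay read-write (for installing packages)
open_env() {
    mode="${1:-ro}"
    case "$mode" in
        ro|rw) ;;
        *) echo "usage: open_env [ro|rw]" >&2; return 1 ;;
    esac
    singularity exec --nv --bind "$data_dir" \
        --overlay "$overlay:$mode" "$singularity_file" /bin/bash
}
```

With this wrapper, a mistyped mode is rejected before singularity is started, and forgetting the argument never takes the write lock on the overlay.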

"Man-In-The-Middle" Warning

If you see a warning like

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.

when trying to log in to HPC, don't be alarmed. This is caused by HPC having multiple login nodes. Check the official guide for ssh configuration if you are interested.

To avoid this, simply open the ssh config file at ~/.ssh/config on your local machine with your favorite text editor (you might need to create the file if it doesn't exist), and add the following lines:

Host *.hpc.nyu.edu
  StrictHostKeyChecking no
  UserKnownHostsFile /dev/null
  LogLevel ERROR

Now you can log in to HPC without seeing the warning.
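If you prefer not to edit the file by hand, the same lines can be appended from a terminal on your local machine. This is a sketch, not an official procedure: add_hpc_ssh_config is a made-up name, and the function skips the append when the Host entry is already present so that it can be run more than once.

```shell
# Hypothetical helper: append the HPC block to an ssh config file once.
# Usage: add_hpc_ssh_config            (uses ~/.ssh/config)
#        add_hpc_ssh_config /some/file (uses the given file)
add_hpc_ssh_config() {
    cfg="${1:-$HOME/.ssh/config}"
    mkdir -p "$(dirname "$cfg")"
    touch "$cfg"
    # Do nothing if the entry already exists.
    if grep -qF 'Host *.hpc.nyu.edu' "$cfg"; then
        echo "already configured: $cfg"
        return 0
    fi
    cat >> "$cfg" <<'EOF'

Host *.hpc.nyu.edu
  StrictHostKeyChecking no
  UserKnownHostsFile /dev/null
  LogLevel ERROR
EOF
    # ssh expects the config file to be writable only by its owner.
    chmod 600 "$cfg"
}
```

Keep in mind that these settings turn off host key verification for the matching hosts, which is why the pattern is limited to *.hpc.nyu.edu.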

Disk Quota Exceeded

If you encounter the problem of "Disk quota exceeded", don't panic.

  • Check your disk storage
myquota
  • Check the size of each folder
du -h --max-depth 1
  • Check file numbers in each folder
for d in $(find $(pwd) -maxdepth 1 -mindepth 1 -type d | sort -u); do n_files=$(find $d | wc -l); echo $d $n_files; done

Delete files in $HOME (most probably python packages), then you will be fine.
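The file-count loop above splits folder names on spaces and prints the folders in name order. A variant that handles spaces and lists the largest folder first might look like this. It is a sketch: count_files_per_dir is a made-up name, and, like the loop above, each count includes the folder itself.

```shell
# Hypothetical helper: count entries under each direct subfolder,
# largest first. Usage: count_files_per_dir [folder]   (default: .)
count_files_per_dir() {
    root="${1:-.}"
    find "$root" -mindepth 1 -maxdepth 1 -type d | while IFS= read -r d; do
        # tr strips the padding that some wc implementations add.
        n=$(find "$d" | wc -l | tr -d ' ')
        printf '%s\t%s\n' "$n" "$d"
    done | sort -rn
}
```

For example, count_files_per_dir "$HOME" | head shows the ten folders that contribute most to the file-number quota.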

Could not login server through vscode

If you successfully logged in to HPC through vscode before but now cannot, it is most probably because of "Disk quota exceeded". You can log in with a terminal and check your disk quota using myquota. If find .vscode-server shows that the folder contains a lot of files, delete it and you will be fine:

rm -rf .vscode-server

This mainly happens because you installed too many extensions in vscode. After deleting .vscode-server, all extensions on HPC vscode will be removed (it won't affect vscode on your local machine).

Out of Memory Error

Check GPU Status

When encountering a "CUDA out of memory" error, you might want to monitor the GPU memory usage. You can check it by:

nvidia-smi

or watch in realtime using

watch -n 1 nvidia-smi

I personally recommend logging in to the same GPU node from two terminals, one for running your code and the other for monitoring the GPU memory usage. It's easy to do in both a regular terminal and the vscode integrated terminal.
See "How Can I jump back when kicked off by accident?" above for how to jump to the desired node.
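When you only care about memory, the full nvidia-smi table can be reduced to one line per GPU. The sketch below uses the nvidia-smi query options --query-gpu and --format=csv,noheader,nounits; the function name gpu_mem_usage is made up for illustration, and it assumes the reported unit is MiB, which is what nvidia-smi normally uses for memory.

```shell
# Hypothetical helper: one summary line per GPU with used/total memory.
gpu_mem_usage() {
    nvidia-smi --query-gpu=index,memory.used,memory.total \
               --format=csv,noheader,nounits |
    awk -F', *' '{ printf "GPU %s: %d / %d MiB (%.0f%%)\n", $1, $2, $3, 100 * $2 / $3 }'
}

# Usage on a GPU node, refreshing every second:
#   while true; do gpu_mem_usage; sleep 1; done
```

The loop in the usage comment is used instead of watch because watch starts a fresh shell that does not know functions defined in your current session.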

Check CPU Status

ssh <node-name>
top -u $USER

SSL Certificate Error

If you encounter the error of SSL certificate problem: unable to get local issuer certificate, you can try to use the following command in terminal to address this problem:

export SSL_CERT_FILE=/etc/ssl/certs/ca-certificates.crt

You could add this line to your ~/.bashrc file to make it permanent.
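Before making the setting permanent, it may be worth checking that the certificate bundle actually exists on the machine (or inside the singularity image) you are using. A minimal sketch, where use_system_ca_bundle is a made-up name and the default path is the one from the command above:

```shell
# Hypothetical helper: export SSL_CERT_FILE only if the bundle is readable.
# Usage: use_system_ca_bundle [path-to-bundle]
use_system_ca_bundle() {
    bundle="${1:-/etc/ssl/certs/ca-certificates.crt}"
    if [ -r "$bundle" ]; then
        export SSL_CERT_FILE="$bundle"
    else
        echo "CA bundle not found: $bundle" >&2
        return 1
    fi
}
```

If the function reports that the bundle is missing, exporting the variable would not help, and the certificate file has to be located elsewhere first.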

If you have used run_setup.sh in lazy launcher script, then you should never encounter this problem. However, if you do, please raise an issue to let me know.

Some linux commands could not be executed

If commands like myquota, squeue, scancel could not be executed, then probably you are inside the singularity environment. You can only use these commands outside the singularity environment.
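A quick way to tell where you are is to look at the environment. The sketch below assumes that Singularity sets the variable SINGULARITY_CONTAINER inside a container (and that Apptainer sets APPTAINER_CONTAINER); please verify this on your system with env | grep -i singularity, since it is an assumption and not something the launcher script guarantees. The function name in_singularity is made up for illustration.

```shell
# Hypothetical helper: succeed if the shell appears to run inside
# a Singularity/Apptainer container (based on environment variables).
in_singularity() {
    [ -n "${SINGULARITY_CONTAINER:-}${APPTAINER_CONTAINER:-}" ]
}

# Usage:
#   if in_singularity; then
#       echo "inside singularity: exit first to use squeue/scancel/myquota"
#   fi
```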

Lazy Launcher Script Related

Conda environment installation failed

If you activated your conda environment inside singularity using source /ext3/env.sh and typed which conda, but nothing shows up, then something probably went wrong during the installation.

If you have the chance to see the error message during installation, and you find ERROR: cannot verify repo.continuum.io's certificate, then you can try the following:

Rerun the script ./chslauncher.sh, and select "reinstall conda inside the singularity" to reinstall conda.

If the problem still exists, try to run the script by

./chslauncher.sh --no-check-certificate

Select "reinstall conda inside the singularity" to reinstall conda.

Note that the --no-check-certificate flag is not recommended for security reasons, but it's the only way to bypass the certificate verification in singularity so far. I've double checked that the download link for miniconda is the same as that in the official guide, so it should be safe to use.