Create Engaging Cluster documentation #33

Merged
merged 9 commits into from Jan 8, 2025
Changes from 8 commits
53 changes: 53 additions & 0 deletions docs/engaging.md
@@ -0,0 +1,53 @@
# MIT Engaging High Performance Compute Cluster

The Engaging High Performance Compute Cluster is available to LINC team members for running their jobs at scale, including on GPUs.

## Create an account

To access the Engaging Cluster, you will need an MIT Sponsored Account.

1. Please contact Kabi at [email protected] with your organization name, date of birth, and phone number.
2. Once the sponsored account is approved, you will receive an email to complete account registration and establish your MIT Kerberos identity.
3. Please send your Kerberos ID to Kabi so that he can add you to the WebMoira groups (`orcd_ug_pg_linc_all` and `orcd_ug_pi_satra_all`), which grant access to the Engaging Cluster.

## Documentation overview

The MIT Office of Research Computing and Data (ORCD) manages the Engaging Cluster. Most of the information you will need is in the first link below, but there are additional resources:

1. [Engaging Cluster docs](https://engaging-web.mit.edu/eofe-wiki/)
1. [ORCD Docs](https://orcd-docs.mit.edu/)
1. [MGHPCC OpenMind GitHub wiki](https://github.mit.edu/MGHPCC/OpenMind/wiki)
1. [Slurm docs](https://slurm.schedmd.com/overview.html)

## Access the cluster and run jobs

The Engaging Cluster has head/login nodes, which you use to access the cluster and submit jobs, and compute nodes, which run your resource-intensive scripts. Job orchestration is performed with the Slurm Workload Manager. The [Engaging Cluster Documentation](https://engaging-web.mit.edu/eofe-wiki/) provides details on these operations, including:

1. [Logging into the cluster](https://engaging-web.mit.edu/eofe-wiki/logging_in/)
1. [Cluster architecture including information on the head/login nodes versus compute nodes](https://engaging-web.mit.edu/eofe-wiki/slurm/cluster_workflow/)
1. [Common commands to interact with the Slurm Job Scheduler](https://engaging-web.mit.edu/eofe-wiki/slurm/slurm/)
1. [Run multiple jobs in parallel with `sbatch`](https://engaging-web.mit.edu/eofe-wiki/slurm/sbatch/)
1. [Run interactive jobs on a single compute node with `srun` or `salloc`](https://engaging-web.mit.edu/eofe-wiki/slurm/srun/)
1. [Access installed software](https://engaging-web.mit.edu/eofe-wiki/software/load_modules/)
1. [Determining resources for your job](https://engaging-web.mit.edu/eofe-wiki/slurm/resources/)

Slurm is a widely used workload manager, so you can also refer to the official [Slurm documentation](https://slurm.schedmd.com/overview.html).
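
For orientation, here is a minimal `sbatch` script sketch for a CPU-only job. The module name and script path are placeholders (not taken from the Engaging docs), so adapt them to your own software and the resource guidance linked above.

```bash
#!/bin/bash
#SBATCH --job-name=example-job      # name shown by squeue
#SBATCH --nodes=1                   # run on a single node
#SBATCH --ntasks=1                  # one task
#SBATCH --cpus-per-task=4           # CPU cores for that task
#SBATCH --mem=8G                    # memory for the job
#SBATCH --time=01:00:00             # walltime limit (HH:MM:SS)
#SBATCH --output=%x-%j.out          # log file named after the job name and job ID

# Load the software you need (placeholder module; run `module avail` to see what is installed)
module load python

# Run your resource-intensive script (placeholder path)
python my_analysis.py
```

Save this as, say, `my_job.sh`, submit it with `sbatch my_job.sh`, and check its status with `squeue -u <username>`.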

## Compute nodes

The Engaging Cluster has both CPU-only compute nodes and GPU compute nodes. The nodes are grouped into partitions, which control which groups have access to them.

See [Determining resources for your job](https://engaging-web.mit.edu/eofe-wiki/slurm/resources/) for details on selecting the nodes and resources for your jobs. Briefly, the `sinfo` command lists the partitions where you can submit jobs, and you can target a specific partition by including `#SBATCH --partition=<partition_name>` in your sbatch script.

The GPU nodes are available through the `ou_bcs_high` and `ou_bcs_low` partitions. For more details, see the [BCS computing resources on Engaging - Slurm configuration](https://github.mit.edu/MGHPCC/OpenMind/wiki/User-guide-for-BCS-computing-resources-on-Engaging#slurm-configuration) wiki.
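
As a rough illustration, the directives below sketch how a job might request a GPU on one of these partitions. The GPU request line (`--gres=gpu:1`) is standard Slurm syntax, but the exact GPU types, counts, and limits are cluster-specific, so confirm them against the BCS wiki linked above.

```bash
#!/bin/bash
#SBATCH --job-name=gpu-example
#SBATCH --partition=ou_bcs_low      # GPU partition (see the BCS wiki for high vs. low)
#SBATCH --gres=gpu:1                # request one GPU (verify syntax and limits in the BCS wiki)
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --time=02:00:00
#SBATCH --output=%x-%j.out

# Placeholder command; run only the GPU-dependent steps of your workflow here
python train_model.py
```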

## Data storage

Data can be stored under `/orcd/data/linc/`. We will be working on an organization strategy for the LINC project data, but for now please store your data under a subdirectory (e.g., `/orcd/data/linc/<username>` or `/orcd/data/linc/<projectname>`). There are additional locations to store your data, including scratch space (`/orcd/scratch/bcs/001`, `/orcd/scratch/bcs/002`, `/pool001/<username>`); see the [Storage](https://engaging-web.mit.edu/eofe-wiki/storage/) page and the [BCS computing resources on Engaging](https://github.mit.edu/MGHPCC/OpenMind/wiki/User-guide-for-BCS-computing-resources-on-Engaging) wiki for details.
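
For example, assuming your Kerberos username is available as `$USER` on the cluster, a personal subdirectory could be set up like this (a sketch only; the layout may change once the LINC organization strategy is in place):

```bash
# Create a personal subdirectory under the shared LINC data path
mkdir -p /orcd/data/linc/$USER

# Copy results from a scratch workspace into it (paths are illustrative)
cp -r /orcd/scratch/bcs/001/$USER/results /orcd/data/linc/$USER/
```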

## Best practices

1. Please be respectful of these resources as they are used by many groups.
1. Only run resource-intensive scripts on the compute nodes, never on the login/head nodes.
1. Run steps on a GPU compute node only if those steps require a GPU; run all other steps on a CPU-only compute node.
1. Monitor your jobs frequently (`squeue -u <username>`); a few useful commands are sketched below.
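
A few standard Slurm commands for monitoring (all are stock Slurm; output formatting may vary slightly on Engaging):

```bash
sinfo                       # list partitions you can submit to and node states
squeue -u <username>        # list your pending and running jobs
sacct -j <job_id>           # accounting details for a running or finished job
scancel <job_id>            # cancel a job you no longer need
```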
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -25,6 +25,7 @@ nav:
- Add Webknossos dataset: "webknossos-add-dataset.md"
- Deployment: https://github.com/lincbrain/webknossos/blob/ak-dev/LINC_DEPLOYMENT.md
- Development: https://github.com/lincbrain/webknossos/blob/ak-dev/LINC_DEVELOPMENT.md
- Engaging Cluster: "engaging.md"
- DANDI Docs: https://www.dandiarchive.org/handbook
- API:
- DANDI Client: https://dandi.readthedocs.io