Merge pull request #33 from kabilar/engaging
Create Engaging Cluster documentation
kabilar authored Jan 8, 2025
2 parents de129d0 + ac3d4d3 commit aea19a5
Showing 2 changed files with 54 additions and 0 deletions.
53 changes: 53 additions & 0 deletions docs/engaging.md
@@ -0,0 +1,53 @@
# MIT Engaging High Performance Compute Cluster

The Engaging High Performance Compute Cluster is available to LINC team members for running their jobs at scale, including on GPUs.

## Create an account

To access the Engaging Cluster, you will need an MIT Sponsored Account.

1. Please contact Kabi at [email protected] with your organization name, date of birth, and phone number.
2. Once the sponsored account is approved, you will receive an email to complete account registration and establish your MIT Kerberos identity.
3. Please send your Kerberos ID to Kabi so that he can add you to the WebMoira group (`orcd_ug_pg_linc_all`), which grants access to the Engaging Cluster.

## Documentation overview

The MIT Office of Research Computing and Data (ORCD) manages the Engaging Cluster. Most of the information you will need is in the first link below, but there are additional resources:

1. [Engaging Cluster docs](https://engaging-web.mit.edu/eofe-wiki/)
1. [ORCD Docs](https://orcd-docs.mit.edu/)
1. [MGHPCC OpenMind GitHub wiki](https://github.mit.edu/MGHPCC/OpenMind/wiki)
1. [Slurm docs](https://slurm.schedmd.com/overview.html)

## Access the cluster and run jobs

The Engaging Cluster has head/login nodes, which you use to access the cluster and submit jobs, and compute nodes, which run your resource-intensive scripts. Job orchestration is performed with the Slurm Workload Manager. The [Engaging Cluster Documentation](https://engaging-web.mit.edu/eofe-wiki/) provides details on these operations, including:

1. [Logging into the cluster](https://engaging-web.mit.edu/eofe-wiki/logging_in/)
1. [Cluster architecture including information on the head/login nodes versus compute nodes](https://engaging-web.mit.edu/eofe-wiki/slurm/cluster_workflow/)
1. [Common commands to interact with the Slurm Job Scheduler](https://engaging-web.mit.edu/eofe-wiki/slurm/slurm/)
1. [Run multiple jobs in parallel with `sbatch`](https://engaging-web.mit.edu/eofe-wiki/slurm/sbatch/)
1. [Run interactive jobs on a single compute node with `srun` or `salloc`](https://engaging-web.mit.edu/eofe-wiki/slurm/srun/)
1. [Access installed software](https://engaging-web.mit.edu/eofe-wiki/software/load_modules/)
1. [Determining resources for your job](https://engaging-web.mit.edu/eofe-wiki/slurm/resources/)

Slurm is a widely used workload manager, so you can also refer to the official [Slurm documentation](https://slurm.schedmd.com/overview.html).
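
As a quick reference, a batch job submission might look like the sketch below. This is only a minimal example: the partition, resource requests, module name, and command are placeholders that you should replace with values appropriate for your job.

```bash
#!/bin/bash
#SBATCH --job-name=example            # a name for the job
#SBATCH --partition=<partition_name>  # a partition you can submit to (see `sinfo`)
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=01:00:00               # wall-clock limit (HH:MM:SS)
#SBATCH --output=%x_%j.out            # log file named <job-name>_<job-id>.out

module load <module_name>             # load the software your job needs
python my_script.py                   # replace with your own command
```

Save this as, for example, `job.sh`, then submit and monitor it from a login node:

```bash
sbatch job.sh          # submit the batch script
squeue -u <username>   # check the job's state
scancel <job_id>       # cancel it if needed
```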

## Compute nodes

The Engaging Cluster has both CPU-only compute nodes and GPU compute nodes. The nodes are grouped into partitions, which control which groups have access to them.

See [Determining resources for your job](https://engaging-web.mit.edu/eofe-wiki/slurm/resources/) for details on selecting the nodes and resources for your jobs. Briefly, the `sinfo` command lists the partitions where you can submit jobs, and you can target a specific partition by including `#SBATCH --partition=<partition_name>` in your sbatch script.
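
For example, a few standard `sinfo` invocations for exploring partitions (the output depends on the partitions you have access to):

```bash
sinfo                               # list the partitions available to you and their node states
sinfo --partition=<partition_name>  # limit the listing to one partition
sinfo -N -l                         # node-oriented, long-format view with per-node details
```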

The GPU nodes are available through the `ou_bcs_high` and `ou_bcs_low` partitions. For more details, see the [BCS computing resources on Engaging - Slurm configuration](https://github.mit.edu/MGHPCC/OpenMind/wiki/User-guide-for-BCS-computing-resources-on-Engaging#slurm-configuration) wiki.
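
For a GPU job, the relevant lines of an sbatch script might look like the fragment below. The `--gres=gpu:1` request is standard Slurm syntax, but please confirm the exact GPU resource names and per-partition limits in the BCS wiki linked above.

```bash
#SBATCH --partition=ou_bcs_low   # or ou_bcs_high
#SBATCH --gres=gpu:1             # request one GPU on the node
```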

## Data storage

Data can be stored under `/orcd/data/linc/`. We are still working out an organization strategy for the LINC project data, so for now please store your data under a subdirectory (e.g. `/orcd/data/linc/<username>` or `/orcd/data/linc/<projectname>`). Additional storage locations, including scratch space (`/orcd/scratch/bcs/001`, `/orcd/scratch/bcs/002`, `/pool001/<username>`), are described on the [Storage](https://engaging-web.mit.edu/eofe-wiki/storage/) page and the [BCS computing resources on Engaging](https://github.mit.edu/MGHPCC/OpenMind/wiki/User-guide-for-BCS-computing-resources-on-Engaging) wiki.
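
For example, to set up a personal subdirectory under the LINC data area (one possible layout, since the organization strategy is still being decided):

```bash
mkdir -p /orcd/data/linc/$USER   # create your own subdirectory if it does not exist
cd /orcd/data/linc/$USER
```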

## Best practices

1. Please be respectful of these resources as they are used by many groups.
1. Only run resource-intensive scripts on the compute nodes, not on the login/head nodes.
1. Only run the steps in your script on a GPU compute node if those steps require a GPU. All other steps should be run on a CPU-only compute node.
1. Monitor your jobs frequently (`squeue -u <username>`; see the examples below).
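
A few standard Slurm commands for monitoring your jobs (the `sacct` example assumes job accounting is enabled on the cluster):

```bash
squeue -u <username>           # list your pending and running jobs
squeue -u <username> --start   # show estimated start times for pending jobs
scancel <job_id>               # cancel a job you no longer need
sacct -u <username> -S today   # summarize today's completed and running jobs
```
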
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -26,6 +26,7 @@ nav:
- Add Webknossos dataset: "webknossos-add-dataset.md"
- Deployment: https://github.com/lincbrain/webknossos/blob/ak-dev/LINC_DEPLOYMENT.md
- Development: https://github.com/lincbrain/webknossos/blob/ak-dev/LINC_DEVELOPMENT.md
- Engaging Cluster: "engaging.md"
- DANDI Docs: https://www.dandiarchive.org/handbook
- API:
- DANDI Client: https://dandi.readthedocs.io