more README; better per-rank timing info
moustakas committed Dec 26, 2024
1 parent 2e7e1f8 commit 122efa4
Showing 2 changed files with 54 additions and 11 deletions.
62 changes: 51 additions & 11 deletions podman/README.md
@@ -10,19 +10,17 @@ instructions below illustrate how we build the container for use at
All images are tagged and have been publicly checked into
[dockerhub/desihub](https://hub.docker.com/orgs/desihub/repositories).

## Log into dockerhub (optional)

When building a container, first log into `dockerhub` (credentials required):
```
podman-hpc login docker.io
Username:
Password:
```

## Build the base container

We first build a "base" container to hold our installation of
[MPICH](https://www.mpich.org/) and
[mpi4py](https://mpi4py.readthedocs.io/en/stable/). Using a base image allows us
to make changes to the top-level container without having to rebuild the base
image, which can take up to 20 minutes. (Note that a two-stage build does not
@@ -46,18 +44,57 @@ podman-hpc migrate desihub/fastspecfit:3.1.1
podman-hpc push desihub/fastspecfit:3.1.1
```
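
For reference, the full sequence from build to publication looks roughly
like the sketch below. The `podman-hpc build` step and the Containerfile
names are illustrative assumptions (and the build files are assumed to be
in the current directory); the `migrate` and `push` steps are the ones
shown above:
```
# Build the base image (MPICH + mpi4py), then the top-level image.
# File names are illustrative.
podman-hpc build --tag desihub/fastspecfit-base:1.0 --file Containerfile.base .
podman-hpc build --tag desihub/fastspecfit:3.1.1 --file Containerfile .

# Make the image available on the compute nodes, then publish to dockerhub.
podman-hpc migrate desihub/fastspecfit:3.1.1
podman-hpc push desihub/fastspecfit:3.1.1
```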

## Deploy the container

To deploy the container in production, please refer to the [FastSpecFit
documentation](https://fastspecfit.readthedocs.io/en/latest/) for the latest
details and instructions. Briefly, however, one can test the container on an
interactive node:
```
salloc -N 1 -C cpu -A desi -t 01:00:00 --qos interactive
podman-hpc pull desihub/fastspecfit:3.1.1
```

First, make sure `mpi4py` works with the [Cray-optimized version of
MPICH](https://docs.nersc.gov/development/containers/podman-hpc/overview/#using-cray-mpich-in-podman-hpc) by running
```
podman-hpc run --rm --mpi desihub/fastspecfit:3.1.1 python -m mpi4py --mpi-lib-version
MPI VERSION : CRAY MPICH version 8.1.22.12 (ANL base 3.4a2)
MPI BUILD INFO : Wed Nov 09 12:31 2022 (git hash cfc6f82)
```
and
```
srun --ntasks=4 podman-hpc run --rm --mpi desihub/fastspecfit:3.1.1 python -m mpi4py.bench helloworld
Hello, World! I am process 0 of 4 on nid200010.
Hello, World! I am process 1 of 4 on nid200010.
Hello, World! I am process 2 of 4 on nid200010.
Hello, World! I am process 3 of 4 on nid200010.
```
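
To confirm that MPI also works across nodes (the output above shows all four
ranks landing on a single node), one can repeat the test on two nodes. A
minimal sketch, assuming the same interactive QOS; the node count and
walltime are illustrative:
```
# Illustrative two-node variant of the hello-world test above.
salloc -N 2 -C cpu -A desi -t 00:30:00 --qos interactive
srun --nodes=2 --ntasks=8 podman-hpc run --rm --mpi desihub/fastspecfit:3.1.1 python -m mpi4py.bench helloworld
```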

Next, try running a couple of healpixels:
```
srun --ntasks=8 podman-hpc run --rm --mpi --group-add keep-groups --volume=/dvs_ro/cfs/cdirs:/dvs_ro/cfs/cdirs \
--volume=/global/cfs/cdirs:/global/cfs/cdirs --volume=$PSCRATCH:/scratch desihub/fastspecfit:3.1.1 mpi-fastspecfit \
--specprod=loa --survey=sv3 --program=dark --healpix=26278,26279 --ntargets=16 --mp=4 --outdir-data=/scratch/fasttest
```
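
Once the interactive test looks good, the same invocation can be wrapped in
a batch script. A minimal sketch, assuming the same image, mounts, and
`mpi-fastspecfit` arguments as the interactive example above; the
`--qos regular` and walltime values are illustrative:
```
#!/bin/bash
#SBATCH -N 1
#SBATCH -C cpu
#SBATCH -A desi
#SBATCH -t 01:00:00
#SBATCH --qos regular

# Identical to the interactive invocation above.
srun --ntasks=8 podman-hpc run --rm --mpi --group-add keep-groups \
  --volume=/dvs_ro/cfs/cdirs:/dvs_ro/cfs/cdirs \
  --volume=/global/cfs/cdirs:/global/cfs/cdirs \
  --volume=$PSCRATCH:/scratch desihub/fastspecfit:3.1.1 mpi-fastspecfit \
  --specprod=loa --survey=sv3 --program=dark --healpix=26278,26279 \
  --ntargets=16 --mp=4 --outdir-data=/scratch/fasttest
```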

## Handy commands

* List the available images:
```
podman-hpc images
REPOSITORY TAG IMAGE ID CREATED SIZE R/O
localhost/desihub/fastspecfit 3.1.1 31a5c5041d1d 50 minutes ago 1.67 GB true
localhost/desihub/fastspecfit-base 1.0 abd6485d4cb5 16 hours ago 719 MB true
```

* Check the installed versions of `mpich` and `mpi4py`:
```
podman-hpc run --rm desihub/fastspecfit:3.1.1 python -m mpi4py --version
mpi4py 4.0.1
```
and
```
podman-hpc run --rm desihub/fastspecfit:3.1.1 python -m mpi4py --mpi-lib-version
MPICH Version:3.4.3
MPICH Release date:Thu Dec 16 11:20:57 CST 2021
@@ -70,13 +107,16 @@ MPICH F77:gfortran -fallow-argument-mismatch -O2
MPICH FC:gfortran -fallow-argument-mismatch -O2
```

* To verify that the `numba` cache is being used correctly in the
production example above, set the `NUMBA_DEBUG_CACHE` environment variable
on the fly (see the filtering sketch after this list):
```
srun --ntasks=8 podman-hpc run --rm --mpi --group-add keep-groups --volume=/dvs_ro/cfs/cdirs:/dvs_ro/cfs/cdirs \
--volume=/global/cfs/cdirs:/global/cfs/cdirs --volume=$PSCRATCH:/scratch --env NUMBA_DEBUG_CACHE=1 desihub/fastspecfit:3.1.1 mpi-fastspecfit \
--specprod=loa --survey=sv3 --program=dark --healpix=26278,26279 --ntargets=16 --mp=4 --outdir-data=/scratch/fasttest
```

* Open an interactive shell in the container, mounting the read-only and
read-write CFS paths and scratch (the second command also sets a persistent
`numba` cache directory):
```
podman-hpc run --userns keep-id --group-add keep-groups --rm --volume=/dvs_ro/cfs/cdirs:/dvs_ro/cfs/cdirs --volume=/pscratch/sd/i/ioannis:/scratch desihub/fastspecfit:3.1.1 /bin/bash
podman-hpc run --userns keep-id --group-add keep-groups --rm --volume=/dvs_ro/cfs/cdirs:/dvs_ro/cfs/cdirs --volume=/global/cfs/cdirs:/global/cfs/cdirs --volume=/pscratch/sd/i/ioannis:/scratch --env NUMBA_CACHE_DIR=/scratch/numba_cache desihub/fastspecfit:3.1.1 /bin/bash
```

* Run the MPI hello-world test against the base image:
```
srun --ntasks=4 podman-hpc run --rm --mpi desihub/fastspecfit-base:1.0 python -m mpi4py.bench helloworld
```

* To delete an image:
```
podman-hpc rmi desihub/fastspecfit:3.1.1
```
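
For the `NUMBA_DEBUG_CACHE` item above, the debug output can be voluminous.
A sketch of one way to skim it, assuming the cache messages appear on the
job's standard output/error (the `grep` filter is illustrative):
```
# Illustrative: filter the NUMBA_DEBUG_CACHE output for cache-related messages.
srun --ntasks=8 podman-hpc run --rm --mpi --group-add keep-groups --volume=/dvs_ro/cfs/cdirs:/dvs_ro/cfs/cdirs \
  --volume=/global/cfs/cdirs:/global/cfs/cdirs --volume=$PSCRATCH:/scratch --env NUMBA_DEBUG_CACHE=1 desihub/fastspecfit:3.1.1 mpi-fastspecfit \
  --specprod=loa --survey=sv3 --program=dark --healpix=26278,26279 --ntargets=16 --mp=4 --outdir-data=/scratch/fasttest 2>&1 | grep -i cache
```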
3 changes: 3 additions & 0 deletions py/fastspecfit/fastspecfit.py
@@ -319,9 +319,12 @@ def fastspec(fastphot=False, fitstack=False, args=None, comm=None, verbose=False
# Each rank, including rank 0, iterates over each object and then sends
# the results to rank 0.
log.info(f'Rank {rank}: fitting {len(fitargs_onerank):,d} objects.')
t1 = time.time()
out = []
for fitarg_onerank in fitargs_onerank:
out.append(fastspec_one(**fitarg_onerank))
log.info(f'Rank {rank}: done fitting {len(out):,d} objects in ' + \
f'{(time.time()-t1) / 60.:.2f} minutes.')

if rank > 0:
#log.debug(f'Rank {rank} sending data on {len(out)} objects to rank 0.')
