# HPC Training
This guide covers running LUMINA on Argonne's Polaris and NERSC's Perlmutter supercomputers.
**Substitute the UPPERCASE placeholders for your environment.** The job scripts below contain `<UPPERCASE>` placeholders that you must replace before submitting:

- `<HPC_ACCOUNT>`: your allocation ID
- `<CONDA_ENV_PATH>`: path to your conda environment
- `<LUMINA_REPO_PATH>`: your local clone of `lumina-sdk`
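If you manage several job scripts, the placeholders can be filled non-interactively. A minimal sketch using `sed`; the file name `submit.sh` and all substituted values are illustrative:

```shell
# Demo input: a two-line stand-in for a real job script.
printf '#PBS -A <HPC_ACCOUNT>\ncd <LUMINA_REPO_PATH>\n' > submit.sh

# Replace the placeholders in place (values are examples).
sed -i \
  -e 's|<HPC_ACCOUNT>|MyProject|g' \
  -e 's|<CONDA_ENV_PATH>|/path/to/conda/env|g' \
  -e 's|<LUMINA_REPO_PATH>|/path/to/lumina-sdk|g' \
  submit.sh
```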
## Polaris (ALCF)

### Single-node script

Use the pre-built Polaris config:
```bash
#!/bin/bash
#PBS -l select=1:system=polaris
#PBS -l walltime=02:00:00
#PBS -q prod
#PBS -A <HPC_ACCOUNT>

module load conda
conda activate <CONDA_ENV_PATH>
cd <LUMINA_REPO_PATH>

export MASTER_ADDR=$(hostname).hsn.cm.polaris.alcf.anl.gov
export MASTER_PORT=29500

# One MPI rank per GPU: 4 ranks total on the single 4-GPU node.
mpiexec -n 4 -ppn 4 \
    python example/opf/train_opf_ddp.py \
    --config configs/config.polaris.ddp.yaml \
    --cases case14 case30 case118 \
    --group_ids 0 1 2 3 4
```
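`MASTER_ADDR` tells every DDP rank where to rendezvous. On Polaris, ranks communicate over the Slingshot high-speed network, so the launch node's short hostname gets the `hsn.cm.polaris.alcf.anl.gov` suffix; the construction can be seen in isolation:

```shell
# Append the Polaris high-speed-network domain to this node's hostname.
node=$(hostname)
MASTER_ADDR=${node}.hsn.cm.polaris.alcf.anl.gov
echo "$MASTER_ADDR"
```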
### Multi-node DDP script

Use the pre-built Polaris config:
```bash
#!/bin/bash
#PBS -l select=2:system=polaris
#PBS -l walltime=02:00:00
#PBS -q prod
#PBS -A <HPC_ACCOUNT>

module load conda
conda activate <CONDA_ENV_PATH>
cd <LUMINA_REPO_PATH>

# Count allocated nodes and launch one MPI rank per GPU across all of them.
NNODES=$(sort -u "$PBS_NODEFILE" | wc -l)
NGPUS_PER_NODE=4
NTOTGPUS=$((NNODES * NGPUS_PER_NODE))

export MASTER_ADDR=$(hostname).hsn.cm.polaris.alcf.anl.gov
export MASTER_PORT=29500

mpiexec -n ${NTOTGPUS} -ppn ${NGPUS_PER_NODE} \
    python example/opf/train_opf_ddp.py \
    --config configs/config.polaris.ddp.yaml \
    --cases case14 case118 case2000 \
    --group_ids 0 1 2 3 4 5 6 7 8 9
```
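The node-count arithmetic can be checked in isolation. Here `$PBS_NODEFILE` is simulated with a hand-written file; on a real allocation PBS provides it, typically with repeated entries per node:

```shell
# Simulate a $PBS_NODEFILE for a 2-node allocation (entries may repeat).
printf 'nid001\nnid001\nnid002\nnid002\n' > nodefile
PBS_NODEFILE=nodefile

# Deduplicate to count nodes, then multiply by GPUs per node.
NNODES=$(sort -u "$PBS_NODEFILE" | wc -l)
NGPUS_PER_NODE=4
NTOTGPUS=$((NNODES * NGPUS_PER_NODE))
echo "$NNODES nodes, $NTOTGPUS GPUs total"   # prints: 2 nodes, 8 GPUs total
```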
## Perlmutter (NERSC)

### Single-node script
```bash
#!/bin/bash
#SBATCH -N 1
#SBATCH -C gpu
# Perlmutter GPU nodes have 4 GPUs each; request all 4 on the single node.
#SBATCH -G 4
#SBATCH -t 02:00:00
#SBATCH -q regular
#SBATCH -A <HPC_ACCOUNT>

module load pytorch
cd <LUMINA_REPO_PATH>
pip install -e .

export SLURM_CPU_BIND="cores"
export MASTER_PORT=${MASTER_PORT:-29500}
export MASTER_ADDR=${MASTER_ADDR:-$(scontrol show hostnames "$SLURM_NODELIST" | head -n 1)}
export OMP_NUM_THREADS=32

srun --ntasks-per-node 4 --gpus-per-task 1 \
    python example/opf/train_opf_ddp.py \
    --config configs/config.perlmutter.ddp.yaml \
    --cases case14 case118 \
    --group_ids 0 1
```
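The `${VAR:-default}` expansions above make the script safe to re-run or override: a value already set in the environment is kept, and the fallback applies otherwise. A standalone illustration:

```shell
# No value set: the fallback after :- is used.
unset MASTER_PORT
export MASTER_PORT=${MASTER_PORT:-29500}
echo "$MASTER_PORT"   # prints 29500

# Value already set: the expansion leaves it untouched.
MASTER_PORT=12345
export MASTER_PORT=${MASTER_PORT:-29500}
echo "$MASTER_PORT"   # prints 12345
```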
### Multi-node DDP script
```bash
#!/bin/bash
#SBATCH -N 2
#SBATCH -C gpu
#SBATCH -G 8
#SBATCH -t 02:00:00
#SBATCH -q regular
#SBATCH -A <HPC_ACCOUNT>

module load pytorch
cd <LUMINA_REPO_PATH>
pip install -e .

export SLURM_CPU_BIND="cores"
export MASTER_PORT=${MASTER_PORT:-29500}
export MASTER_ADDR=${MASTER_ADDR:-$(scontrol show hostnames "$SLURM_NODELIST" | head -n 1)}
export OMP_NUM_THREADS=32

# -N defaults to the full allocation; pass it only to use a subset of nodes.
srun --ntasks-per-node 4 --gpus-per-task 1 \
    python example/opf/train_opf_ddp.py \
    --config configs/config.perlmutter.ddp.yaml \
    --cases case14 case118 case2000 \
    --group_ids 0 1 2 3 4
```
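On the login node, `scontrol show hostnames` expands a compressed Slurm nodelist like `nid[001-002]` to one hostname per line, and the script picks the first as the rendezvous address. A sketch of that selection, with a stub standing in for the real `scontrol` call:

```shell
# Stub for `scontrol show hostnames "$SLURM_NODELIST"`:
# the real command emits one hostname per line for the allocation.
hostnames() { printf 'nid001\nnid002\n'; }

# Rank 0's node becomes the rendezvous address for every DDP rank.
unset MASTER_ADDR
MASTER_ADDR=${MASTER_ADDR:-$(hostnames | head -n 1)}
echo "$MASTER_ADDR"   # prints nid001
```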
## Multi-Node Tips

- **Data staging:** use `data.staging.root` in the config to stage datasets to node-local storage (e.g., `$TMPDIR`).
- **Gradient accumulation:** set `training.accumulate_grad_batches` to simulate larger batch sizes.
- **Sharded datasets:** for large cases, pre-build shards with `scripts/opf_build_shards.py`.
- **W&B logging:** only rank 0 logs to W&B; enable it with the `--wandb` flag.
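For the staging and accumulation tips, a config fragment might look like the following. The key nesting mirrors the dotted paths above, but verify against the shipped `configs/*.yaml` for the exact schema:

```yaml
# Illustrative fragment; check the pre-built configs for the real schema.
data:
  staging:
    root: ${TMPDIR}            # stage datasets to node-local storage
training:
  accumulate_grad_batches: 4   # effective batch = 4 x per-step batch
```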
## Existing HPC Documentation
Additional system-specific docs: