Multi-node training on Polaris (ALCF)¶
Polaris has 4 NVIDIA A100 GPUs per node and uses PBS Pro for job scheduling. This notebook is illustrative — the cells are not executed in CI because they need an ALCF allocation. Run them on a Polaris login node.
Reference: Polaris user guide.
!!! note "Replace the UPPERCASE placeholders with values for your environment"
The cells below contain `<UPPERCASE>` placeholders that you must fill in before running (a `sed` sketch for doing this in bulk follows the list):
- `<CONDA_ENV_PATH>` — path where you'll create / find your conda env
- `<LUMINA_REPO_PATH>` — your local clone of `lumina-sdk`
- `<DATASET_ROOT>` — root of your OPFData stage
- `<HPC_ACCOUNT>` — your ALCF allocation
- `<WANDB_PROJECT>` — your Weights & Biases project name
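Once `job-polaris.sh` exists (section 4), one way to fill the placeholders in bulk is a `sed` pass. This is a sketch only: every value below is illustrative, not a real path or allocation.

%%bash
# Illustrative values; substitute your own paths, allocation, and project
sed -i \
    -e 's|<CONDA_ENV_PATH>|/eagle/myproject/envs/lumina|g' \
    -e 's|<LUMINA_REPO_PATH>|/eagle/myproject/lumina-sdk|g' \
    -e 's|<HPC_ACCOUNT>|MyAllocation|g' \
    -e 's|<WANDB_PROJECT>|lumina-opf|g' \
    job-polaris.sh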
1. Environment setup¶
Build the conda env once, then `conda activate` it from your job script. Putting the env on `/eagle` keeps it reachable from compute nodes.
%%bash
module load conda
conda create -p <CONDA_ENV_PATH> python=3.11 -y
conda activate <CONDA_ENV_PATH>
# torch + PyG: pick wheels built for the CUDA runtime available on Polaris
pip install torch torch-geometric
cd <LUMINA_REPO_PATH>
pip install -e ".[acopf,hps]"
2. Stage the dataset¶
Reading raw OPFData from Lustre on every step kills throughput. Pre-stage a copy to node-local SSD and point `data.staging.root` at `$TMPDIR`.
%%bash
# Inside a PBS job
mkdir -p $TMPDIR/OPFData
cp -r <DATASET_ROOT>/OPFData/processed/dataset_release_1/pglib_opf_case14_ieee \
$TMPDIR/OPFData/
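One caveat worth a sketch: a plain `cp` executes only on the node running the job script, so on multi-node jobs the other nodes' `$TMPDIR` stays empty. Launching one copy process per node with the same `mpiexec` used in section 4 fixes that (assuming `$PBS_NODEFILE` and `$TMPDIR` behave as on Polaris):

%%bash
# Inside a PBS job: stage once per node (one process per node, not per rank)
NNODES=$(sort -u $PBS_NODEFILE | wc -l)
mpiexec -n ${NNODES} -ppn 1 bash -c \
    'mkdir -p $TMPDIR/OPFData && cp -r <DATASET_ROOT>/OPFData/processed/dataset_release_1/pglib_opf_case14_ieee $TMPDIR/OPFData/'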
3. Configure for Polaris¶
Use `configs/config.polaris.ddp.yaml` as the starting point. Override what changes between jobs (cases, `group_ids`, batch size) on the CLI.
%%bash
head -40 configs/config.polaris.ddp.yaml
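Before queuing a multi-node job, a cheap smoke test catches most config mistakes. The sketch below assumes `train_opf_ddp.py` can run as a single process when launched without `mpiexec`; if it cannot, run it inside a one-node interactive job instead. The flags are the same ones the job script in section 4 uses.

%%bash
python example/opf/train_opf_ddp.py \
    --config configs/config.polaris.ddp.yaml \
    --cases case14 \
    --group_ids 0 \
    --model_type HeteroGNN \
    --loss_type mse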
4. PBS job script¶
Save as `job-polaris.sh`.
%%writefile job-polaris.sh
#!/bin/bash
#PBS -l select=4:system=polaris
#PBS -l walltime=02:00:00
#PBS -l filesystems=home:eagle
#PBS -q prod
#PBS -A <HPC_ACCOUNT>
#PBS -N lumina
module load conda
conda activate <CONDA_ENV_PATH>
cd <LUMINA_REPO_PATH>
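# Derive the process layout from the PBS nodefile (one entry per allocated node)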
NNODES=$(cat $PBS_NODEFILE | sort | uniq | wc -l)
NGPUS_PER_NODE=4
NTOTGPUS=$((NNODES * NGPUS_PER_NODE))
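# Rendezvous endpoint for torch.distributed: PBS runs this script on the first
# allocated node, so its high-speed-network (HSN) hostname serves as the master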
export MASTER_ADDR=$(hostname).hsn.cm.polaris.alcf.anl.gov
export MASTER_PORT=29500
mpiexec -n ${NTOTGPUS} -ppn ${NGPUS_PER_NODE} \
python example/opf/train_opf_ddp.py \
--config configs/config.polaris.ddp.yaml \
--cases case14 case118 case2000 \
--group_ids 0 1 2 3 4 \
--model_type HeteroGNN \
--loss_type mse \
--wandb \
--wandb_project <WANDB_PROJECT>
5. Submit¶
%%bash
qsub job-polaris.sh
qstat -u $USER
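Two monitoring notes, sketched below: `qstat -f` shows the scheduler's comment field, which usually explains why a job is still queued, and PBS delivers output to the submission directory as `<jobname>.o<jobid>` / `<jobname>.e<jobid>` (by default only after the job ends; `qsub -k doe` streams it live). `<JOBID>` here is whatever `qsub` printed.

%%bash
qstat -f <JOBID> | grep -E 'job_state|comment'  # why is it queued / where is it running
ls -l lumina.o* lumina.e*                       # PBS output files: <jobname>.o<jobid>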
6. Operational tips¶
- **Rendezvous failures** usually mean the master node is unreachable or the port is taken. Pick a different `MASTER_PORT` and re-submit.
- **Throughput drop after the first epoch.** Almost always I/O: confirm `data.staging.root` is set and the dataset really landed on `$TMPDIR`.
- **Imbalanced load across cases.** With multi-case training, set `training.case_sampling='proportional'` so larger cases get more steps.
- **Checkpointing.** Write to `/eagle` (not `/home`); rank-0-only writes are fine for HeteroGNN sizes typical here.
- **W&B offline.** Polaris compute nodes have no internet; set `WANDB_MODE=offline` and run `wandb sync` from a login node afterwards (see the sketch after this list). See W&B Sweeps on Perlmutter for a worked example.
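For the offline W&B workflow in the last tip, a minimal sketch: force offline mode inside the job, then push the runs from a login node (which has internet). The `wandb/offline-run-*` directory pattern is W&B's default layout for offline runs.

%%bash
# In job-polaris.sh, before mpiexec:
export WANDB_MODE=offline

# Afterwards, from a login node:
cd <LUMINA_REPO_PATH>
wandb sync wandb/offline-run-*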