Multi-node training on Polaris (ALCF)¶
Polaris has 4 NVIDIA A100 GPUs per node and uses PBS Pro for job scheduling. This notebook is illustrative — the cells are not executed in CI because they need an ALCF allocation. Run them on a Polaris login node.
Reference: Polaris user guide.
!!! note "Replace the UPPERCASE placeholders with values for your environment"
The cells below contain `<UPPERCASE>` placeholders that you must fill in before running (a `sed` sketch for doing this in bulk follows the list):
- `<CONDA_ENV_PATH>` — path where you'll create / find your conda env
- `<LUMINA_REPO_PATH>` — your local clone of `lumina-sdk`
- `<DATASET_ROOT>` — root of your OPFData stage
- `<HPC_ACCOUNT>` — your ALCF allocation
- `<WANDB_PROJECT>` — your Weights & Biases project name
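Once `job-polaris.sh` exists (section 4), one way to fill the placeholders in bulk is a `sed` pass. This is a sketch only: every value below is illustrative, not a real path or allocation.

%%bash
# Illustrative values; substitute your own paths, allocation, and project
sed -i \
    -e 's|<CONDA_ENV_PATH>|/eagle/myproject/envs/lumina|g' \
    -e 's|<LUMINA_REPO_PATH>|/eagle/myproject/lumina-sdk|g' \
    -e 's|<HPC_ACCOUNT>|MyAllocation|g' \
    -e 's|<WANDB_PROJECT>|lumina-opf|g' \
    job-polaris.sh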
1. Environment setup¶
Build the conda env once, then `conda activate` it from your job script. Putting the env on `/eagle` keeps it reachable from compute nodes.
%%bash
module load conda
conda create -p <CONDA_ENV_PATH> python=3.11 -y
conda activate <CONDA_ENV_PATH>
# torch + PyG: pick wheels built for the CUDA runtime available on Polaris
pip install torch torch-geometric
cd <LUMINA_REPO_PATH>
pip install -e ".[acopf,hps]"
2. Stage the dataset¶
Reading raw OPFData from Lustre on every step kills throughput. Pre-stage a copy to node-local SSD and point `data.staging.root` at `$TMPDIR`.
%%bash
# Inside a PBS job
mkdir -p $TMPDIR/OPFData
cp -r <DATASET_ROOT>/OPFData/processed/dataset_release_1/pglib_opf_case14_ieee \
$TMPDIR/OPFData/
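One caveat worth a sketch: a plain `cp` executes only on the node running the job script, so on multi-node jobs the other nodes' `$TMPDIR` stays empty. Launching one copy process per node with the same `mpiexec` used in section 4 fixes that (assuming `$PBS_NODEFILE` and `$TMPDIR` behave as on Polaris):

%%bash
# Inside a PBS job: stage once per node (one process per node, not per rank)
NNODES=$(sort -u $PBS_NODEFILE | wc -l)
mpiexec -n ${NNODES} -ppn 1 bash -c \
    'mkdir -p $TMPDIR/OPFData && cp -r <DATASET_ROOT>/OPFData/processed/dataset_release_1/pglib_opf_case14_ieee $TMPDIR/OPFData/'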
3. Configure for Polaris¶
Use `configs/config.polaris.ddp.yaml` as the starting point. Override what changes between jobs (cases, `group_ids`, batch size) on the CLI.
%%bash
head -40 configs/config.polaris.ddp.yaml
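Before queuing a multi-node job, a cheap smoke test catches most config mistakes. The sketch below assumes `train_opf_ddp.py` can run as a single process when launched without `mpiexec`; if it cannot, run it inside a one-node interactive job instead. The flags are the same ones the job script in section 4 uses.

%%bash
python example/opf/train_opf_ddp.py \
    --config configs/config.polaris.ddp.yaml \
    --cases case14 \
    --group_ids 0 \
    --model_type HeteroGNN \
    --loss_type mse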
4. PBS job script¶
Save as `job-polaris.sh`.
%%writefile job-polaris.sh
#!/bin/bash
#PBS -l select=4:system=polaris
#PBS -l walltime=02:00:00
#PBS -l filesystems=home:eagle
#PBS -q prod
#PBS -A <HPC_ACCOUNT>
#PBS -N lumina
module load conda
conda activate <CONDA_ENV_PATH>
cd <LUMINA_REPO_PATH>
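# Derive the process layout from the PBS nodefile (one entry per allocated node)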
NNODES=$(cat $PBS_NODEFILE | sort | uniq | wc -l)
NGPUS_PER_NODE=4
NTOTGPUS=$((NNODES * NGPUS_PER_NODE))
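# Rendezvous endpoint for torch.distributed: PBS runs this script on the first
# allocated node, so its high-speed-network (HSN) hostname serves as the master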
export MASTER_ADDR=$(hostname).hsn.cm.polaris.alcf.anl.gov
export MASTER_PORT=29500
mpiexec -n ${NTOTGPUS} -ppn ${NGPUS_PER_NODE} \
python example/opf/train_opf_ddp.py \
--config configs/config.polaris.ddp.yaml \
--cases case14 case118 case2000 \
--group_ids 0 1 2 3 4 \
--model_type HeteroGNN \
--loss_type mse \
--wandb \
--wandb_project <WANDB_PROJECT>
5. Submit¶
%%bash
qsub job-polaris.sh
qstat -u $USER
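Two monitoring notes, sketched below: `qstat -f` shows the scheduler's comment field, which usually explains why a job is still queued, and PBS delivers output to the submission directory as `<jobname>.o<jobid>` / `<jobname>.e<jobid>` (by default only after the job ends; `qsub -k doe` streams it live). `<JOBID>` here is whatever `qsub` printed.

%%bash
qstat -f <JOBID> | grep -E 'job_state|comment'  # why is it queued / where is it running
ls -l lumina.o* lumina.e*                       # PBS output files: <jobname>.o<jobid>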
6. Operational tips¶
- **Rendezvous failures** usually mean the master node is unreachable or the port is taken. Pick a different `MASTER_PORT` and re-submit.
- **Throughput drop after the first epoch.** Almost always I/O: confirm `data.staging.root` is set and the dataset really landed on `$TMPDIR`.
- **Imbalanced load across cases.** With multi-case training, set `training.case_sampling='proportional'` so larger cases get more steps.
- **Checkpointing.** Write to `/eagle` (not `/home`); rank-0-only writes are fine for HeteroGNN sizes typical here.
- **W&B offline.** Polaris compute nodes have no internet; set `WANDB_MODE=offline` and run `wandb sync` from a login node afterwards (see the sketch after this list). See W&B Sweeps on Perlmutter for a worked example.
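For the offline W&B workflow in the last tip, a minimal sketch: force offline mode inside the job, then push the runs from a login node (which has internet). The `wandb/offline-run-*` directory pattern is W&B's default layout for offline runs.

%%bash
# In job-polaris.sh, before mpiexec:
export WANDB_MODE=offline

# Afterwards, from a login node:
cd <LUMINA_REPO_PATH>
wandb sync wandb/offline-run-*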