# W&B Sweep on Perlmutter
This guide shows the pattern we use for a W&B sweep of LUMINA training on NERSC Perlmutter:
- A sweep config (`opf_<model>.yaml`) that defines the hyper-parameter grid and the per-run command.
- A per-run launcher script that the W&B agent invokes once per trial — it sets up DDP and calls `example/opf/train_opf_ddp.py` with the sweep overrides.
- A SLURM job script that allocates nodes and runs one or more `wandb agent` processes.
The example files mentioned below (`configs/sweeps/...`, `scripts/perlmutter/...`) are not shipped with the repo — they're git-ignored because they contain account-specific settings. Use the templates in this guide as a starting point.
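Concretely, the guide assumes a layout like this inside `<LUMINA_REPO_PATH>` (the names match the steps below; adjust freely as long as the sweep YAML and job script agree):

```text
<LUMINA_REPO_PATH>/
├── configs/
│   └── sweeps/
│       └── opf_hgt.yaml          # sweep definition (Step 1)
└── scripts/
    └── perlmutter/
        └── sweep/
            ├── launch.hgt.sh     # per-run launcher (Step 2)
            └── launch.sweep.sl   # SLURM agent job (Step 3)
```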
**Replace the UPPERCASE placeholders with values for your environment.** The snippets below use `<UPPERCASE>` placeholders that you must fill in before submitting:
- `<HPC_ACCOUNT>` — your NERSC allocation (e.g. `m1234`)
- `<CONDA_ENV_PATH>` — your conda env on `${CFS}` or `${PSCRATCH}`
- `<LUMINA_REPO_PATH>` — your local clone of `lumina-sdk`
- `<WANDB_ENTITY>`, `<WANDB_PROJECT>`, `<SWEEP_ID>` — your W&B identifiers
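One quick way to confirm nothing was missed, assuming the files live at the paths used in this guide:

```bash
# Flag any <UPPERCASE> placeholders still present in the sweep files.
grep -rn '<[A-Z_]\+>' configs/sweeps/ scripts/perlmutter/sweep/
```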
## Step 1: Define the sweep
Create a YAML under `configs/sweeps/` (gitignored). The `command` block is what the W&B agent runs for each trial — it points at your per-run launcher script.
```yaml
# configs/sweeps/opf_hgt.yaml
program: scripts/perlmutter/sweep/launch.hgt.sh
method: bayes
metric:
  name: val/loss/total
  goal: minimize
command:
  - ${env}
  - ${program}
  - ${args}
parameters:
  models.HGT.hidden_channels:
    values: [128, 256, 512]
  models.HGT.num_layers:
    values: [4, 6, 8]
  models.HGT.num_heads:
    values: [2, 4, 8]
  models.HGT.dropout:
    values: [0.0, 0.1, 0.2]
  optimizer.AdamW.lr:
    values: [1.0e-4, 5.0e-4, 1.0e-3]
  wandb_project:
    value: <WANDB_PROJECT>
```
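With this `command` block, the agent expands `${args}` into one `--key=value` flag per parameter, so each trial effectively runs something like the following (values are illustrative):

```bash
# What the agent executes for one trial (sampled values will vary).
scripts/perlmutter/sweep/launch.hgt.sh \
  --models.HGT.hidden_channels=256 \
  --models.HGT.num_layers=6 \
  --models.HGT.num_heads=4 \
  --models.HGT.dropout=0.1 \
  --optimizer.AdamW.lr=0.0005 \
  --wandb_project=<WANDB_PROJECT>
```

The launcher passes these through untouched via `"$@"` (Step 2), so the training script sees exactly the overrides the agent sampled.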
Register the sweep on a login node:
```bash
export WANDB_API_KEY=<YOUR_KEY>
wandb sweep configs/sweeps/opf_hgt.yaml
# prints a sweep path like <WANDB_ENTITY>/<WANDB_PROJECT>/<SWEEP_ID>
```
## Step 2: Per-run launcher (`launch.hgt.sh`)
This script is invoked by the W&B agent for each trial. It uses the SLURM allocation already granted to the agent's job and forwards the W&B-injected hyper-parameter overrides via `"$@"`.
```bash
#!/bin/bash
set -euo pipefail

# NOTE: run from <LUMINA_REPO_PATH>; all paths below are repo-relative.
export SLURM_CPU_BIND="cores"
export MASTER_PORT=${MASTER_PORT:-29500}
# Rendezvous on the first node of this job's allocation.
export MASTER_ADDR=${MASTER_ADDR:-$(scontrol show hostnames "$SLURM_NODELIST" | head -n 1)}
export OMP_NUM_THREADS=32

module load conda
conda activate <CONDA_ENV_PATH>

# One torchrun per node; torchrun then spawns one rank per GPU.
srun --ntasks="$SLURM_JOB_NUM_NODES" --ntasks-per-node=1 \
  python -m torch.distributed.run \
    --nnodes="$SLURM_JOB_NUM_NODES" \
    --nproc_per_node="$SLURM_GPUS_ON_NODE" \
    --rdzv_backend=c10d \
    --rdzv_endpoint="$MASTER_ADDR:$MASTER_PORT" \
    example/opf/train_opf_ddp.py \
      --config configs/config.perlmutter.ddp.yaml \
      --cases case30 \
      --group_ids 0 1 \
      --model_type HGT \
      --hetero_model_config configs/model/heterognn.yaml \
      --loss_type mse \
      --wandb \
      "$@" # W&B-injected sweep overrides, e.g. --models.HGT.hidden_channels=256
```
Make it executable: `chmod +x scripts/perlmutter/sweep/launch.hgt.sh`.
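Before launching the full sweep, it can be worth smoke-testing the launcher by hand. The snippet below is a sketch: it assumes the interactive QOS is available to your account and that the training script accepts the same `--key=value` overrides the agent would inject:

```bash
# Grab one interactive GPU node (same shape as one array task in Step 3).
salloc -A <HPC_ACCOUNT> -C gpu -q interactive -t 30:00 -N 1 \
       --ntasks-per-node=1 -c 32 --gpus-per-task=4 --gpu-bind=none

# Inside the allocation, run a single trial with a sample override.
cd <LUMINA_REPO_PATH>
export WANDB_API_KEY=<YOUR_KEY>
./scripts/perlmutter/sweep/launch.hgt.sh --models.HGT.hidden_channels=128
```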
## Step 3: SLURM agent job (`launch.sweep.sl`)
Each SLURM array task runs one `wandb agent`, which in turn invokes the launcher above for each trial pulled from the sweep queue.
```bash
#!/bin/bash
#SBATCH -A <HPC_ACCOUNT>
#SBATCH -C gpu
#SBATCH -q regular
#SBATCH -t 2:30:00
#SBATCH -N 1
#SBATCH --ntasks-per-node=1
#SBATCH -c 32
#SBATCH --gpus-per-task=4
#SBATCH --gpu-bind=none
#SBATCH --array=0-49

export SLURM_CPU_BIND="cores"
export MASTER_PORT=29500
export MASTER_ADDR=$(hostname)
export OMP_NUM_THREADS=32

module load conda
conda activate <CONDA_ENV_PATH>

# Total runs = (array size) * --count; here 50 tasks * 1 trial each = 50 runs.
wandb agent --count 1 <WANDB_ENTITY>/<WANDB_PROJECT>/<SWEEP_ID>
```
Submit from `<LUMINA_REPO_PATH>`, so the repo-relative paths in the sweep YAML resolve:
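```bash
sbatch scripts/perlmutter/sweep/launch.sweep.sl
```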
## Monitoring
- `sqs` — your job queue status
- W&B UI — sweep dashboard under `<WANDB_ENTITY>/<WANDB_PROJECT>`
- Logs — `slurm-<jobid>_<arrayid>.out` in the submission directory
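From a login node, something like the following covers the basics; `<jobid>` is the array master ID printed by `sbatch`, and the `sacct` flags are standard Slurm:

```bash
sqs                                               # queue state for your jobs
sacct -X -j <jobid> --format=JobID,State,Elapsed  # per-array-task state
tail -f slurm-<jobid>_0.out                       # follow array task 0's log
```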
## Common pitfalls
- `WANDB_API_KEY` not set on compute nodes. Either export it in your job script or store it in `~/.netrc` so the agent picks it up.
- Sweep config path mismatch. The `program:` line in the sweep YAML is interpreted relative to the working directory of the agent, not the sweep config file. Use a path from `<LUMINA_REPO_PATH>` (e.g. `scripts/perlmutter/sweep/launch.hgt.sh`).
- Idle agents. If `wandb agent --count 1` finishes faster than expected, check that the trial actually ran (look at `slurm-*.out`). Sweep config syntax errors cause agents to exit immediately without running anything.
- Disk locality. For large cases, stage the dataset to `$PSCRATCH` and point `root` in the perlmutter config at it; reading raw OPFData over CFS will bottleneck training (see the sketch below).
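A minimal staging sketch for the last point; the CFS source path is illustrative, and where exactly `root` lives in your config is an assumption to adapt:

```bash
# Stage the dataset from CFS to scratch (source path is illustrative).
mkdir -p "$PSCRATCH/opfdata"
rsync -a "$CFS/<HPC_ACCOUNT>/opfdata/" "$PSCRATCH/opfdata/"

# Then point the dataset root in configs/config.perlmutter.ddp.yaml at it:
#   root: $PSCRATCH/opfdata
```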