DDP training on local GPUs¶
LUMINA's training entry point (example/opf/train_opf_ddp.py) is built around torchrun and PyTorch's DistributedDataParallel. This notebook shows how to launch a multi-GPU run on a single workstation — useful for development before moving to HPC.
DDP cannot run inside a single Jupyter kernel: each rank needs its own Python process. We launch torchrun as a subprocess, capture its output, and print the tail of its logs.
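For orientation, this is roughly what each rank does once torchrun has spawned it. It is a minimal sketch, not the exact code in example/opf/train_opf_ddp.py; the helper names setup_ddp and wrap_model are illustrative.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_ddp():
    # torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK for every process it spawns.
    local_rank = int(os.environ['LOCAL_RANK'])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend='nccl')
    return local_rank

def wrap_model(model, local_rank):
    # Each rank holds a full replica; gradients are all-reduced during backward().
    model = model.to(local_rank)
    return DDP(model, device_ids=[local_rank])
```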
import torch
n_gpus = torch.cuda.device_count()
print(f'CUDA available: {torch.cuda.is_available()}')
print(f'GPUs visible: {n_gpus}')
for i in range(n_gpus):
    print(f' [{i}] {torch.cuda.get_device_name(i)}')
1. Pick a config¶
configs/config.yaml is the canonical reference. For a workstation, lower training.batch_size and loader.num_workers so the run fits the smaller machine.
from pathlib import Path
import yaml
# TODO: adjust REPO to point at your LUMINA checkout if needed
REPO = Path('./')
config_src = REPO / 'configs' / 'config.yaml'
with config_src.open() as f:
    cfg = yaml.safe_load(f)
cfg.setdefault('training', {})['batch_size'] = 32
cfg.setdefault('loader', {})['num_workers'] = 2
cfg.setdefault('training', {}).setdefault('epochs', 5)
config_local = REPO / 'configs' / 'config.local.ddp.yaml'
with config_local.open('w') as f:
    yaml.safe_dump(cfg, f, sort_keys=False)
print('wrote', config_local)
2. Launch via torchrun¶
We use --standalone so torchrun handles rendezvous on a free local port. --nproc_per_node should match the number of GPUs you want to use.
import subprocess, sys
nproc = max(1, min(n_gpus, 2)) # cap at 2 for the demo
cmd = [
    'torchrun',
    '--standalone',
    f'--nproc_per_node={nproc}',
    'example/opf/train_opf_ddp.py',
    '--config', str(config_local),
    '--cases', 'case14',
    '--group_ids', '0',
    '--model_type', 'HeteroGNN',
    '--loss_type', 'mse',
]
print(' '.join(cmd))
proc = subprocess.run(cmd, cwd=REPO, capture_output=True, text=True)
print('---- stdout (tail) ----')
print('\n'.join(proc.stdout.splitlines()[-30:]))
print('---- stderr (tail) ----')
print('\n'.join(proc.stderr.splitlines()[-15:]))
print('exit:', proc.returncode)
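If you prefer to watch the logs live rather than only printing the tail afterwards, a subprocess.Popen variant streams output line by line. This is a sketch reusing the cmd and REPO defined above:

```python
# Optional: stream torchrun output live instead of capturing it.
import subprocess

with subprocess.Popen(
    cmd,
    cwd=REPO,
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,  # interleave stderr with stdout
    text=True,
) as proc:
    for line in proc.stdout:
        print(line, end='')
print('exit:', proc.returncode)
```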
3. Common pitfalls¶
- Hangs at startup. --standalone picks a port; if you re-run while a previous run is still cleaning up, you can hit Address already in use. Wait ~30s or set --rdzv_endpoint=localhost:0.
- OOM on rank 0 only. W&B logging happens only on rank 0; if you're also checkpointing on that rank, lower training.batch_size or loader.num_workers.
- Dataloader pickling errors. Custom transforms must be top-level (no lambdas, no closures). Move helpers into a module that workers can import (see the sketch after this list).
- Different GPUs. Heterogeneous local GPUs (e.g., 3090 + A100) work, but throughput is gated by the slowest rank.
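For the pickling point, the fix is to define the transform at module level in a file the workers can import. A hypothetical example; the function name and sample fields are made up, not LUMINA's actual transforms:

```python
# transforms.py -- importable by DataLoader worker processes.

def normalize_loads(sample):
    # Top-level function: picklable, so num_workers > 0 works with spawned workers.
    sample['load'] = sample['load'] / sample['load'].max()
    return sample

# Avoid defining the transform inline in a notebook cell or inside another function:
#   transform = lambda s: {**s, 'load': s['load'] / s['load'].max()}
# Lambdas and closures cannot be pickled when workers are started with 'spawn'.
```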
4. Next: scale to a cluster¶
The Polaris notebook (06) shows how this same script runs on multiple nodes via PBS + torchrun --rdzv_backend=c10d --rdzv_endpoint.
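As a preview, the multi-node launch mirrors the local command but swaps --standalone for an explicit rendezvous endpoint. The node count, hostname, and port below are placeholders, not values from the Polaris setup:

```python
# Placeholder multi-node variant of the cmd list built above.
multi_node_cmd = [
    'torchrun',
    '--nnodes=2',                                      # illustrative node count
    '--nproc_per_node=4',                              # GPUs per node
    '--rdzv_backend=c10d',
    '--rdzv_endpoint=head-node.example.com:29500',     # hypothetical head node + port
    'example/opf/train_opf_ddp.py',
    '--config', str(config_local),
    '--cases', 'case14',
    '--group_ids', '0',
    '--model_type', 'HeteroGNN',
    '--loss_type', 'mse',
]
```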