DDP training on local GPUs¶
LUMINA's training entry point (example/opf/train_opf_ddp.py) is built around torchrun and PyTorch's DistributedDataParallel. This notebook shows how to launch a multi-GPU run on a single workstation — useful for development before moving to HPC.
DDP cannot run inside a single Jupyter kernel: each rank needs its own Python process. We launch torchrun as a subprocess, capture its output, and print the tail of its logs.
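For orientation, this is roughly what each rank does once torchrun has spawned it. It is a minimal sketch, not the exact code in example/opf/train_opf_ddp.py; the helper names setup_ddp and wrap_model are illustrative.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_ddp():
    # torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK for every process it spawns.
    local_rank = int(os.environ['LOCAL_RANK'])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend='nccl')
    return local_rank

def wrap_model(model, local_rank):
    # Each rank holds a full replica; gradients are all-reduced during backward().
    model = model.to(local_rank)
    return DDP(model, device_ids=[local_rank])
```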
import torch
n_gpus = torch.cuda.device_count()
print(f'CUDA available: {torch.cuda.is_available()}')
print(f'GPUs visible: {n_gpus}')
for i in range(n_gpus):
    print(f' [{i}] {torch.cuda.get_device_name(i)}')
1. Pick a config¶
configs/config.yaml is the canonical reference. For a workstation, lower training.batch_size and loader.num_workers so the run fits the smaller machine.
from pathlib import Path
import yaml
# TODO: adjust REPO to point at your LUMINA checkout if needed
REPO = Path('./')
config_src = REPO / 'configs' / 'config.yaml'
with config_src.open() as f:
    cfg = yaml.safe_load(f)
cfg.setdefault('training', {})['batch_size'] = 32
cfg.setdefault('loader', {})['num_workers'] = 2
cfg.setdefault('training', {}).setdefault('epochs', 5)
config_local = REPO / 'configs' / 'config.local.ddp.yaml'
with config_local.open('w') as f:
    yaml.safe_dump(cfg, f, sort_keys=False)
print('wrote', config_local)
2. Launch via torchrun¶
We use --standalone so torchrun handles rendezvous on a free local port. --nproc_per_node should match the number of GPUs you want to use.
import subprocess, sys
nproc = max(1, min(n_gpus, 2)) # cap at 2 for the demo
cmd = [
    'torchrun',
    '--standalone',
    f'--nproc_per_node={nproc}',
    'example/opf/train_opf_ddp.py',
    '--config', str(config_local),
    '--cases', 'case14',
    '--group_ids', '0',
    '--model_type', 'HeteroGNN',
    '--loss_type', 'mse',
]
print(' '.join(cmd))
proc = subprocess.run(cmd, cwd=REPO, capture_output=True, text=True)
print('---- stdout (tail) ----')
print('\n'.join(proc.stdout.splitlines()[-30:]))
print('---- stderr (tail) ----')
print('\n'.join(proc.stderr.splitlines()[-15:]))
print('exit:', proc.returncode)
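If you prefer to watch the logs live rather than only printing the tail afterwards, a subprocess.Popen variant streams output line by line. This is a sketch reusing the cmd and REPO defined above:

```python
# Optional: stream torchrun output live instead of capturing it.
import subprocess

with subprocess.Popen(
    cmd,
    cwd=REPO,
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,  # interleave stderr with stdout
    text=True,
) as proc:
    for line in proc.stdout:
        print(line, end='')
print('exit:', proc.returncode)
```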
3. Common pitfalls¶
- Hangs at startup. --standalone picks a port; if you re-run while a previous run is still cleaning up, you can hit Address already in use. Wait ~30s or set --rdzv_endpoint=localhost:0.
- OOM on rank 0 only. W&B logging happens only on rank 0; if you're also checkpointing on that rank, lower training.batch_size or loader.num_workers.
- Dataloader pickling errors. Custom transforms must be top-level (no lambdas, no closures). Move helpers into a module that workers can import (see the sketch after this list).
- Different GPUs. Heterogeneous local GPUs (e.g., 3090 + A100) work, but throughput is gated by the slowest rank.
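For the pickling point, the fix is to define the transform at module level in a file the workers can import. A hypothetical example; the function name and sample fields are made up, not LUMINA's actual transforms:

```python
# transforms.py -- importable by DataLoader worker processes.

def normalize_loads(sample):
    # Top-level function: picklable, so num_workers > 0 works with spawned workers.
    sample['load'] = sample['load'] / sample['load'].max()
    return sample

# Avoid defining the transform inline in a notebook cell or inside another function:
#   transform = lambda s: {**s, 'load': s['load'] / s['load'].max()}
# Lambdas and closures cannot be pickled when workers are started with 'spawn'.
```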
4. Next: scale to a cluster¶
The Polaris notebook (06) shows how this same script runs on multiple nodes via PBS + torchrun --rdzv_backend=c10d --rdzv_endpoint.
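As a preview, the multi-node launch mirrors the local command but swaps --standalone for an explicit rendezvous endpoint. The node count, hostname, and port below are placeholders, not values from the Polaris setup:

```python
# Placeholder multi-node variant of the cmd list built above.
multi_node_cmd = [
    'torchrun',
    '--nnodes=2',                                      # illustrative node count
    '--nproc_per_node=4',                              # GPUs per node
    '--rdzv_backend=c10d',
    '--rdzv_endpoint=head-node.example.com:29500',     # hypothetical head node + port
    'example/opf/train_opf_ddp.py',
    '--config', str(config_local),
    '--cases', 'case14',
    '--group_ids', '0',
    '--model_type', 'HeteroGNN',
    '--loss_type', 'mse',
]
```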