Trainer API

Trainers

BaseOPFTrainer

Base trainer for AC Optimal Power Flow GNN models with DDP support.

Provides the shared training infrastructure used by both single-case and multi-case trainers, including optimizer/scheduler initialization, gradient clipping, non-finite loss handling, W&B logging, checkpoint management, sample-based scheduling, and throughput measurement.

Subclasses must implement _load_data, _create_dataloaders, _create_model, _initialize_loss_managers, _checkpoint_tag, train_epoch, and validate.
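
For orientation, a minimal sketch of that subclass contract follows. The class name and method bodies are hypothetical placeholders, not the shipped single- or multi-case trainers; it shows which hooks the base `__init__` invokes (in this order) and roughly what each is expected to set up.

```python
# Hypothetical sketch -- names and bodies are placeholders, not the shipped
# trainers. BaseOPFTrainer.__init__ calls _load_data, _create_dataloaders,
# _create_model, and _initialize_loss_managers in that order, so each hook
# must leave behind the attributes the later steps need.
class MyOPFTrainer(BaseOPFTrainer):
    def _load_data(self):
        self.dataset = ...            # build/load the OPF dataset

    def _create_dataloaders(self):
        self.train_loader = ...       # DataLoader (+ DistributedSampler for DDP)
        self.val_loader = ...

    def _create_model(self):
        sample = self.dataset[0]
        out_dim = self._infer_output_dim(sample)
        metadata = ...                # graph metadata dict for hetero models
        self.model = self._build_model(sample, metadata, out_dim)

    def _initialize_loss_managers(self):
        self.loss_manager = ...       # loss computation for self.loss_type

    def _checkpoint_tag(self):
        return "single-case"          # short tag used in checkpoint filenames

    def train_epoch(self):
        ...                           # one pass over the training loader(s)

    def validate(self):
        # Must return (val_loss, val_task_loss, val_metrics); a None val_loss
        # makes _run_validation skip checkpointing/early-stopping logic.
        return None, None, {}
```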

Parameters:

| Name | Type | Description | Default |
| ---- | ---- | ----------- | ------- |
| `config` | `dict` | Full training configuration (parsed YAML). | *required* |
| `model_type` | `str` | Model architecture identifier (e.g. `"HeteroGNN"`). | *required* |
| `loss_type` | `str` | Loss function name (e.g. `"mse"`, `"rmse"`). | `'mse'` |
| `minmax_scaling` | `bool` | Whether to apply min-max scaling in the forward pass. | `True` |
| `local_rank` | `int` | Local GPU rank for DDP. | `0` |
| `global_rank` | `int` | Global process rank for DDP. | `0` |
| `world_size` | `int` | Total number of DDP processes. | `1` |
| `wandb_run_name` | `str` | Custom W&B run name. | `None` |
| `wandb_group_name` | `str` | W&B run group. | `None` |
| `wandb_requested` | `bool` | Whether W&B logging was requested. | `False` |
| `wandb_project` | `str` | W&B project name. | `'lumina-training'` |
| `wandb_entity` | `str` | W&B entity/team name. | `None` |
| `run_metadata` | `dict` | Extra metadata to log with the run. | `None` |
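
The `config` argument is the parsed YAML as a plain dict. The following hedged sketch shows the shape the constructor expects, using only keys that the source below actually reads; all values are illustrative placeholders, not recommended settings.

```python
# Illustrative only: keys mirror those read in the source below; values are
# placeholders. The constructor requires "training", "checkpoint_dir", and
# "optimizer"; most other keys are read lazily with defaults.
config = {
    "root": "/data/opf",                      # dataset root directory
    "checkpoint_dir": "/checkpoints/run1",
    "data": {"backend": "in_memory"},         # or "on_disk" / "sharded"
    "training": {
        "max_epochs": 100,
        "patience": 10,
        "gradient_clip_val": 1.0,             # optional; None disables clipping
        "accumulate_grad_batches": 1,
        "log_every_n_samples": 512,
    },
    "checkpointing": {"every_n_epochs": 1, "save_last": True},
    "loader": {"batch_size": 32, "num_workers": 4, "pin_memory": True},
    "optimizer": {"Adam": {"lr": 1e-3}},      # or "AdamW": {...}
    "scheduler": {"type": "cosine", "eta_min": 1e-6},
    "models": {"HeteroGNN": {"hidden_channels": 128, "num_layers": 4}},
    "train_split": 0.8,                       # used by the sharded backend
    "val_split": 0.1,
}
```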
Source code in lumina/trainer/opf/trainer.py
class BaseOPFTrainer:
    """Base trainer for AC Optimal Power Flow GNN models with DDP support.

    Provides the shared training infrastructure used by both single-case and
    multi-case trainers, including optimizer/scheduler initialization, gradient
    clipping, non-finite loss handling, W&B logging, checkpoint management,
    sample-based scheduling, and throughput measurement.

    Subclasses must implement ``_load_data``, ``_create_dataloaders``,
    ``_create_model``, ``_initialize_loss_managers``, ``_checkpoint_tag``,
    ``train_epoch``, and ``validate``.

    Args:
        config (dict): Full training configuration (parsed YAML).
        model_type (str): Model architecture identifier (e.g. ``"HeteroGNN"``).
        loss_type (str): Loss function name (e.g. ``"mse"``, ``"rmse"``).
        minmax_scaling (bool): Whether to apply min-max scaling in the forward pass.
        local_rank (int): Local GPU rank for DDP.
        global_rank (int): Global process rank for DDP.
        world_size (int): Total number of DDP processes.
        wandb_run_name (str, optional): Custom W&B run name.
        wandb_group_name (str, optional): W&B run group.
        wandb_requested (bool): Whether W&B logging was requested.
        wandb_project (str): W&B project name.
        wandb_entity (str, optional): W&B entity/team name.
        run_metadata (dict, optional): Extra metadata to log with the run.
    """

    TRAIN_METRIC_NAMES = (
        "train/loss/total",
        "train/loss/objective",
        "train/perf/nonfinite_loss_skips",
        "train/perf/nonfinite_grad_skips",
    )
    VAL_METRIC_NAMES = (
        "val/loss/total",
        "val/loss/objective",
        "val/perf/eval_batches",
        "val/perf/data_ms",
        "val/perf/forward_ms",
        "val/perf/loss_ms",
        "val/perf/total_ms",
    )
    TRAIN_METRIC_MAP = (
        ("objective", "train/loss/objective"),
    )
    VAL_METRIC_MAP = ()
    TRAIN_METRIC_GROUPS = (
        (
            "train/loss/total",
            "train/loss/objective",
        ),
        (
            "train/perf/nonfinite_loss_skips",
            "train/perf/nonfinite_grad_skips",
        ),
    )
    VAL_METRIC_GROUPS = (
        ("val/loss/total", "val/loss/objective", "val/score"),
        (
            "val/perf/eval_batches",
            "val/perf/data_ms",
            "val/perf/forward_ms",
            "val/perf/loss_ms",
            "val/perf/total_ms",
        ),
    )

    def __init__(
        self,
        config,
        model_type,
        loss_type="mse",
        minmax_scaling=True,
        local_rank=0,
        global_rank=0,
        world_size=1,
        wandb_run_name=None,
        wandb_group_name=None,
        wandb_requested=False,
        wandb_project="lumina-training",
        wandb_entity=None,
        run_metadata=None,
    ):
        self.config = config
        self.model_type = model_type
        self.loss_type = loss_type
        self.minmax_scaling = minmax_scaling
        self.local_rank = local_rank
        self.global_rank = global_rank
        self.world_size = world_size
        self.device = torch.device(f"cuda:{local_rank}")
        self.wandb_run_name = wandb_run_name
        self.wandb_group_name = wandb_group_name
        self.wandb_requested = wandb_requested
        self.wandb_project = wandb_project
        self.wandb_entity = wandb_entity
        self.run_metadata = run_metadata
        self.model_summary = None
        self.model_class = None
        self.model_kwargs = None
        data_config = self.config.get("data", {})
        if "use_precomputed_homo" in data_config:
            use_precomputed = bool(data_config.get("use_precomputed_homo"))
        else:
            use_precomputed = True
        self.use_precomputed_homo = use_precomputed and self.model_type not in HETERO_MODEL_TYPES
        self.dataset_backend = str(data_config.get("backend", "in_memory")).lower()
        self.on_disk_backend = str(data_config.get("on_disk_backend", "sqlite")).lower()
        self.on_disk_write_batch_size = int(data_config.get("on_disk_write_batch_size", 128))
        self.on_disk_sqlite_timeout_sec = float(data_config.get("on_disk_sqlite_timeout_sec", 600.0))
        sqlite_busy_timeout = data_config.get("on_disk_sqlite_busy_timeout_ms")
        self.on_disk_sqlite_busy_timeout_ms = (
            int(sqlite_busy_timeout) if sqlite_busy_timeout is not None else None
        )
        self.on_disk_sqlite_journal_mode = data_config.get("on_disk_sqlite_journal_mode", "WAL")
        self.on_disk_sqlite_synchronous = data_config.get("on_disk_sqlite_synchronous", "NORMAL")
        self.data_staging = data_config.get("staging", {}) if isinstance(data_config.get("staging"), dict) else {}
        self.data_staging_lock_timeout = int(self.data_staging.get("lock_timeout_sec", 7200))
        self.on_disk_homo_suffix = str(data_config.get("on_disk_homo_suffix", "homo"))
        self.on_disk_homo_prune = bool(data_config.get("on_disk_homo_prune", True))
        self.on_disk_homo_storage_dtype = data_config.get("on_disk_homo_storage_dtype", "float16")
        self.on_disk_homo_restore_fp32 = bool(data_config.get("on_disk_homo_restore_fp32", True))
        self.on_disk_homo_attach_full_edge_attr = bool(
            data_config.get("on_disk_homo_attach_full_edge_attr", False)
        )
        self.on_disk_homo_sanitize_targets = bool(data_config.get("on_disk_homo_sanitize_targets", True))
        self.on_disk_homo_log_bad_targets = bool(data_config.get("on_disk_homo_log_bad_targets", True))
        self.on_disk_homo_max_bad_target_logs = int(data_config.get("on_disk_homo_max_bad_target_logs", 1))
        self.sharded_root = data_config.get("sharded_root")
        self.sharded_manifest_name = str(data_config.get("sharded_manifest_name", "manifest.json"))
        self.sharded_suffix = data_config.get("sharded_suffix")
        self.sharded_homo_suffix = data_config.get("sharded_homo_suffix", self.on_disk_homo_suffix)
        split_seed = data_config.get("sharded_split_seed", self.config.get("split_seed", 42))
        self.sharded_split_seed = int(split_seed)
        self.homo_dataset_kwargs = {}
        if isinstance(data_config.get("homo_dataset_kwargs"), dict):
            self.homo_dataset_kwargs.update(data_config["homo_dataset_kwargs"])
        default_homo_kwargs = {
            "processed_suffix": self.on_disk_homo_suffix,
            "attach_full_edge_attr": self.on_disk_homo_attach_full_edge_attr,
            "sanitize_targets": self.on_disk_homo_sanitize_targets,
            "log_bad_targets": self.on_disk_homo_log_bad_targets,
            "max_bad_target_logs": self.on_disk_homo_max_bad_target_logs,
        }
        for key, value in default_homo_kwargs.items():
            self.homo_dataset_kwargs.setdefault(key, value)

        training_config = self.config["training"]
        self.max_epochs = training_config["max_epochs"]
        self.patience = training_config["patience"]
        self.grad_clip_val = training_config.get("gradient_clip_val")
        self.grad_clip_algo = training_config.get("gradient_clip_algorithm", "norm")
        self.accumulate_grad_batches = max(1, int(training_config.get("accumulate_grad_batches", 1)))
        case_mix_every = training_config.get("case_mix_every_n_steps", 0)
        try:
            case_mix_every = int(case_mix_every)
        except (TypeError, ValueError):
            case_mix_every = 0
        self.case_mix_every_n_steps = max(0, case_mix_every)
        self.log_every_n_steps = training_config.get("log_every_n_steps", 0)
        self.fail_on_nonfinite = bool(training_config.get("fail_on_nonfinite", False))
        self.log_every_n_samples = int(training_config.get("log_every_n_samples", 512) or 0)
        val_every_n_epochs = training_config.get(
            "val_every_n_epochs",
            training_config.get("val_check_interval", 1),
        )
        self.val_every_n_epochs = max(1, int(val_every_n_epochs or 1))
        self.val_every_n_samples = int(training_config.get("val_every_n_samples") or 0)
        self.val_subset_samples = max(0, int(training_config.get("val_subset_samples") or 0))
        self.max_global_samples = int(training_config.get("max_global_samples") or 0)
        score_alpha = training_config.get("score_alpha", 1.0)
        self.score_alpha = float(1.0 if score_alpha is None else score_alpha)
        self.log_normalized_violation = bool(training_config.get("log_normalized_violation", False))
        violation_eval_p = training_config.get("violation_eval_p", 1.0)
        self.violation_eval_p = float(1.0 if violation_eval_p is None else violation_eval_p)
        if self.violation_eval_p < 0.0:
            self.violation_eval_p = 0.0
        elif self.violation_eval_p > 1.0:
            self.violation_eval_p = 1.0
        self.validation_timing = bool(training_config.get("validation_timing", False))
        validation_timing_every = training_config.get("validation_timing_every_n_batches", 1)
        try:
            validation_timing_every = int(validation_timing_every)
        except (TypeError, ValueError):
            validation_timing_every = 1
        self.validation_timing_every_n_batches = max(1, validation_timing_every)
        validation_timing_max = training_config.get("validation_timing_max_batches", 0)
        try:
            validation_timing_max = int(validation_timing_max)
        except (TypeError, ValueError):
            validation_timing_max = 0
        self.validation_timing_max_batches = max(0, validation_timing_max)
        min_eval_batches = training_config.get("violation_eval_min_batches", 1)
        self.violation_eval_min_batches = max(0, int(min_eval_batches or 0))
        violation_eval_seed = training_config.get("violation_eval_seed")
        self.violation_eval_seed = None if violation_eval_seed is None else int(violation_eval_seed)

        checkpoint_config = self.config.get("checkpointing", {})
        self.ckpt_every_n_epochs = int(checkpoint_config.get("every_n_epochs") or 0)
        self.ckpt_every_n_samples = int(checkpoint_config.get("every_n_samples") or 0)
        self.save_last_checkpoint = bool(checkpoint_config.get("save_last", False))

        self.checkpoint_dir = config["checkpoint_dir"]

        self._load_data()
        self._create_dataloaders()
        self._init_sample_schedules()
        self._create_model()
        self._initialize_loss_managers()
        self._init_optimizer()
        self._init_scheduler()

        self.current_epoch = 0
        self.best_val_loss = float("inf")
        self.patience_counter = 0
        self.global_step = 0
        self.global_samples = 0
        self.stop_training = False
        self._next_log_samples = self.log_every_n_samples if self.log_every_n_samples > 0 else None

        self.wandb_run = None
        self.wandb_enabled = False
        self.nonfinite_loss_skips = 0
        self.nonfinite_grad_skips = 0

        self.train_metric_names = list(self.TRAIN_METRIC_NAMES)
        self.val_metric_names = list(self.VAL_METRIC_NAMES)
        self.train_metric_map = list(self.TRAIN_METRIC_MAP)
        self.val_metric_map = list(self.VAL_METRIC_MAP)
        self.train_metric_groups = [list(group) for group in self.TRAIN_METRIC_GROUPS]
        self.val_metric_groups = [list(group) for group in self.VAL_METRIC_GROUPS]

        self._init_wandb()
        self.throughput_tracker = None
        if training_config.get("throughput_enabled", True):
            self.throughput_tracker = ThroughputTracker(
                config=self.config,
                world_size=self.world_size,
                global_rank=self.global_rank,
                get_global_step=lambda: self.global_samples,
                wandb_enabled=self.wandb_enabled,
            )
            if not self.throughput_tracker.enabled:
                self.throughput_tracker = None

    def _infer_output_dim(self, sample_data):
        if hasattr(sample_data, "node_types"):
            y = getattr(sample_data["bus"], "y", None)
        else:
            y = getattr(sample_data, "y", None)
        if y is None:
            raise ValueError("Unable to infer per-node output size from dataset sample.")
        if y.ndim <= 1:
            return 1
        return int(y.shape[-1])

    def _use_on_disk_backend(self):
        return self.dataset_backend == "on_disk"

    def _use_sharded_backend(self):
        return self.dataset_backend == "sharded"

    def _select_dataset_cls(self):
        if self._use_on_disk_backend():
            if self.model_type in HETERO_MODEL_TYPES:
                return OPFOnDiskDataset
            if self.use_precomputed_homo:
                return OPFOnDiskHomogeneousDataset
            return OPFOnDiskDataset
        return OPFHomogeneousDataset if self.use_precomputed_homo else OPFDataset

    def _resolve_sharded_root(self):
        return self.sharded_root or self.config["root"]

    def _sharded_processed_suffix(self):
        if self.use_precomputed_homo:
            return self.sharded_homo_suffix
        return self.sharded_suffix

    def _stage_on_disk(self, case_name, group_ids, dataset_cls, build_kwargs, processed_suffix=None):
        if not self._use_on_disk_backend():
            return self.config["root"]

        if not self.data_staging.get("enabled", False):
            return self.config["root"]

        stage_root = resolve_stage_root(self.data_staging)
        if not stage_root:
            if self.global_rank == 0:
                print("Warning: staging enabled but no stage root resolved; using shared root.")
            return self.config["root"]

        source_root = self.config["root"]
        if os.path.abspath(stage_root) == os.path.abspath(source_root):
            return source_root

        for group_id in group_ids:
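            # Two-phase per group: global rank 0 builds the group's on-disk DB
            # under a file lock if it is missing, then each node's local rank 0
            # stages it to stage_root; dist.barrier() keeps ranks in sync.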
            src_path = get_on_disk_db_path(
                source_root,
                case_name,
                group_id,
                self.on_disk_backend,
                processed_suffix,
            )
            lock_path = get_on_disk_lock_path(
                source_root,
                case_name,
                group_id,
                self.on_disk_backend,
                processed_suffix,
            )
            if self.global_rank == 0:
                with file_lock(lock_path, timeout_sec=self.data_staging_lock_timeout):
                    if not os.path.exists(src_path):
                        print(f"On-disk dataset missing at {src_path}; building on shared root.")
                        dataset = dataset_cls(group_id=group_id, **build_kwargs, log=True)
                        dataset.close()
            if dist.is_available() and dist.is_initialized():
                dist.barrier()

            if self.local_rank == 0:
                with file_lock(lock_path, timeout_sec=self.data_staging_lock_timeout):
                    stage_on_disk_group(
                        source_root=source_root,
                        stage_root=stage_root,
                        case_name=case_name,
                        group_id=group_id,
                        backend=self.on_disk_backend,
                        processed_suffix=processed_suffix,
                        log=self.global_rank == 0,
                    )
            if dist.is_available() and dist.is_initialized():
                dist.barrier()

        return stage_root

    def _stage_sharded(self, case_name, processed_suffix=None):
        if not self._use_sharded_backend():
            return self._resolve_sharded_root()

        if not self.data_staging.get("enabled", False):
            return self._resolve_sharded_root()

        stage_root = resolve_stage_root(self.data_staging)
        if not stage_root:
            if self.global_rank == 0:
                print("Warning: staging enabled but no stage root resolved; using shared root.")
            return self._resolve_sharded_root()

        source_root = self._resolve_sharded_root()
        if os.path.abspath(stage_root) == os.path.abspath(source_root):
            return source_root

        manifest_path = get_sharded_manifest_path(
            source_root,
            case_name,
            processed_suffix,
            self.sharded_manifest_name,
        )
        lock_path = get_sharded_lock_path(
            source_root,
            case_name,
            processed_suffix,
            self.sharded_manifest_name,
        )

        if self.global_rank == 0:
            with file_lock(lock_path, timeout_sec=self.data_staging_lock_timeout):
                if not os.path.exists(manifest_path):
                    raise FileNotFoundError(
                        f"Sharded manifest missing at {manifest_path}. "
                        "Run scripts/opf_build_shards.py first."
                    )
        if dist.is_available() and dist.is_initialized():
            dist.barrier()

        if self.local_rank == 0:
            with file_lock(lock_path, timeout_sec=self.data_staging_lock_timeout):
                stage_sharded_case(
                    source_root=source_root,
                    stage_root=stage_root,
                    case_name=case_name,
                    processed_suffix=processed_suffix,
                    manifest_name=self.sharded_manifest_name,
                    log=self.global_rank == 0,
                )
        if dist.is_available() and dist.is_initialized():
            dist.barrier()

        return stage_root

    def _is_on_disk_dataset_cls(self, dataset_cls) -> bool:
        try:
            return issubclass(dataset_cls, OPFOnDiskDataset)
        except TypeError:
            return False

    def _make_dataset_kwargs(self, dataset_cls, root, case_name):
        dataset_kwargs = dict(
            root=root,
            case_name=case_name,
            local_raw_folder=self.config.get("local_raw_folder"),
            force_reload=False,
        )

        if self._is_on_disk_dataset_cls(dataset_cls):
            dataset_kwargs.update(
                {
                    "backend": self.on_disk_backend,
                    "write_batch_size": self.on_disk_write_batch_size,
                    "sqlite_timeout_sec": self.on_disk_sqlite_timeout_sec,
                    "sqlite_busy_timeout_ms": self.on_disk_sqlite_busy_timeout_ms,
                    "sqlite_journal_mode": self.on_disk_sqlite_journal_mode,
                    "sqlite_synchronous": self.on_disk_sqlite_synchronous,
                }
            )

        if dataset_cls is OPFOnDiskHomogeneousDataset:
            dataset_kwargs.update(
                {
                    "processed_suffix": self.on_disk_homo_suffix,
                    "prune_homo": self.on_disk_homo_prune,
                    "storage_dtype": self.on_disk_homo_storage_dtype,
                    "restore_fp32": self.on_disk_homo_restore_fp32,
                    "attach_full_edge_attr": self.on_disk_homo_attach_full_edge_attr,
                    "sanitize_targets": self.on_disk_homo_sanitize_targets,
                    "log_bad_targets": self.on_disk_homo_log_bad_targets,
                    "max_bad_target_logs": self.on_disk_homo_max_bad_target_logs,
                }
            )
        elif dataset_cls is OPFHomogeneousDataset and self.homo_dataset_kwargs:
            dataset_kwargs.update(self.homo_dataset_kwargs)

        return dataset_kwargs

    def _log_dataset_choice(
        self,
        case_name,
        dataset_cls,
        dataset_root,
        processed_suffix=None,
        manifest_path=None,
    ):
        if self.global_rank != 0:
            return
        dataset_name = dataset_cls.__name__ if dataset_cls is not None else "OPFShardedIterableDataset"
        parts = [
            f"backend={self.dataset_backend}",
            f"dataset_cls={dataset_name}",
            f"root={dataset_root}",
        ]
        if self.dataset_backend == "on_disk":
            parts.append(f"on_disk_backend={self.on_disk_backend}")
        if processed_suffix:
            parts.append(f"processed_suffix={processed_suffix}")
        if manifest_path:
            parts.append(f"manifest={manifest_path}")
        print(f"Dataset config ({case_name}): " + ", ".join(parts))

    def _load_sharded_splits(self, case_name, group_ids):
        processed_suffix = self._sharded_processed_suffix()
        dataset_root = self._stage_sharded(case_name, processed_suffix)
        manifest_path = get_sharded_manifest_path(
            dataset_root,
            case_name,
            processed_suffix,
            self.sharded_manifest_name,
        )
        self._log_dataset_choice(
            case_name,
            OPFShardedIterableDataset,
            dataset_root,
            processed_suffix=processed_suffix,
            manifest_path=manifest_path,
        )
        if not os.path.exists(manifest_path):
            raise FileNotFoundError(
                f"Sharded manifest not found at {manifest_path}. "
                "Run scripts/opf_build_shards.py first."
            )
        manifest = load_shard_manifest(manifest_path)
        all_shards = build_shard_infos(manifest)

        splits = {}
        if "splits" in manifest:
            for split in ("train", "val", "test"):
                try:
                    split_shards = resolve_split_shards(manifest, all_shards, split)
                except KeyError:
                    split_shards = []
                split_shards = filter_shards_by_group(split_shards, group_ids)
                splits[split] = split_shards
        else:
            filtered = filter_shards_by_group(all_shards, group_ids)
            splits = split_shards_by_ratio(
                filtered,
                self.config["train_split"],
                self.config["val_split"],
                seed=self.sharded_split_seed,
                shuffle=True,
            )

        if not splits.get("train"):
            raise ValueError("Sharded dataset has no training shards after filtering.")
        splits.setdefault("val", [])
        splits.setdefault("test", [])
        return splits

    def _val_subset_seed(self, case_idx=0):
        return int(self.sharded_split_seed) + int(case_idx)

    def _maybe_limit_val_dataset(self, dataset, case_idx=0, case_label=None):
        if self.val_subset_samples <= 0:
            return dataset
        label = case_label if case_label is not None else f"case_{case_idx}"
        if isinstance(dataset, IterableDataset):
            if self.global_rank == 0:
                try:
                    total = len(dataset)
                except TypeError:
                    total = None
                if total is None:
                    print(f"Validation subset for {label}: {self.val_subset_samples} samples")
                else:
                    print(f"Validation subset for {label}: {self.val_subset_samples}/{total} samples")
            return LimitedIterableDataset(dataset, self.val_subset_samples)
        try:
            total = len(dataset)
        except TypeError:
            return dataset
        if total <= self.val_subset_samples:
            return dataset
        generator = torch.Generator().manual_seed(self._val_subset_seed(case_idx))
        indices = torch.randperm(total, generator=generator)[: self.val_subset_samples].tolist()
        indices.sort()
        if self.global_rank == 0:
            print(f"Validation subset for {label}: {self.val_subset_samples}/{total} samples")
        return Subset(dataset, indices)

    def _loader_kwargs(self, loader_config):
        num_workers = int(loader_config.get("num_workers", 0))
        kwargs = {
            "batch_size": loader_config["batch_size"],
            "num_workers": num_workers,
            "pin_memory": bool(loader_config.get("pin_memory", True)),
        }
        if num_workers > 0:
            prefetch_factor = loader_config.get("prefetch_factor")
            if prefetch_factor is not None:
                kwargs["prefetch_factor"] = int(prefetch_factor)
            persistent_workers = loader_config.get("persistent_workers")
            if persistent_workers is not None:
                kwargs["persistent_workers"] = bool(persistent_workers)
        return kwargs

    def _load_data(self):
        raise NotImplementedError

    def _create_dataloaders(self):
        raise NotImplementedError

    def _create_model(self):
        raise NotImplementedError

    def _initialize_loss_managers(self):
        raise NotImplementedError

    def _default_wandb_run_name(self):
        return f"acopf-ddp-{self.model_type}-{self.loss_type}"

    def _should_print_epoch(self):
        return self.global_rank == 0 and not self.wandb_enabled

    def _checkpoint_tag(self):
        raise NotImplementedError

    def _checkpoint_payload(self):
        return {
            "epoch": self.current_epoch,

            "config": self.config,
            "run_metadata": self.run_metadata,

            "model_class": self.model_class,
            "model_kwargs": self.model_kwargs,

            "loss_type": self.loss_type,

            "model_state_dict": self.model.module.state_dict(),
            "optimizer_state_dict": self.optimizer.state_dict(),
            "best_val_loss": self.best_val_loss,
        }

    def _on_checkpoint_saved(self, filepath):
        return

    def _build_model(self, sample_data, metadata, per_node_output_size):
        if self.global_rank == 0:
            print(f"Per-node output size: {per_node_output_size}")

        if self.model_type in HETERO_MODEL_TYPES:
            input_channels = {}
            node_types = list(metadata["nodes"].keys())
            edge_types = list(metadata["edges"].keys())
            metadata_tuple = (node_types, edge_types)

            for node_type in node_types:
                if node_type in sample_data.x_dict:
                    input_channels[node_type] = sample_data[node_type].x.shape[1]

            model_class, model_kwargs, model_config, used_fallback = build_hetero_model_spec(
                model_type=self.model_type,
                metadata=metadata_tuple,
                input_channels=input_channels,
                models_config=self.config.get("models", {}),
                out_channels=per_node_output_size,
            )
            if used_fallback and self.global_rank == 0:
                print(f"Warning: Config for {self.model_type} not found, using HeteroGNN config")
            model_config = self.config["models"].get(
                self.model_type, self.config["models"]["HeteroGNN"]
            )

            if self.model_type == "HeteroGNN":
                model_class = OPFHeteroGNN
            elif self.model_type == "RGAT":
                model_class = RGAT
            elif self.model_type == "HEAT":
                model_class = HEAT
            elif self.model_type == "HGT":
                model_class = HGT

            model_kwargs = kwargs = {
                "metadata": metadata_tuple,
                "input_channels": input_channels,
                "hidden_channels": model_config["hidden_channels"],
                "out_channels": per_node_output_size,
                "num_layers": model_config["num_layers"],
                "backend": model_config.get("backend", "sage"),
            }
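            # ``kwargs`` aliases ``model_kwargs`` (same dict object), so the
            # per-model tweaks below update both names.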

            if self.model_type in {"RGAT", "HGT"}:
                kwargs["num_heads"] = model_config.get("num_heads", 1)
            if self.model_type == "HGT":
                kwargs["dropout"] = model_config.get("dropout", 0.0)
            if self.model_type == "HEAT":
                kwargs["attention_heads"] = model_config.get("attention_heads", 1)

            self.model_class = f"{model_class.__module__}.{model_class.__name__}"
            self.model_kwargs = model_kwargs

            model = model_class(**model_kwargs)

            initialize_model(model, sample_data, self.device)

            if self.global_rank == 0:
                print(f"{self.model_type} Model created")
                self.model_summary = describe_model(
                    model,
                    model_type=self.model_type,
                    model_config=model_config,
                    print_fn=print,
                )
            else:
                self.model_summary = None
        else:
            homo_sample = self._get_homo_sample(sample_data)
            input_dim = homo_sample.x.shape[1]

            if self.model_type in self.config["models"]:
                model_config = self.config["models"][self.model_type]
            elif "HomoGNN" in self.config["models"]:
                model_config = self.config["models"]["HomoGNN"]
            else:
                model_config = {
                    "hidden_dim": 64,
                    "num_layers": 3,
                    "dropout": 0.1,
                    "readout": "mean",
                    "edge_dim": homo_sample.edge_attr.shape[1],
                }

            model_config["model_name"] = self.model_type
            if "edge_dim" not in model_config:
                edge_attr = getattr(homo_sample, "edge_attr", None)
                if edge_attr is None:
                    model_config["edge_dim"] = 1
                else:
                    edge_dim = edge_attr.size(-1) if edge_attr.dim() > 1 else 1
                    model_config["edge_dim"] = int(edge_dim)

            kwargs = {'input_dim': input_dim, 'output_dim': per_node_output_size, 'model_params': model_config}

            model = get_gnnNets(**kwargs)

            self.model_class = f"{model.__class__.__module__}.{model.__class__.__name__}"
            self.model_kwargs = kwargs

            initialize_model(model, homo_sample, self.device)
            if self.global_rank == 0:
                print(f"{self.model_type} Model created")
                self.model_summary = describe_model(
                    model,
                    model_type=self.model_type,
                    model_config=model_config,
                    print_fn=print,
                )
            else:
                self.model_summary = None

        model = DDP(model, device_ids=[self.local_rank], find_unused_parameters=True)
        return model

    def _get_homo_sample(self, sample_data):
        if hasattr(self, "train_loader") and self.train_loader is not None:
            try:
                return self.train_loader.dataset[0]
            except Exception:
                pass
        if hasattr(self, "train_loaders") and self.train_loaders:
            for loader in self.train_loaders.values():
                if loader is None:
                    continue
                try:
                    return loader.dataset[0]
                except Exception:
                    continue
        if hasattr(sample_data, "x") and hasattr(sample_data, "edge_index"):
            return sample_data
        return convert_opf_to_homo(sample_data)

    def _init_optimizer(self):
        optimizer_config = self.config["optimizer"]
        if "Adam" in optimizer_config:
            self.optimizer = optim.Adam(self.model.parameters(), **optimizer_config["Adam"])
        elif "AdamW" in optimizer_config:
            self.optimizer = optim.AdamW(self.model.parameters(), **optimizer_config["AdamW"])
        else:
            raise ValueError("Optimizer config must define 'Adam' or 'AdamW'.")

    def _effective_global_batch_size(self):
        loader_config = self.config.get("loader", {})
        batch_size = int(loader_config.get("batch_size", 1))
        return batch_size * self.world_size * max(1, int(self.accumulate_grad_batches))

    def _train_steps_per_epoch(self):
        accumulate = max(1, int(self.accumulate_grad_batches))
        if hasattr(self, "train_loader") and self.train_loader is not None:
            try:
                total_batches = len(self.train_loader)
            except Exception:
                total_batches = None
            if total_batches is not None:
                return int(math.ceil(total_batches / accumulate))
        if hasattr(self, "train_loaders") and self.train_loaders:
            sequential = self.case_mix_every_n_steps <= 0 or len(self.train_loaders) <= 1
            total_steps = 0
            total_batches = 0
            found = False
            for loader in self.train_loaders.values():
                if loader is None:
                    continue
                try:
                    num_batches = len(loader)
                except Exception:
                    return None
                found = True
                if sequential:
                    total_steps += int(math.ceil(num_batches / accumulate))
                else:
                    total_batches += num_batches
            if not found:
                return None
            if sequential:
                return total_steps
            return int(math.ceil(total_batches / accumulate))
        return None

    def _infer_scheduler_t_max(self):
        training_config = self.config.get("training", {})
        max_global_samples = training_config.get("max_global_samples")
        try:
            max_global_samples = int(max_global_samples) if max_global_samples is not None else 0
        except (TypeError, ValueError):
            max_global_samples = 0
        if max_global_samples > 0:
            global_batch_size = self._effective_global_batch_size()
            if global_batch_size > 0:
                return int(math.ceil(max_global_samples / global_batch_size))

        max_epochs = training_config.get("max_epochs")
        try:
            max_epochs = int(max_epochs) if max_epochs is not None else 0
        except (TypeError, ValueError):
            max_epochs = 0
        if max_epochs <= 0:
            return None
        steps_per_epoch = self._train_steps_per_epoch()
        if steps_per_epoch is None or steps_per_epoch <= 0:
            return None
        return int(max_epochs * steps_per_epoch)

    def _init_scheduler(self):
        self.scheduler = None
        scheduler_config = self.config.get("scheduler")
        if not isinstance(scheduler_config, dict):
            return
        sched_type = str(scheduler_config.get("type", "")).strip().lower()
        if not sched_type or sched_type in {"none", "null", "false"}:
            return
        if sched_type not in {"cosine", "cosineannealing", "cosineannealinglr"}:
            if self.global_rank == 0:
                print(f"Warning: Unsupported scheduler type '{sched_type}'. Skipping scheduler.")
            return

        t_max = scheduler_config.get("t_max")
        if t_max is None:
            t_max = self._infer_scheduler_t_max()
            if t_max is None or t_max <= 0:
                if self.global_rank == 0:
                    print("Warning: scheduler.t_max not set and could not infer; skipping scheduler.")
                return
            if self.global_rank == 0:
                print(f"Setting scheduler.t_max={t_max} from training config.")
        try:
            t_max = int(t_max)
        except (TypeError, ValueError):
            if self.global_rank == 0:
                print("Warning: scheduler.t_max must be an integer; skipping scheduler.")
            return
        if t_max <= 0:
            if self.global_rank == 0:
                print("Warning: scheduler.t_max must be > 0; skipping scheduler.")
            return

        eta_min = scheduler_config.get("eta_min", 0.0)
        try:
            eta_min = float(eta_min)
        except (TypeError, ValueError):
            eta_min = 0.0

        self.scheduler = optim.lr_scheduler.CosineAnnealingLR(
            self.optimizer,
            T_max=t_max,
            eta_min=eta_min,
        )
        if self.global_rank == 0:
            print(f"LR scheduler initialized: cosine (t_max={t_max}, eta_min={eta_min})")

    def _clip_gradients(self):
        if self.grad_clip_val is None or self.grad_clip_val <= 0:
            return

        parameters = [p for p in self.model.parameters() if p.requires_grad]
        if self.grad_clip_algo == "value":
            torch.nn.utils.clip_grad_value_(parameters, self.grad_clip_val)
        else:
            torch.nn.utils.clip_grad_norm_(
                parameters,
                self.grad_clip_val,
                error_if_nonfinite=True,
            )

    def _handle_nonfinite_loss(self, loss, batch_idx, case_name=None):
        if torch.isfinite(loss).all():
            return False
        context = f"rank={self.global_rank} step={self.global_step} batch={batch_idx}"
        if case_name is not None:
            context = f"{context} case={case_name}"
        message = f"Non-finite loss detected ({context}); skipping optimizer step."
        if self.fail_on_nonfinite:
            raise RuntimeError(message)
        print(f"Warning: {message}")
        self.nonfinite_loss_skips += 1
        self.optimizer.zero_grad()
        return True

    def _ensure_finite_gradients(self, batch_idx, case_name=None):
        context = f"rank={self.global_rank} step={self.global_step} batch={batch_idx}"
        if case_name is not None:
            context = f"{context} case={case_name}"

        try:
            self._clip_gradients()
        except RuntimeError as error:
            message = f"Non-finite gradients detected during clipping ({context}): {error}"
            if self.fail_on_nonfinite:
                raise RuntimeError(message) from error
            print(f"Warning: {message}")
            self.nonfinite_grad_skips += 1
            self.optimizer.zero_grad()
            return False

        if self.grad_clip_val is None or self.grad_clip_val <= 0:
            for parameter in self.model.parameters():
                if not parameter.requires_grad or parameter.grad is None:
                    continue
                if not torch.isfinite(parameter.grad).all():
                    message = f"Non-finite gradients detected without clipping ({context})."
                    if self.fail_on_nonfinite:
                        raise RuntimeError(message)
                    print(f"Warning: {message}")
                    self.nonfinite_grad_skips += 1
                    self.optimizer.zero_grad()
                    return False
        return True

    def _init_wandb(self):
        if self.wandb_requested is False:
            return
        if not WANDB_AVAILABLE:
            if self.global_rank == 0 and self.wandb_requested:
                print("Warning: Weights & Biases is not available. Install wandb or omit --wandb.")
            return
        if self.global_rank != 0:
            return
        if wandb.run is not None:
            self.wandb_run = wandb.run
            self.wandb_enabled = True
            try:
                wandb.config.update(self.config, allow_val_change=True)
            except Exception:
                pass
            if self.run_metadata:
                try:
                    wandb.config.update({"run_metadata": self.run_metadata}, allow_val_change=True)
                except Exception:
                    pass
            self._log_model_summary()
            return
        logging_dir = self.config.get("logging_dir")
        run_name = self.wandb_run_name or self._default_wandb_run_name()
        try:
            wandb_kwargs = {
                "project": self.wandb_project,
                "name": run_name,
                "dir": logging_dir,
                "config": self.config,
                "group": self.wandb_group_name,
            }
            if self.wandb_entity:
                wandb_kwargs["entity"] = self.wandb_entity
            self.wandb_run = wandb.init(**wandb_kwargs)
            self.wandb_enabled = True
            if self.run_metadata:
                try:
                    wandb.config.update({"run_metadata": self.run_metadata}, allow_val_change=True)
                except Exception:
                    pass
            self._log_model_summary()
        except Exception as exc:
            print(f"Warning: W&B init failed: {exc}")
            self.wandb_run = None
            self.wandb_enabled = False

    def _log_model_summary(self):
        if not self.model_summary or self.wandb_run is None:
            return
        try:
            self.wandb_run.summary["model_summary"] = self.model_summary
        except Exception:
            pass
        try:
            wandb.config.update({"model_summary": self.model_summary}, allow_val_change=True)
        except Exception:
            pass

    def _should_log_step(self):
        if not self.wandb_enabled:
            return False
        if self.log_every_n_samples and self.log_every_n_samples > 0:
            return self.global_samples >= self._next_log_samples
        if self.log_every_n_steps and self.log_every_n_steps > 0:
            return self.global_step % self.log_every_n_steps == 0
        return True

    def _log_wandb_step(self, loss_value, loss_info):
        if not self._should_log_step():
            return
        metrics = {
            "train/loss/total": self._as_float(loss_value),
            "train/samples_seen": int(self.global_samples),
        }
        lr = self._current_lr()
        if lr is not None:
            metrics["train/lr"] = lr
        for info_key, metric_name in self.train_metric_map:
            if info_key in loss_info:
                metric_value = self._as_float(loss_info[info_key])
                if metric_value is not None:
                    metrics[metric_name] = metric_value
        wandb.log(metrics, step=self.global_samples)
        if self._next_log_samples is not None and self.log_every_n_samples > 0:
            while self.global_samples >= self._next_log_samples:
                self._next_log_samples += self.log_every_n_samples

    def _log_wandb_validation(self, metric_avgs):
        if not self.wandb_enabled or not metric_avgs:
            return
        metrics = dict(metric_avgs)
        wandb.log(metrics, step=self.global_samples)

    def _sync_for_timing(self):
        if self.device.type == "cuda":
            torch.cuda.synchronize(self.device)
        elif hasattr(torch, "accelerator") and hasattr(torch.accelerator, "synchronize"):
            torch.accelerator.synchronize()

    def _should_time_validation_batch(self, batch_idx, timed_batches):
        if not self.validation_timing:
            return False
        if self.validation_timing_every_n_batches > 1:
            if batch_idx % self.validation_timing_every_n_batches != 0:
                return False
        if self.validation_timing_max_batches and timed_batches >= self.validation_timing_max_batches:
            return False
        return True

    def _as_float(self, value):
        if value is None:
            return None
        if torch.is_tensor(value):
            if value.numel() == 1:
                return value.detach().item()
            return value.detach().float().mean().item()
        if isinstance(value, np.ndarray):
            return float(value.mean())
        try:
            return float(value)
        except (TypeError, ValueError):
            return None

    def _current_lr(self):
        if not hasattr(self, "optimizer") or self.optimizer is None:
            return None
        if not self.optimizer.param_groups:
            return None
        lr_value = self.optimizer.param_groups[0].get("lr")
        try:
            return float(lr_value)
        except (TypeError, ValueError):
            return None

    def _init_metric_trackers(self, metric_names):
        metric_sums = {name: 0.0 for name in metric_names}
        metric_counts = {name: 0.0 for name in metric_names}
        return metric_sums, metric_counts

    def _get_batch_samples(self, batch):
        if hasattr(batch, "num_graphs"):
            return int(batch.num_graphs)
        if torch.is_tensor(batch):
            return int(batch.size(0))
        if hasattr(batch, "x") and torch.is_tensor(batch.x):
            return int(batch.x.size(0))
        loader_config = self.config.get("loader", {})
        return int(loader_config.get("batch_size", 1))

    def _train_samples_per_epoch(self):
        if hasattr(self, "train_sampler") and self.train_sampler is not None:
            total_size = getattr(self.train_sampler, "total_size", None)
            if total_size is not None:
                return int(total_size)
            num_samples = getattr(self.train_sampler, "num_samples", None)
            if num_samples is not None:
                return int(num_samples) * self.world_size

        total = 0
        if hasattr(self, "train_samplers") and self.train_samplers:
            for sampler in self.train_samplers.values():
                if sampler is None:
                    continue
                total_size = getattr(sampler, "total_size", None)
                if total_size is not None:
                    total += int(total_size)
                else:
                    num_samples = getattr(sampler, "num_samples", None)
                    if num_samples is not None:
                        total += int(num_samples) * self.world_size
            if total > 0:
                return total

        if hasattr(self, "train_loader"):
            try:
                return int(len(self.train_loader.dataset))
            except Exception:
                pass
        if hasattr(self, "train_loaders"):
            for loader in self.train_loaders.values():
                try:
                    total += int(len(loader.dataset))
                except Exception:
                    continue
            if total > 0:
                return total
        return 0

    def _init_sample_schedules(self):
        samples_per_epoch = self._train_samples_per_epoch()
        if self.val_every_n_samples <= 0 and self.val_every_n_epochs > 0 and samples_per_epoch > 0:
            self.val_every_n_samples = int(self.val_every_n_epochs * samples_per_epoch)
        if self.val_every_n_samples > 0:
            self._next_val_samples = self.val_every_n_samples
        else:
            self._next_val_samples = None

        if self.ckpt_every_n_samples <= 0 and self.ckpt_every_n_epochs > 0 and samples_per_epoch > 0:
            self.ckpt_every_n_samples = int(self.ckpt_every_n_epochs * samples_per_epoch)
        if self.ckpt_every_n_samples > 0:
            self._next_ckpt_samples = self.ckpt_every_n_samples
        else:
            self._next_ckpt_samples = None

    def _update_global_samples(self, batch):
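        # Counts samples seen across *all* ranks: assumes each rank processes a
        # same-sized batch, so local batch size * world_size is the global count.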
        batch_samples = self._get_batch_samples(batch)
        self.global_samples += batch_samples * self.world_size
        return batch_samples

    def _maybe_stop_by_samples(self):
        if self.max_global_samples <= 0:
            return False
        should_stop = self.global_samples >= self.max_global_samples
        if dist.is_available() and dist.is_initialized() and self.world_size > 1:
            stop_tensor = torch.tensor(int(should_stop), device=self.device)
            dist.all_reduce(stop_tensor, op=dist.ReduceOp.MAX)
            should_stop = bool(stop_tensor.item())
        if should_stop:
            self.stop_training = True
        return should_stop

    def _sync_batch_samples(self, batch_samples):
        if not dist.is_available() or not dist.is_initialized() or self.world_size <= 1:
            return int(batch_samples)
        sample_tensor = torch.tensor(int(batch_samples), device=self.device)
        dist.all_reduce(sample_tensor, op=dist.ReduceOp.SUM)
        return int(sample_tensor.item())

    def _sync_trigger(self, should_run):
        if dist.is_available() and dist.is_initialized() and self.world_size > 1:
            flag = torch.tensor(int(should_run), device=self.device)
            dist.all_reduce(flag, op=dist.ReduceOp.MAX)
            return bool(flag.item())
        return should_run

    def _advance_next_samples(self, next_samples, interval):
        if interval <= 0:
            return None
        if next_samples is None:
            next_samples = interval
        while next_samples <= self.global_samples:
            next_samples += interval
        return next_samples

    def _maybe_run_validation(self):
        if self.val_every_n_samples <= 0 or self._next_val_samples is None:
            return False
        should_run = self.global_samples >= self._next_val_samples
        should_run = self._sync_trigger(should_run)
        if not should_run:
            return False
        self._next_val_samples = self._advance_next_samples(
            self._next_val_samples,
            self.val_every_n_samples,
        )
        self._run_validation()
        return True

    def _run_validation(self):
        val_loss, val_task_loss, val_metrics = self.validate()
        if self.wandb_enabled:
            self._log_wandb_validation(val_metrics)
        if val_loss is None:
            self.model.train()
            return None, None, None

        if self._should_print_epoch():
            print(f"\nValidation @ samples {self.global_samples} (epoch {self.current_epoch}):")
            print(f"  Val Loss: {val_loss:.4f}, Val Task: {val_task_loss:.4f}")
            if val_metrics:
                self._print_metric_groups("  Val Metrics:", val_metrics, self.val_metric_groups)

        if val_loss < self.best_val_loss:
            self.best_val_loss = val_loss
            self.patience_counter = 0
            checkpoint_path = os.path.join(
                self.checkpoint_dir,
                f"best-{self._checkpoint_tag()}-epoch{self.current_epoch:02d}-val{val_loss:.4f}.pt",
            )
            self.save_checkpoint(checkpoint_path)
        else:
            self.patience_counter += 1
            if self.global_rank == 0:
                print(f"  No improvement. Patience: {self.patience_counter}/{self.patience}")
            if self.patience_counter >= self.patience:
                if self.global_rank == 0:
                    print(f"\nEarly stopping triggered after {self.current_epoch + 1} epochs")
                self.stop_training = True

        self.model.train()
        return val_loss, val_task_loss, val_metrics

    def _maybe_save_periodic_checkpoint(self):
        if self.ckpt_every_n_samples <= 0 or self._next_ckpt_samples is None:
            return False
        should_save = self.global_samples >= self._next_ckpt_samples
        should_save = self._sync_trigger(should_save)
        if not should_save:
            return False
        self._next_ckpt_samples = self._advance_next_samples(
            self._next_ckpt_samples,
            self.ckpt_every_n_samples,
        )
        self._save_periodic_checkpoint()
        if dist.is_available() and dist.is_initialized() and self.world_size > 1:
            dist.barrier()
        return True

    def _save_periodic_checkpoint(self):
        if self.global_rank != 0:
            return
        checkpoint_path = os.path.join(
            self.checkpoint_dir,
            f"checkpoint-{self._checkpoint_tag()}-samples{self.global_samples}.pt",
        )
        self.save_checkpoint(checkpoint_path)

    def _add_metric(self, metric_sums, metric_counts, name, value, weight=1.0):
        numeric_value = self._as_float(value)
        if numeric_value is None:
            return
        metric_sums[name] += numeric_value * weight
        metric_counts[name] += weight

    def _reduce_metrics(self, metric_sums, metric_counts):
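        # Sum both the weighted sums and the weights across ranks so the
        # final averages are weighted over all processes, not per rank.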
        if dist.is_available() and dist.is_initialized() and self.world_size > 1:
            for name in metric_sums:
                sum_tensor = torch.tensor(metric_sums[name], device=self.device)
                count_tensor = torch.tensor(metric_counts[name], device=self.device)
                dist.all_reduce(sum_tensor, op=dist.ReduceOp.SUM)
                dist.all_reduce(count_tensor, op=dist.ReduceOp.SUM)
                metric_sums[name] = sum_tensor.item()
                metric_counts[name] = count_tensor.item()
        return metric_sums, metric_counts

    def _compute_metric_avgs(self, metric_sums, metric_counts):
        metric_avgs = {}
        for name, total in metric_sums.items():
            count = metric_counts.get(name, 0.0)
            if count > 0:
                metric_avgs[name] = total / count
        return metric_avgs

    def _add_val_score(self, metric_avgs):
        if not metric_avgs:
            return
        objective = metric_avgs.get("val/loss/objective")
        if objective is None:
            objective = 0.0
        metric_avgs["val/score"] = objective

    def _violation_eval_disabled(self):
        return self.violation_eval_p <= 0.0

    def _make_violation_eval_rng(self, case_idx=0):
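        # Offset the base seed by training progress and case index so the
        # sub-sampling is reproducible at a given step but varies across
        # steps and cases.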
        seed = self.violation_eval_seed
        if seed is None:
            return np.random.RandomState()
        offset = int(self.global_samples) + int(self.current_epoch) * 1000 + int(case_idx) * 100000
        seed = (int(seed) + offset) % (2**32)
        return np.random.RandomState(seed)

    def _should_eval_violations(self, rng, batch_idx, total_batches, eval_batches, min_eval_batches):
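        # Force evaluation of the trailing batches when random sampling
        # has not yet yielded min_eval_batches evaluated batches.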
        if total_batches is not None and total_batches > 0:
            remaining_batches = total_batches - batch_idx
            remaining_needed = min_eval_batches - eval_batches
            if remaining_needed > 0 and remaining_batches <= remaining_needed:
                return True
        return rng.random_sample() < self.violation_eval_p

    def _print_metric_groups(self, title, metric_avgs, groups):
        if not metric_avgs:
            return
        print(title)
        for names in groups:
            parts = []
            for name in names:
                if name in metric_avgs:
                    parts.append(f"{name}: {metric_avgs[name]:.4f}")
            if parts:
                print("    " + ", ".join(parts))

    def forward(self, batch):
        """Run a forward pass through the model.

        Handles both heterogeneous and homogeneous model types, performing
        the appropriate input conversion.

        Args:
            batch: PyG batch object on the training device.

        Returns:
            dict[str, torch.Tensor]: Per-node-type prediction tensors.
        """
        if self.model_type in HETERO_MODEL_TYPES:
            x_dict = {k: v.float() for k, v in batch.x_dict.items()}
            return self.model.module(x_dict, batch.edge_index_dict, minmax_scaling=self.minmax_scaling)
        if isinstance(batch, torch.Tensor) or hasattr(batch, "node_type"):
            homo_batch = batch
        else:
            homo_batch = convert_opf_to_homo(batch)
            homo_batch = homo_batch.to(self.device)

        if hasattr(homo_batch, "x"):
            homo_batch.x = homo_batch.x.float()
        if hasattr(homo_batch, "edge_attr") and homo_batch.edge_attr is not None:
            homo_batch.edge_attr = homo_batch.edge_attr.float()

        homo_output = self.model.module(homo_batch)

        predictions = {}
        node_types = ["bus", "generator", "load", "shunt"]
        for i, node_type in enumerate(node_types):
            mask = homo_batch.node_type == i
            if mask.any():
                predictions[node_type] = homo_output[mask]
        return predictions

    def save_checkpoint(self, filepath):
        """Save a training checkpoint to disk (rank-0 only).

        Args:
            filepath (str): Destination path for the ``.pt`` checkpoint file.
        """
        if self.global_rank == 0:
            checkpoint = self._checkpoint_payload()
            torch.save(checkpoint, filepath)
            print(f"Checkpoint saved to {filepath}")
            self._on_checkpoint_saved(filepath)

    def train(self):
        """Run the full training loop across all epochs.

        Iterates for ``max_epochs`` epochs, calling ``train_epoch`` each
        iteration.  Handles early stopping via patience, sample-budget
        limits, periodic validation, checkpoint saving, throughput
        measurement finalization, and W&B cleanup.
        """
        checkpoint_dir = self.checkpoint_dir
        if self.global_rank == 0:
            os.makedirs(checkpoint_dir, exist_ok=True)
        if self.throughput_tracker:
            self.throughput_tracker.write_metadata()

        for epoch in range(self.max_epochs):
            self.current_epoch = epoch

            train_loss, train_task_loss, train_metrics = self.train_epoch(epoch)

            if self._should_print_epoch():
                print(f"\nEpoch {epoch}:")
                print(f"  Train Loss: {train_loss:.4f}, Train Task: {train_task_loss:.4f}")
                self._print_metric_groups("  Train Metrics:", train_metrics, self.train_metric_groups)
            if self.stop_training:
                if self.max_global_samples > 0 and self.global_samples >= self.max_global_samples:
                    if self.global_rank == 0:
                        print(
                            f"\nStopping after reaching max_global_samples={self.max_global_samples} "
                            f"(global_samples={self.global_samples})"
                        )
                dist.barrier()
                break

            dist.barrier()

        if self.throughput_tracker:
            self.throughput_tracker.finalize(partial=True)

        if self.wandb_enabled and wandb is not None:
            wandb.finish()

forward(batch)

Run a forward pass through the model.

Handles both heterogeneous and homogeneous model types, performing the appropriate input conversion.

Parameters:

Name Type Description Default
batch

PyG batch object on the training device.

required

Returns:

Type Description

dict[str, torch.Tensor]: Per-node-type prediction tensors.

Source code in lumina/trainer/opf/trainer.py
def forward(self, batch):
    """Run a forward pass through the model.

    Handles both heterogeneous and homogeneous model types, performing
    the appropriate input conversion.

    Args:
        batch: PyG batch object on the training device.

    Returns:
        dict[str, torch.Tensor]: Per-node-type prediction tensors.
    """
    if self.model_type in HETERO_MODEL_TYPES:
        x_dict = {k: v.float() for k, v in batch.x_dict.items()}
        return self.model.module(x_dict, batch.edge_index_dict, minmax_scaling=self.minmax_scaling)
    if isinstance(batch, torch.Tensor) or hasattr(batch, "node_type"):
        homo_batch = batch
    else:
        homo_batch = convert_opf_to_homo(batch)
        homo_batch = homo_batch.to(self.device)

    if hasattr(homo_batch, "x"):
        homo_batch.x = homo_batch.x.float()
    if hasattr(homo_batch, "edge_attr") and homo_batch.edge_attr is not None:
        homo_batch.edge_attr = homo_batch.edge_attr.float()

    homo_output = self.model.module(homo_batch)

    predictions = {}
    node_types = ["bus", "generator", "load", "shunt"]
    for i, node_type in enumerate(node_types):
        mask = homo_batch.node_type == i
        if mask.any():
            predictions[node_type] = homo_output[mask]
    return predictions
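
A hedged sketch of consuming forward()'s output, assuming a constructed trainer instance named trainer; the node types and shapes shown are illustrative and depend on the dataset:

predictions = trainer.forward(batch)
for node_type, preds in predictions.items():
    # e.g. "bus" -> tensor of shape (num_bus_nodes, out_dim)
    print(node_type, tuple(preds.shape))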

save_checkpoint(filepath)

Save a training checkpoint to disk (rank-0 only).

Parameters:

Name Type Description Default
filepath str

Destination path for the .pt checkpoint file.

required
Source code in lumina/trainer/opf/trainer.py
def save_checkpoint(self, filepath):
    """Save a training checkpoint to disk (rank-0 only).

    Args:
        filepath (str): Destination path for the ``.pt`` checkpoint file.
    """
    if self.global_rank == 0:
        checkpoint = self._checkpoint_payload()
        torch.save(checkpoint, filepath)
        print(f"Checkpoint saved to {filepath}")
        self._on_checkpoint_saved(filepath)
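
A minimal restore sketch; the payload keys come from _checkpoint_payload() and are not documented here, so inspect them before assuming any structure (the path below is illustrative):

checkpoint = torch.load("checkpoints/best-case-epoch03-val0.1234.pt", map_location="cpu")
print(sorted(checkpoint))  # list available keys before restoring any state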

train()

Run the full training loop across all epochs.

Iterates for max_epochs epochs, calling train_epoch each iteration. Handles early stopping via patience, sample-budget limits, periodic validation, checkpoint saving, throughput measurement finalization, and W&B cleanup.

Source code in lumina/trainer/opf/trainer.py
def train(self):
    """Run the full training loop across all epochs.

    Iterates for ``max_epochs`` epochs, calling ``train_epoch`` each
    iteration.  Handles early stopping via patience, sample-budget
    limits, periodic validation, checkpoint saving, throughput
    measurement finalization, and W&B cleanup.
    """
    checkpoint_dir = self.checkpoint_dir
    if self.global_rank == 0:
        os.makedirs(checkpoint_dir, exist_ok=True)
    if self.throughput_tracker:
        self.throughput_tracker.write_metadata()

    for epoch in range(self.max_epochs):
        self.current_epoch = epoch

        train_loss, train_task_loss, train_metrics = self.train_epoch(epoch)

        if self._should_print_epoch():
            print(f"\nEpoch {epoch}:")
            print(f"  Train Loss: {train_loss:.4f}, Train Task: {train_task_loss:.4f}")
            self._print_metric_groups("  Train Metrics:", train_metrics, self.train_metric_groups)
        if self.stop_training:
            if self.max_global_samples > 0 and self.global_samples >= self.max_global_samples:
                if self.global_rank == 0:
                    print(
                        f"\nStopping after reaching max_global_samples={self.max_global_samples} "
                        f"(global_samples={self.global_samples})"
                    )
            dist.barrier()
            break

        dist.barrier()

    if self.throughput_tracker:
        self.throughput_tracker.finalize(partial=True)

    if self.wandb_enabled and wandb is not None:
        wandb.finish()
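
For intuition on the sample-based schedules that train() relies on, a worked sketch with illustrative values:

samples_per_epoch = 10_000
val_every_n_epochs = 2
# _init_sample_schedules derives:
val_every_n_samples = val_every_n_epochs * samples_per_epoch  # 20_000
# _maybe_run_validation then fires each time global_samples crosses a
# 20,000-sample boundary, independent of epoch boundaries.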

OPFTrainer

Bases: BaseOPFTrainer

Single-case OPF trainer for DDP training on one power-grid topology.

Manages a single dataset (one case name with one or more data groups), creates a single loss manager, and provides train_epoch / validate implementations that iterate over one train/val loader pair.

Parameters:

Name Type Description Default
config dict

Full training configuration (parsed YAML).

required
case_name str

Fully-qualified PGLib case name.

required
group_ids list[int] | int

Data group identifiers to load.

required
model_type str

Model architecture identifier.

required
loss_type str

Loss function name.

'mse'
minmax_scaling bool

Whether to apply min-max scaling.

True
local_rank int

Local GPU rank for DDP.

0
global_rank int

Global process rank for DDP.

0
world_size int

Total number of DDP processes.

1
wandb_run_name str

Custom W&B run name.

None
wandb_group_name str

W&B run group.

None
wandb_requested bool

Whether W&B logging was requested.

False
wandb_project str

W&B project name.

'lumina-training'
wandb_entity str

W&B entity/team name.

None
run_metadata dict

Extra metadata to log with the run.

None
Source code in lumina/trainer/opf/trainer.py
class OPFTrainer(BaseOPFTrainer):
    """Single-case OPF trainer for DDP training on one power-grid topology.

    Manages a single dataset (one case name with one or more data groups),
    creates a single loss manager, and provides ``train_epoch`` /
    ``validate`` implementations that iterate over one train/val loader pair.

    Args:
        config (dict): Full training configuration (parsed YAML).
        case_name (str): Fully-qualified PGLib case name.
        group_ids (list[int] | int): Data group identifiers to load.
        model_type (str): Model architecture identifier.
        loss_type (str): Loss function name.
        minmax_scaling (bool): Whether to apply min-max scaling.
        local_rank (int): Local GPU rank for DDP.
        global_rank (int): Global process rank for DDP.
        world_size (int): Total number of DDP processes.
        wandb_run_name (str, optional): Custom W&B run name.
        wandb_group_name (str, optional): W&B run group.
        wandb_requested (bool): Whether W&B logging was requested.
        wandb_project (str): W&B project name.
        wandb_entity (str, optional): W&B entity/team name.
        run_metadata (dict, optional): Extra metadata to log with the run.
    """

    def __init__(
        self,
        config,
        case_name,
        group_ids,
        model_type,
        loss_type="mse",
        minmax_scaling=True,
        local_rank=0,
        global_rank=0,
        world_size=1,
        wandb_run_name=None,
        wandb_group_name=None,
        wandb_requested=False,
        wandb_project="lumina-training",
        wandb_entity=None,
        run_metadata=None,
    ):
        self.case_name = case_name
        self.group_ids = _normalize_group_ids(group_ids)
        if not self.group_ids:
            raise ValueError("group_ids must contain at least one group id.")
        super().__init__(
            config=config,
            model_type=model_type,
            loss_type=loss_type,
            minmax_scaling=minmax_scaling,
            local_rank=local_rank,
            global_rank=global_rank,
            world_size=world_size,
            wandb_run_name=wandb_run_name,
            wandb_group_name=wandb_group_name,
            wandb_requested=wandb_requested,
            wandb_project=wandb_project,
            wandb_entity=wandb_entity,
            run_metadata=run_metadata,
        )

    def _load_data(self):
        if self._use_sharded_backend():
            self.sharded_splits = self._load_sharded_splits(self.case_name, self.group_ids)
            if self.global_rank == 0:
                counts = {
                    split: sum(shard.num_samples for shard in shards)
                    for split, shards in self.sharded_splits.items()
                }
                print(
                    "Sharded dataset loaded: "
                    f"train={counts.get('train', 0)}, "
                    f"val={counts.get('val', 0)}, "
                    f"test={counts.get('test', 0)} samples"
                )
            return
        dataset_cls = self._select_dataset_cls()
        build_kwargs = self._make_dataset_kwargs(dataset_cls, self.config["root"], self.case_name)
        processed_suffix = self.on_disk_homo_suffix if dataset_cls is OPFOnDiskHomogeneousDataset else None
        dataset_root = self._stage_on_disk(
            self.case_name,
            self.group_ids,
            dataset_cls,
            build_kwargs,
            processed_suffix,
        )
        self._log_dataset_choice(self.case_name, dataset_cls, dataset_root, processed_suffix=processed_suffix)
        dataset_kwargs = dict(build_kwargs)
        dataset_kwargs["root"] = dataset_root

        def build_dataset():
            if len(self.group_ids) == 1:
                return dataset_cls(group_id=self.group_ids[0], **dataset_kwargs)
            return OPFMultiDataset.from_case_groups(
                group_ids=self.group_ids,
                dataset_cls=dataset_cls,
                **dataset_kwargs,
            )

        if dist.is_available() and dist.is_initialized() and self.world_size > 1:
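            # Rank 0 builds first so any one-time preprocessing/caching
            # happens once; the other ranks construct from the cached
            # result after the barrier.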
            if self.global_rank == 0:
                self.dataset = build_dataset()
            dist.barrier()
            if self.global_rank != 0:
                self.dataset = build_dataset()
        else:
            self.dataset = build_dataset()
        if self.global_rank == 0:
            print(f"Dataset loaded: {len(self.dataset)} samples")

    def _create_dataloaders(self):
        if self._use_sharded_backend():
            loader_config = self.config["loader"]
            if self.model_type not in HETERO_MODEL_TYPES and not self.use_precomputed_homo:
                raise ValueError(
                    "Sharded backend requires precomputed homogeneous shards when using homo models. "
                    "Set use_precomputed_homo=true or switch backend."
                )
            case_id = 0
            self.train_dataset = CaseTaggedIterableDataset(
                OPFShardedIterableDataset(
                    self.sharded_splits["train"],
                    shuffle_shards=loader_config["shuffle"],
                    seed=self.sharded_split_seed,
                ),
                case_id,
            )
            self.val_dataset = CaseTaggedIterableDataset(
                OPFShardedIterableDataset(
                    self.sharded_splits.get("val", []),
                    shuffle_shards=False,
                    seed=self.sharded_split_seed,
                ),
                case_id,
            )
            self.val_dataset = self._maybe_limit_val_dataset(
                self.val_dataset,
                case_idx=case_id,
                case_label=self.case_name,
            )
            self.test_dataset = CaseTaggedIterableDataset(
                OPFShardedIterableDataset(
                    self.sharded_splits.get("test", []),
                    shuffle_shards=False,
                    seed=self.sharded_split_seed,
                ),
                case_id,
            )
            self.train_sampler = None
            self.val_sampler = None
            self.train_loader = DataLoader(self.train_dataset, **self._loader_kwargs(loader_config))
            self.val_loader = DataLoader(self.val_dataset, **self._loader_kwargs(loader_config))
            self.test_loader = DataLoader(self.test_dataset, **self._loader_kwargs(loader_config))
            self.dataset = self.train_dataset
            return

        n_samples = len(self.dataset)
        n_train = int(n_samples * self.config["train_split"])
        n_val = int(n_samples * self.config["val_split"])

        train_dataset = torch.utils.data.Subset(self.dataset, range(n_train))
        val_dataset = torch.utils.data.Subset(self.dataset, range(n_train, n_train + n_val))
        test_dataset = torch.utils.data.Subset(self.dataset, range(n_train + n_val, n_samples))

        if self.model_type not in HETERO_MODEL_TYPES and not self.use_precomputed_homo:
            train_dataset = HomoOPFDataset(train_dataset)
            val_dataset = HomoOPFDataset(val_dataset)
            test_dataset = HomoOPFDataset(test_dataset)

        case_id = 0
        train_dataset = CaseTaggedDataset(train_dataset, case_id)
        val_dataset = CaseTaggedDataset(val_dataset, case_id)
        test_dataset = CaseTaggedDataset(test_dataset, case_id)
        val_dataset = self._maybe_limit_val_dataset(
            val_dataset,
            case_idx=case_id,
            case_label=self.case_name,
        )

        loader_config = self.config["loader"]

        self.train_sampler = DistributedSampler(
            train_dataset,
            num_replicas=self.world_size,
            rank=self.global_rank,
            shuffle=loader_config["shuffle"],
        )

        self.val_sampler = DistributedSampler(
            val_dataset,
            num_replicas=self.world_size,
            rank=self.global_rank,
            shuffle=False,
        )

        self.train_loader = DataLoader(
            train_dataset,
            sampler=self.train_sampler,
            **self._loader_kwargs(loader_config),
        )

        self.val_loader = DataLoader(
            val_dataset,
            sampler=self.val_sampler,
            **self._loader_kwargs(loader_config),
        )

        self.test_loader = DataLoader(
            test_dataset,
            shuffle=False,
            **self._loader_kwargs(loader_config),
        )

    def _create_model(self):
        if self._use_sharded_backend():
            sample_data = self.train_dataset.peek()
            metadata = self.train_dataset.metadata() if self.model_type in HETERO_MODEL_TYPES else None
        else:
            sample_data = self.dataset[0]
            metadata = self.dataset.metadata() if self.model_type in HETERO_MODEL_TYPES else None
        per_node_output_size = self._infer_output_dim(sample_data)
        self.model = self._build_model(sample_data, metadata, per_node_output_size)

    def _initialize_loss_managers(self):
        self.loss_manager = OPFLossManager(
            loss_type=self.loss_type,
            device=self.device,
            log_normalized_violation=self.log_normalized_violation,
        )

        if self.global_rank == 0:
            print(f"Loss Manager initialized with loss_type='{self.loss_type}'")

    def _checkpoint_tag(self):
        return self.case_name

    def _on_checkpoint_saved(self, filepath):
        if self.wandb_enabled and self.wandb_run is not None:
            try:
                self.wandb_run.summary["best_model_path"] = filepath
            except Exception:
                pass

    def train_epoch(self, epoch):
        """Execute one training epoch over the single-case dataset.

        Iterates through all batches in the training loader, performing
        forward/backward passes with gradient accumulation, non-finite
        loss/gradient handling, optional throughput tracking, and mid-epoch
        validation/checkpointing when sample-based schedules are active.

        Args:
            epoch (int): Current epoch index (used for sampler seeding and
                progress display).

        Returns:
            tuple[float, float, dict]: ``(avg_loss, avg_task_loss, metric_avgs)``
                aggregated across all DDP ranks.
        """
        self.model.train()
        if self.train_sampler is not None:
            self.train_sampler.set_epoch(epoch)
        elif hasattr(self.train_loader.dataset, "set_epoch"):
            self.train_loader.dataset.set_epoch(epoch)

        total_loss = 0.0
        total_task_loss = 0.0
        num_batches = 0
        total_steps = len(self.train_loader)
        metric_sums, metric_counts = self._init_metric_trackers(self.train_metric_names)
        step_start_time = None
        step_samples = 0
        accum_batches = 0
        tracker = self.throughput_tracker

        if self.global_rank == 0 and not self.wandb_enabled:
            pbar = tqdm(self.train_loader, desc=f"Epoch {epoch}")
        else:
            pbar = self.train_loader

        self.optimizer.zero_grad()

        for batch_idx, batch in enumerate(pbar):
            is_step_start = accum_batches == 0
            if is_step_start:
                if tracker:
                    tracker.maybe_start_measurement()
                    if tracker.measure_active():
                        tracker.accelerator_synchronize()
                        step_start_time = time.perf_counter()
                        step_samples = 0

            batch = batch.to(self.device)
            batch_samples = self._update_global_samples(batch)

            if tracker and tracker.measure_active():
                step_samples += tracker.get_batch_samples(batch)

            predictions = self.forward(batch)
            loss, loss_info = self.loss_manager.compute_loss(
                predictions,
                batch,
                return_info=True,
            )

            loss_value = loss.item()
            self._add_metric(metric_sums, metric_counts, "train/loss/total", loss_value)
            if self.log_every_n_samples and self.log_every_n_samples > 0:
                self._log_wandb_step(loss_value, loss_info)
            loss = loss / self.accumulate_grad_batches
            if self._handle_nonfinite_loss(loss, batch_idx):
                accum_batches = 0
                self._add_metric(metric_sums, metric_counts, "train/perf/nonfinite_loss_skips", 1.0)
                continue

            loss.backward()

            should_step = ((batch_idx + 1) % self.accumulate_grad_batches == 0) or (
                (batch_idx + 1) == total_steps
            )

            if should_step:
                if tracker and tracker.measure_active():
                    tracker.accelerator_synchronize()
                if not self._ensure_finite_gradients(batch_idx):
                    accum_batches = 0
                    self._add_metric(metric_sums, metric_counts, "train/perf/nonfinite_grad_skips", 1.0)
                    continue
                self.optimizer.step()
                if self.scheduler is not None:
                    self.scheduler.step()
                self.optimizer.zero_grad()

                self.global_step += 1
                if not self.log_every_n_samples or self.log_every_n_samples <= 0:
                    self._log_wandb_step(loss_value, loss_info)
                if tracker:
                    step_metrics = None
                    if tracker.measure_active():
                        tracker.accelerator_synchronize()
                        step_time = time.perf_counter() - step_start_time
                        total_samples = step_samples * self.world_size
                        samples_per_sec = total_samples / step_time if step_time > 0 else 0.0
                        step_metrics = {
                            "throughput/samples_per_sec": samples_per_sec,
                        }
                    tracker.on_step_end(step_metrics)
                accum_batches = 0
            else:
                accum_batches += 1

            total_loss += loss_value
            if "objective" in loss_info:
                objective_value = self._as_float(loss_info["objective"])
                if objective_value is not None:
                    total_task_loss += objective_value
                    self._add_metric(metric_sums, metric_counts, "train/loss/objective", objective_value)
            for info_key, metric_name in self.train_metric_map:
                if info_key in loss_info:
                    self._add_metric(metric_sums, metric_counts, metric_name, loss_info[info_key])
            num_batches += 1

            if self.global_rank == 0 and not self.wandb_enabled:
                if self.log_every_n_steps and self.log_every_n_steps > 0:
                    if self.global_step % self.log_every_n_steps == 0:
                        pbar.set_postfix({"loss": loss_value})
                else:
                    pbar.set_postfix({"loss": loss_value})

            if self._maybe_run_validation() and self.stop_training:
                break
            self._maybe_save_periodic_checkpoint()
            if self._maybe_stop_by_samples():
                break

        avg_loss = total_loss / num_batches
        avg_task_loss = total_task_loss / num_batches

        loss_tensor = torch.tensor([avg_loss, avg_task_loss], device=self.device)
        dist.all_reduce(loss_tensor, op=dist.ReduceOp.AVG)

        metric_sums, metric_counts = self._reduce_metrics(metric_sums, metric_counts)
        metric_avgs = self._compute_metric_avgs(metric_sums, metric_counts)

        return loss_tensor[0].item(), loss_tensor[1].item(), metric_avgs

    def validate(self):
        """Run validation over the single-case validation loader.

        Evaluates constraint violations on a (possibly sub-sampled) set of
        validation batches.  Timing information is optionally collected per
        batch.  Results are reduced across all DDP ranks.

        Returns:
            tuple[float | None, float | None, dict | None]:
                ``(avg_loss, avg_task_loss, metric_avgs)``, or
                ``(None, None, None)`` when validation is disabled or no
                batches were evaluated.
        """
        self.model.eval()

        if self.violation_eval_p <= 0.0:
            return None, None, None

        total_loss = 0.0
        total_task_loss = 0.0
        num_batches = 0
        metric_sums, metric_counts = self._init_metric_trackers(self.val_metric_names)
        timed_batches = 0
        rng = self._make_violation_eval_rng()
        try:
            total_batches = len(self.val_loader)
        except Exception:
            total_batches = None
        min_eval_batches = self.violation_eval_min_batches
        if total_batches is not None and total_batches > 0:
            min_eval_batches = min(min_eval_batches, total_batches)
        eval_batches = 0

        with torch.no_grad():
            for batch_idx, batch in enumerate(self.val_loader):
                do_eval = self._should_eval_violations(
                    rng,
                    batch_idx,
                    total_batches,
                    eval_batches,
                    min_eval_batches,
                )
                if not do_eval:
                    continue
                eval_batches += 1

                do_timing = self._should_time_validation_batch(batch_idx, timed_batches)
                if do_timing:
                    self._sync_for_timing()
                    timing_start = time.perf_counter()
                batch = batch.to(self.device)
                if do_timing:
                    self._sync_for_timing()
                    timing_after_data = time.perf_counter()

                predictions = self.forward(batch)
                if do_timing:
                    self._sync_for_timing()
                    timing_after_forward = time.perf_counter()
                loss, loss_info = self.loss_manager.compute_loss(
                    predictions,
                    batch,
                    return_info=True,
                )
                if do_timing:
                    self._sync_for_timing()
                    timing_after_loss = time.perf_counter()
                    self._add_metric(
                        metric_sums,
                        metric_counts,
                        "val/perf/data_ms",
                        (timing_after_data - timing_start) * 1000.0,
                    )
                    self._add_metric(
                        metric_sums,
                        metric_counts,
                        "val/perf/forward_ms",
                        (timing_after_forward - timing_after_data) * 1000.0,
                    )
                    self._add_metric(
                        metric_sums,
                        metric_counts,
                        "val/perf/loss_ms",
                        (timing_after_loss - timing_after_forward) * 1000.0,
                    )
                    self._add_metric(
                        metric_sums,
                        metric_counts,
                        "val/perf/total_ms",
                        (timing_after_loss - timing_start) * 1000.0,
                    )
                    timed_batches += 1

                loss_value = loss.item()
                total_loss += loss_value
                self._add_metric(metric_sums, metric_counts, "val/loss/total", loss_value)
                if "objective" in loss_info:
                    objective_value = self._as_float(loss_info["objective"])
                    if objective_value is not None:
                        total_task_loss += objective_value
                        self._add_metric(metric_sums, metric_counts, "val/loss/objective", objective_value)
                for info_key, metric_name in self.val_metric_map:
                    if info_key in loss_info:
                        self._add_metric(metric_sums, metric_counts, metric_name, loss_info[info_key])
                num_batches += 1

        totals = torch.tensor(
            [total_loss, total_task_loss, float(num_batches)],
            device=self.device,
        )
        dist.all_reduce(totals, op=dist.ReduceOp.SUM)
        total_loss, total_task_loss, total_batches = totals.tolist()
        if total_batches == 0:
            return None, None, None

        avg_loss = total_loss / total_batches
        avg_task_loss = total_task_loss / total_batches

        metric_sums, metric_counts = self._reduce_metrics(metric_sums, metric_counts)
        metric_avgs = self._compute_metric_avgs(metric_sums, metric_counts)
        metric_avgs["val/perf/eval_batches"] = float(total_batches)
        self._add_val_score(metric_avgs)

        return avg_loss, avg_task_loss, metric_avgs

train_epoch(epoch)

Execute one training epoch over the single-case dataset.

Iterates through all batches in the training loader, performing forward/backward passes with gradient accumulation, non-finite loss/gradient handling, optional throughput tracking, and mid-epoch validation/checkpointing when sample-based schedules are active.

Parameters:

Name Type Description Default
epoch int

Current epoch index (used for sampler seeding and progress display).

required

Returns:

Type Description

tuple[float, float, dict]: (avg_loss, avg_task_loss, metric_avgs) aggregated across all DDP ranks.

Source code in lumina/trainer/opf/trainer.py
def train_epoch(self, epoch):
    """Execute one training epoch over the single-case dataset.

    Iterates through all batches in the training loader, performing
    forward/backward passes with gradient accumulation, non-finite
    loss/gradient handling, optional throughput tracking, and mid-epoch
    validation/checkpointing when sample-based schedules are active.

    Args:
        epoch (int): Current epoch index (used for sampler seeding and
            progress display).

    Returns:
        tuple[float, float, dict]: ``(avg_loss, avg_task_loss, metric_avgs)``
            aggregated across all DDP ranks.
    """
    self.model.train()
    if self.train_sampler is not None:
        self.train_sampler.set_epoch(epoch)
    elif hasattr(self.train_loader.dataset, "set_epoch"):
        self.train_loader.dataset.set_epoch(epoch)

    total_loss = 0.0
    total_task_loss = 0.0
    num_batches = 0
    total_steps = len(self.train_loader)
    metric_sums, metric_counts = self._init_metric_trackers(self.train_metric_names)
    step_start_time = None
    step_samples = 0
    accum_batches = 0
    tracker = self.throughput_tracker

    if self.global_rank == 0 and not self.wandb_enabled:
        pbar = tqdm(self.train_loader, desc=f"Epoch {epoch}")
    else:
        pbar = self.train_loader

    self.optimizer.zero_grad()

    for batch_idx, batch in enumerate(pbar):
        is_step_start = accum_batches == 0
        if is_step_start:
            if tracker:
                tracker.maybe_start_measurement()
                if tracker.measure_active():
                    tracker.accelerator_synchronize()
                    step_start_time = time.perf_counter()
                    step_samples = 0

        batch = batch.to(self.device)
        batch_samples = self._update_global_samples(batch)

        if tracker and tracker.measure_active():
            step_samples += tracker.get_batch_samples(batch)

        predictions = self.forward(batch)
        loss, loss_info = self.loss_manager.compute_loss(
            predictions,
            batch,
            return_info=True,
        )

        loss_value = loss.item()
        self._add_metric(metric_sums, metric_counts, "train/loss/total", loss_value)
        if self.log_every_n_samples and self.log_every_n_samples > 0:
            self._log_wandb_step(loss_value, loss_info)
        loss = loss / self.accumulate_grad_batches
        if self._handle_nonfinite_loss(loss, batch_idx):
            accum_batches = 0
            self._add_metric(metric_sums, metric_counts, "train/perf/nonfinite_loss_skips", 1.0)
            continue

        loss.backward()

        should_step = ((batch_idx + 1) % self.accumulate_grad_batches == 0) or (
            (batch_idx + 1) == total_steps
        )

        if should_step:
            if tracker and tracker.measure_active():
                tracker.accelerator_synchronize()
            if not self._ensure_finite_gradients(batch_idx):
                accum_batches = 0
                self._add_metric(metric_sums, metric_counts, "train/perf/nonfinite_grad_skips", 1.0)
                continue
            self.optimizer.step()
            if self.scheduler is not None:
                self.scheduler.step()
            self.optimizer.zero_grad()

            self.global_step += 1
            if not self.log_every_n_samples or self.log_every_n_samples <= 0:
                self._log_wandb_step(loss_value, loss_info)
            if tracker:
                step_metrics = None
                if tracker.measure_active():
                    tracker.accelerator_synchronize()
                    step_time = time.perf_counter() - step_start_time
                    total_samples = step_samples * self.world_size
                    samples_per_sec = total_samples / step_time if step_time > 0 else 0.0
                    step_metrics = {
                        "throughput/samples_per_sec": samples_per_sec,
                    }
                tracker.on_step_end(step_metrics)
            accum_batches = 0
        else:
            accum_batches += 1

        total_loss += loss_value
        if "objective" in loss_info:
            objective_value = self._as_float(loss_info["objective"])
            if objective_value is not None:
                total_task_loss += objective_value
                self._add_metric(metric_sums, metric_counts, "train/loss/objective", objective_value)
        for info_key, metric_name in self.train_metric_map:
            if info_key in loss_info:
                self._add_metric(metric_sums, metric_counts, metric_name, loss_info[info_key])
        num_batches += 1

        if self.global_rank == 0 and not self.wandb_enabled:
            if self.log_every_n_steps and self.log_every_n_steps > 0:
                if self.global_step % self.log_every_n_steps == 0:
                    pbar.set_postfix({"loss": loss_value})
            else:
                pbar.set_postfix({"loss": loss_value})

        if self._maybe_run_validation() and self.stop_training:
            break
        self._maybe_save_periodic_checkpoint()
        if self._maybe_stop_by_samples():
            break

    avg_loss = total_loss / num_batches
    avg_task_loss = total_task_loss / num_batches

    loss_tensor = torch.tensor([avg_loss, avg_task_loss], device=self.device)
    dist.all_reduce(loss_tensor, op=dist.ReduceOp.AVG)

    metric_sums, metric_counts = self._reduce_metrics(metric_sums, metric_counts)
    metric_avgs = self._compute_metric_avgs(metric_sums, metric_counts)

    return loss_tensor[0].item(), loss_tensor[1].item(), metric_avgs

validate()

Run validation over the single-case validation loader.

Evaluates constraint violations on a (possibly sub-sampled) set of validation batches. Timing information is optionally collected per batch. Results are reduced across all DDP ranks.

Returns:

Type Description

tuple[float | None, float | None, dict | None]: (avg_loss, avg_task_loss, metric_avgs), or (None, None, None) when validation is disabled or no batches were evaluated.

Source code in lumina/trainer/opf/trainer.py
def validate(self):
    """Run validation over the single-case validation loader.

    Evaluates constraint violations on a (possibly sub-sampled) set of
    validation batches.  Timing information is optionally collected per
    batch.  Results are reduced across all DDP ranks.

    Returns:
        tuple[float | None, float | None, dict | None]:
            ``(avg_loss, avg_task_loss, metric_avgs)``, or
            ``(None, None, None)`` when validation is disabled or no
            batches were evaluated.
    """
    self.model.eval()

    if self.violation_eval_p <= 0.0:
        return None, None, None

    total_loss = 0.0
    total_task_loss = 0.0
    num_batches = 0
    metric_sums, metric_counts = self._init_metric_trackers(self.val_metric_names)
    timed_batches = 0
    rng = self._make_violation_eval_rng()
    try:
        total_batches = len(self.val_loader)
    except Exception:
        total_batches = None
    min_eval_batches = self.violation_eval_min_batches
    if total_batches is not None and total_batches > 0:
        min_eval_batches = min(min_eval_batches, total_batches)
    eval_batches = 0

    with torch.no_grad():
        for batch_idx, batch in enumerate(self.val_loader):
            do_eval = self._should_eval_violations(
                rng,
                batch_idx,
                total_batches,
                eval_batches,
                min_eval_batches,
            )
            if not do_eval:
                continue
            eval_batches += 1

            do_timing = self._should_time_validation_batch(batch_idx, timed_batches)
            if do_timing:
                self._sync_for_timing()
                timing_start = time.perf_counter()
            batch = batch.to(self.device)
            if do_timing:
                self._sync_for_timing()
                timing_after_data = time.perf_counter()

            predictions = self.forward(batch)
            if do_timing:
                self._sync_for_timing()
                timing_after_forward = time.perf_counter()
            loss, loss_info = self.loss_manager.compute_loss(
                predictions,
                batch,
                return_info=True,
            )
            if do_timing:
                self._sync_for_timing()
                timing_after_loss = time.perf_counter()
                self._add_metric(
                    metric_sums,
                    metric_counts,
                    "val/perf/data_ms",
                    (timing_after_data - timing_start) * 1000.0,
                )
                self._add_metric(
                    metric_sums,
                    metric_counts,
                    "val/perf/forward_ms",
                    (timing_after_forward - timing_after_data) * 1000.0,
                )
                self._add_metric(
                    metric_sums,
                    metric_counts,
                    "val/perf/loss_ms",
                    (timing_after_loss - timing_after_forward) * 1000.0,
                )
                self._add_metric(
                    metric_sums,
                    metric_counts,
                    "val/perf/total_ms",
                    (timing_after_loss - timing_start) * 1000.0,
                )
                timed_batches += 1

            loss_value = loss.item()
            total_loss += loss_value
            self._add_metric(metric_sums, metric_counts, "val/loss/total", loss_value)
            if "objective" in loss_info:
                objective_value = self._as_float(loss_info["objective"])
                if objective_value is not None:
                    total_task_loss += objective_value
                    self._add_metric(metric_sums, metric_counts, "val/loss/objective", objective_value)
            for info_key, metric_name in self.val_metric_map:
                if info_key in loss_info:
                    self._add_metric(metric_sums, metric_counts, metric_name, loss_info[info_key])
            num_batches += 1

    totals = torch.tensor(
        [total_loss, total_task_loss, float(num_batches)],
        device=self.device,
    )
    dist.all_reduce(totals, op=dist.ReduceOp.SUM)
    total_loss, total_task_loss, total_batches = totals.tolist()
    if total_batches == 0:
        return None, None, None

    avg_loss = total_loss / total_batches
    avg_task_loss = total_task_loss / total_batches

    metric_sums, metric_counts = self._reduce_metrics(metric_sums, metric_counts)
    metric_avgs = self._compute_metric_avgs(metric_sums, metric_counts)
    metric_avgs["val/perf/eval_batches"] = float(total_batches)
    self._add_val_score(metric_avgs)

    return avg_loss, avg_task_loss, metric_avgs
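
A minimal end-to-end sketch for OPFTrainer, assuming torch.distributed is already initialized and that config is a parsed YAML dict with the keys the trainer expects; the case name and rank variables below are illustrative:

trainer = OPFTrainer(
    config=config,
    case_name="pglib_opf_case14_ieee",
    group_ids=[0],
    model_type="HeteroGNN",
    local_rank=local_rank,
    global_rank=global_rank,
    world_size=world_size,
)
trainer.train()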

MultiCaseOPFTrainer

Bases: BaseOPFTrainer

Multi-case OPF trainer for joint training across multiple grid topologies.

Extends BaseOPFTrainer to handle multiple power-grid cases simultaneously. Each case has its own dataset, data loader, and loss manager, while sharing a single model and optimizer. Cases can be trained sequentially per epoch or interleaved every case_mix_every_n_steps batches.
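
As a rough illustration of the interleaving policy (not the trainer's actual implementation, which appears in the source below), mixing cases every case_mix_every_n_steps batches amounts to a round-robin over per-case loaders:

def interleave(case_loaders, n):
    # Draw n batches from each case's loader in turn until all are exhausted.
    iters = [iter(loader) for loader in case_loaders]
    while iters:
        for it in list(iters):
            for _ in range(n):
                try:
                    yield next(it)
                except StopIteration:
                    iters.remove(it)
                    break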

Parameters:

Name Type Description Default
config dict

Full training configuration (parsed YAML).

required
case_names list[str]

Fully-qualified PGLib case names.

required
group_ids list[int] | int

Data group identifiers to load.

required
model_type str

Model architecture identifier.

required
loss_type str

Loss function name.

'mse'
minmax_scaling bool

Whether to apply min-max scaling.

True
local_rank int

Local GPU rank for DDP.

0
global_rank int

Global process rank for DDP.

0
world_size int

Total number of DDP processes.

1
wandb_run_name str

Custom W&B run name.

None
wandb_group_name str

W&B run group.

None
wandb_requested bool

Whether W&B logging was requested.

False
wandb_project str

W&B project name.

'lumina-training'
wandb_entity str

W&B entity/team name.

None
run_metadata dict

Extra metadata to log with the run.

None
Source code in lumina/trainer/opf/trainer.py
class MultiCaseOPFTrainer(BaseOPFTrainer):
    """Multi-case OPF trainer for joint training across multiple grid topologies.

    Extends ``BaseOPFTrainer`` to handle multiple power-grid cases simultaneously.
    Each case has its own dataset, data loader, and loss manager, while sharing
    a single model and optimizer.  Cases can be trained sequentially per epoch or
    interleaved every ``case_mix_every_n_steps`` batches.

    Args:
        config (dict): Full training configuration (parsed YAML).
        case_names (list[str]): Fully-qualified PGLib case names.
        group_ids (list[int] | int): Data group identifiers to load.
        model_type (str): Model architecture identifier.
        loss_type (str): Loss function name.
        minmax_scaling (bool): Whether to apply min-max scaling.
        local_rank (int): Local GPU rank for DDP.
        global_rank (int): Global process rank for DDP.
        world_size (int): Total number of DDP processes.
        wandb_run_name (str, optional): Custom W&B run name.
        wandb_group_name (str, optional): W&B run group.
        wandb_requested (bool): Whether W&B logging was requested.
        wandb_project (str): W&B project name.
        wandb_entity (str, optional): W&B entity/team name.
        run_metadata (dict, optional): Extra metadata to log with the run.
    """

    def __init__(
        self,
        config,
        case_names,
        group_ids,
        model_type,
        loss_type="mse",
        minmax_scaling=True,
        local_rank=0,
        global_rank=0,
        world_size=1,
        wandb_run_name=None,
        wandb_group_name=None,
        wandb_requested=False,
        wandb_project="lumina-training",
        wandb_entity=None,
        run_metadata=None,
    ):
        self.case_names = list(case_names)
        self.case_keys = {idx: f"case_{idx}" for idx in range(len(self.case_names))}
        self.group_ids = _normalize_group_ids(group_ids)
        if not self.group_ids:
            raise ValueError("group_ids must contain at least one group id.")
        super().__init__(
            config=config,
            model_type=model_type,
            loss_type=loss_type,
            minmax_scaling=minmax_scaling,
            local_rank=local_rank,
            global_rank=global_rank,
            world_size=world_size,
            wandb_run_name=wandb_run_name,
            wandb_group_name=wandb_group_name,
            wandb_requested=wandb_requested,
            wandb_project=wandb_project,
            wandb_entity=wandb_entity,
            run_metadata=run_metadata,
        )

    def _load_dataset(self, case_name):
        dataset_cls = self._select_dataset_cls()
        build_kwargs = self._make_dataset_kwargs(dataset_cls, self.config["root"], case_name)
        processed_suffix = self.on_disk_homo_suffix if dataset_cls is OPFOnDiskHomogeneousDataset else None
        dataset_root = self._stage_on_disk(
            case_name,
            self.group_ids,
            dataset_cls,
            build_kwargs,
            processed_suffix,
        )
        self._log_dataset_choice(case_name, dataset_cls, dataset_root, processed_suffix=processed_suffix)
        dataset_kwargs = dict(build_kwargs)
        dataset_kwargs["root"] = dataset_root

        def build_dataset():
            if len(self.group_ids) == 1:
                return dataset_cls(group_id=self.group_ids[0], **dataset_kwargs)
            return OPFMultiDataset.from_case_groups(
                group_ids=self.group_ids,
                dataset_cls=dataset_cls,
                **dataset_kwargs,
            )
        if dist.is_available() and dist.is_initialized() and self.world_size > 1:
            if self.global_rank == 0:
                dataset = build_dataset()
            dist.barrier()
            if self.global_rank != 0:
                dataset = build_dataset()
        else:
            dataset = build_dataset()
        return dataset

    def _load_data(self):
        if self._use_sharded_backend():
            self.case_sharded_splits = {}
            reference_metadata = None
            reference_out_dim = None

            for case_idx, case_name in enumerate(self.case_names):
                splits = self._load_sharded_splits(case_name, self.group_ids)
                self.case_sharded_splits[case_idx] = splits

                sample_shards = splits["train"] or splits.get("val", []) or splits.get("test", [])
                if not sample_shards:
                    raise ValueError(f"Sharded dataset for case {case_name} is empty.")

                sample_dataset = OPFShardedIterableDataset(sample_shards, shuffle_shards=False)
                sample = sample_dataset.peek()
                out_dim = self._infer_output_dim(sample)

                if self.model_type in HETERO_MODEL_TYPES:
                    metadata = sample_dataset.metadata()
                    if reference_metadata is None:
                        reference_metadata = metadata
                        reference_out_dim = out_dim
                    else:
                        if metadata != reference_metadata:
                            raise ValueError(
                                f"Dataset metadata mismatch between cases. {case_name} does not share the same schema."
                            )
                        if out_dim != reference_out_dim:
                            raise ValueError(
                                f"Output dimension mismatch for case {case_name} "
                                f"(expected {reference_out_dim}, found {out_dim})."
                            )
                else:
                    if reference_out_dim is None:
                        reference_out_dim = out_dim
                    elif out_dim != reference_out_dim:
                        raise ValueError(
                            f"Output dimension mismatch for case {case_name} "
                            f"(expected {reference_out_dim}, found {out_dim})."
                        )

                if self.global_rank == 0:
                    counts = {split: sum(shard.num_samples for shard in shards) for split, shards in splits.items()}
                    print(
                        f"Sharded dataset loaded for {case_name}: "
                        f"train={counts.get('train', 0)}, "
                        f"val={counts.get('val', 0)}, "
                        f"test={counts.get('test', 0)} samples"
                    )

            self.reference_metadata = reference_metadata
            self.reference_output_dim = reference_out_dim
            return

        self.case_datasets = []
        reference_metadata = None
        reference_out_dim = None

        for case_name in self.case_names:
            dataset = self._load_dataset(case_name)
            if len(dataset) == 0:
                raise ValueError(f"Dataset for case {case_name} is empty.")

            if self.global_rank == 0:
                print(f"Dataset loaded for {case_name}: {len(dataset)} samples")

            sample = dataset[0]
            out_dim = self._infer_output_dim(sample)

            if self.model_type in HETERO_MODEL_TYPES:
                metadata = dataset.metadata()
                if reference_metadata is None:
                    reference_metadata = metadata
                    reference_out_dim = out_dim
                else:
                    if metadata != reference_metadata:
                        raise ValueError(
                            f"Dataset metadata mismatch between cases. {case_name} does not share the same schema."
                        )
                    if out_dim != reference_out_dim:
                        raise ValueError(
                            f"Output dimension mismatch for case {case_name} "
                            f"(expected {reference_out_dim}, found {out_dim})."
                        )
            else:
                if reference_out_dim is None:
                    reference_out_dim = out_dim
                elif out_dim != reference_out_dim:
                    raise ValueError(
                        f"Output dimension mismatch for case {case_name} "
                        f"(expected {reference_out_dim}, found {out_dim})."
                    )

            self.case_datasets.append(dataset)

        self.reference_metadata = reference_metadata
        self.reference_output_dim = reference_out_dim

    def _create_dataloaders(self):
        self.train_loaders = {}
        self.val_loaders = {}
        self.test_loaders = {}
        self.train_samplers = {}
        self.val_samplers = {}
        self.train_case_indices = []
        self.val_case_indices = []
        self.test_case_indices = []

        loader_config = self.config["loader"]
        train_ratio = self.config["train_split"]
        val_ratio = self.config["val_split"]
        split_seed = int(self.config.get("split_seed", 42))

        if self._use_sharded_backend():
            if self.model_type not in HETERO_MODEL_TYPES and not self.use_precomputed_homo:
                raise ValueError(
                    "Sharded backend requires precomputed homogeneous shards when using homo models. "
                    "Set use_precomputed_homo=true or switch backend."
                )
            for case_idx, splits in self.case_sharded_splits.items():
                train_shards = splits.get("train", [])
                val_shards = splits.get("val", [])
                test_shards = splits.get("test", [])
                if not train_shards:
                    continue
                train_dataset = CaseTaggedIterableDataset(
                    OPFShardedIterableDataset(
                        train_shards,
                        shuffle_shards=loader_config["shuffle"],
                        seed=self.sharded_split_seed + case_idx,
                    ),
                    case_idx,
                )
                self.train_samplers[case_idx] = None
                self.train_loaders[case_idx] = DataLoader(
                    train_dataset,
                    **self._loader_kwargs(loader_config),
                )
                self.train_case_indices.append(case_idx)

                if val_shards:
                    val_dataset = CaseTaggedIterableDataset(
                        OPFShardedIterableDataset(
                            val_shards,
                            shuffle_shards=False,
                            seed=self.sharded_split_seed + case_idx,
                        ),
                        case_idx,
                    )
                    val_dataset = self._maybe_limit_val_dataset(
                        val_dataset,
                        case_idx=case_idx,
                        case_label=self.case_names[case_idx] if case_idx < len(self.case_names) else None,
                    )
                    self.val_samplers[case_idx] = None
                    self.val_loaders[case_idx] = DataLoader(
                        val_dataset,
                        **self._loader_kwargs(loader_config),
                    )
                    self.val_case_indices.append(case_idx)

                if test_shards:
                    test_dataset = CaseTaggedIterableDataset(
                        OPFShardedIterableDataset(
                            test_shards,
                            shuffle_shards=False,
                            seed=self.sharded_split_seed + case_idx,
                        ),
                        case_idx,
                    )
                    self.test_loaders[case_idx] = DataLoader(
                        test_dataset,
                        **self._loader_kwargs(loader_config),
                    )
                    self.test_case_indices.append(case_idx)
            return

        for case_idx, dataset in enumerate(self.case_datasets):
            n_samples = len(dataset)
            train_len = max(1, int(n_samples * train_ratio))
            val_len = int(n_samples * val_ratio)
            if train_len + val_len >= n_samples:
                val_len = max(0, n_samples - train_len - 1)
            test_len = n_samples - train_len - val_len

            generator = torch.Generator().manual_seed(split_seed + case_idx)
            subsets = torch.utils.data.random_split(dataset, [train_len, val_len, test_len], generator=generator)
            train_dataset, val_dataset, test_dataset = subsets

            if self.model_type not in HETERO_MODEL_TYPES and not self.use_precomputed_homo:
                train_dataset = HomoOPFDataset(train_dataset)
                if val_len > 0:
                    val_dataset = HomoOPFDataset(val_dataset)
                if test_len > 0:
                    test_dataset = HomoOPFDataset(test_dataset)

            train_dataset = CaseTaggedDataset(train_dataset, case_idx)
            if val_len > 0:
                val_dataset = CaseTaggedDataset(val_dataset, case_idx)
                val_dataset = self._maybe_limit_val_dataset(
                    val_dataset,
                    case_idx=case_idx,
                    case_label=self.case_names[case_idx] if case_idx < len(self.case_names) else None,
                )
            if test_len > 0:
                test_dataset = CaseTaggedDataset(test_dataset, case_idx)

            self.train_samplers[case_idx] = DistributedSampler(
                train_dataset,
                num_replicas=self.world_size,
                rank=self.global_rank,
                shuffle=loader_config["shuffle"],
            )
            self.train_loaders[case_idx] = DataLoader(
                train_dataset,
                sampler=self.train_samplers[case_idx],
                **self._loader_kwargs(loader_config),
            )
            self.train_case_indices.append(case_idx)

            if val_len > 0:
                self.val_samplers[case_idx] = DistributedSampler(
                    val_dataset,
                    num_replicas=self.world_size,
                    rank=self.global_rank,
                    shuffle=False,
                )
                self.val_loaders[case_idx] = DataLoader(
                    val_dataset,
                    sampler=self.val_samplers[case_idx],
                    **self._loader_kwargs(loader_config),
                )
                self.val_case_indices.append(case_idx)

            if test_len > 0:
                self.test_loaders[case_idx] = DataLoader(
                    test_dataset,
                    shuffle=False,
                    **self._loader_kwargs(loader_config),
                )
                self.test_case_indices.append(case_idx)

    def _create_model(self):
        if self._use_sharded_backend():
            first_case = sorted(self.case_sharded_splits.keys())[0]
            splits = self.case_sharded_splits[first_case]
            sample_shards = splits.get("train", []) or splits.get("val", []) or splits.get("test", [])
            sample_dataset = OPFShardedIterableDataset(sample_shards, shuffle_shards=False)
            sample_data = sample_dataset.peek()
        else:
            sample_data = self.case_datasets[0][0]
        per_node_output_size = self.reference_output_dim
        metadata = self.reference_metadata if self.model_type in HETERO_MODEL_TYPES else None
        self.model = self._build_model(sample_data, metadata, per_node_output_size)

    def _initialize_loss_managers(self):
        self.loss_managers = {}

        for case_idx in range(len(self.case_names)):
            self.loss_managers[case_idx] = OPFLossManager(
                loss_type=self.loss_type,
                device=self.device,
                log_normalized_violation=self.log_normalized_violation,
            )

        if self.global_rank == 0:
            print(f"Loss Managers initialized with loss_type='{self.loss_type}'")

    def _default_wandb_run_name(self):
        case_tag = f"{len(self.case_names)}cases"
        return f"acopf-ddp-{case_tag}-{self.model_type}-{self.loss_type}"

    def _should_print_epoch(self):
        return self.global_rank == 0 and not self.wandb_enabled

    def _checkpoint_tag(self):
        if len(self.case_names) == 1:
            return self.case_names[0]
        return f"multi{len(self.case_names)}cases"

    def _checkpoint_payload(self):
        payload = super()._checkpoint_payload()
        payload["case_names"] = list(self.case_names)
        return payload

    def train_epoch(self, epoch):
        """Execute one training epoch across all cases.

        Cases are either iterated sequentially (each case exhausted before
        the next) or interleaved every ``case_mix_every_n_steps`` batches,
        depending on configuration.

        Args:
            epoch (int): Current epoch index.

        Returns:
            tuple[float, float, dict]: ``(avg_loss, avg_task_loss, metric_avgs)``
                aggregated across all cases and DDP ranks.
        """
        self.model.train()

        total_loss = 0.0
        total_task_loss = 0.0
        num_batches = 0
        metric_sums, metric_counts = self._init_metric_trackers(self.train_metric_names)
        tracker = self.throughput_tracker
        accum_batches = 0
        step_start_time = None
        step_samples = 0

        def run_batch(case_idx, batch, batch_idx, total_steps, pbar, advance_pbar, include_case_name):
            nonlocal accum_batches, step_start_time, step_samples, total_loss, total_task_loss, num_batches
            is_step_start = accum_batches == 0
            if is_step_start and tracker:
                tracker.maybe_start_measurement()
                if tracker.measure_active():
                    tracker.accelerator_synchronize()
                    step_start_time = time.perf_counter()
                    step_samples = 0

            batch = batch.to(self.device)
            batch_samples = self._update_global_samples(batch)

            if tracker and tracker.measure_active():
                step_samples += tracker.get_batch_samples(batch)

            predictions = self.forward(batch)
            loss, loss_info = self.loss_managers[case_idx].compute_loss(
                predictions,
                batch,
                return_info=True,
            )

            loss_value = loss.item()
            self._add_metric(metric_sums, metric_counts, "train/loss/total", loss_value)
            if self.log_every_n_samples and self.log_every_n_samples > 0:
                self._log_wandb_step(loss_value, loss_info)
            loss = loss / self.accumulate_grad_batches
            case_name = self.case_names[case_idx] if include_case_name else None
            if self._handle_nonfinite_loss(loss, batch_idx, case_name=case_name):
                accum_batches = 0
                self._add_metric(metric_sums, metric_counts, "train/perf/nonfinite_loss_skips", 1.0)
                return
            loss.backward()

            should_step = ((batch_idx + 1) % self.accumulate_grad_batches == 0) or ((batch_idx + 1) == total_steps)

            if should_step:
                if tracker and tracker.measure_active():
                    tracker.accelerator_synchronize()
                if not self._ensure_finite_gradients(batch_idx, case_name=case_name):
                    accum_batches = 0
                    self._add_metric(metric_sums, metric_counts, "train/perf/nonfinite_grad_skips", 1.0)
                    return
                self.optimizer.step()
                if self.scheduler is not None:
                    self.scheduler.step()
                self.optimizer.zero_grad()

                self.global_step += 1
                if not self.log_every_n_samples or self.log_every_n_samples <= 0:
                    self._log_wandb_step(loss_value, loss_info)
                if tracker:
                    step_metrics = None
                    if tracker.measure_active() and step_start_time is not None:
                        tracker.accelerator_synchronize()
                        step_time = time.perf_counter() - step_start_time
                        total_samples = step_samples * self.world_size
                        samples_per_sec = total_samples / step_time if step_time > 0 else 0.0
                        step_metrics = {"throughput/samples_per_sec": samples_per_sec}
                    tracker.on_step_end(step_metrics)
                accum_batches = 0
            else:
                accum_batches += 1

            total_loss += loss_value
            if "objective" in loss_info:
                objective_value = self._as_float(loss_info["objective"])
                if objective_value is not None:
                    total_task_loss += objective_value
                    self._add_metric(metric_sums, metric_counts, "train/loss/objective", objective_value)
            for info_key, metric_name in self.train_metric_map:
                if info_key in loss_info:
                    self._add_metric(metric_sums, metric_counts, metric_name, loss_info[info_key])
            num_batches += 1

            if pbar is not None and self.global_rank == 0 and not self.wandb_enabled:
                postfix = {"loss": loss_value}
                if include_case_name:
                    postfix["case"] = self.case_names[case_idx]
                if self.log_every_n_steps and self.log_every_n_steps > 0:
                    if self.global_step % self.log_every_n_steps == 0:
                        pbar.set_postfix(postfix)
                else:
                    pbar.set_postfix(postfix)
                if advance_pbar:
                    pbar.update(1)

            if self._maybe_run_validation() and self.stop_training:
                return True
            self._maybe_save_periodic_checkpoint()
            if self._maybe_stop_by_samples():
                return True
            return False

        mix_every = self.case_mix_every_n_steps
        if mix_every <= 0 or len(self.train_case_indices) <= 1:
            for case_idx in self.train_case_indices:
                loader = self.train_loaders[case_idx]
                sampler = self.train_samplers.get(case_idx)
                if sampler is not None:
                    sampler.set_epoch(epoch)
                elif hasattr(loader.dataset, "set_epoch"):
                    loader.dataset.set_epoch(epoch)

                total_steps = len(loader)
                if total_steps == 0:
                    continue

                if self.global_rank == 0 and not self.wandb_enabled:
                    case_name = self.case_names[case_idx]
                    pbar = tqdm(loader, desc=f"Epoch {epoch} {case_name}")
                else:
                    pbar = loader

                self.optimizer.zero_grad()
                accum_batches = 0
                step_start_time = None
                step_samples = 0

                for batch_idx, batch in enumerate(pbar):
                    should_break = run_batch(
                        case_idx,
                        batch,
                        batch_idx,
                        total_steps,
                        pbar,
                        advance_pbar=False,
                        include_case_name=False,
                    )
                    if should_break:
                        break

                if self.stop_training:
                    break
        else:
            for case_idx in self.train_case_indices:
                loader = self.train_loaders[case_idx]
                sampler = self.train_samplers.get(case_idx)
                if sampler is not None:
                    sampler.set_epoch(epoch)
                elif hasattr(loader.dataset, "set_epoch"):
                    loader.dataset.set_epoch(epoch)

            loader_lengths = {}
            active_cases = []
            total_steps = 0
            for case_idx in self.train_case_indices:
                steps = len(self.train_loaders[case_idx])
                loader_lengths[case_idx] = steps
                if steps > 0:
                    active_cases.append(case_idx)
                    total_steps += steps

            if total_steps == 0:
                return 0.0, 0.0, {}

            iterators = {case_idx: iter(self.train_loaders[case_idx]) for case_idx in active_cases}
            if self.global_rank == 0 and not self.wandb_enabled:
                pbar = tqdm(total=total_steps, desc=f"Epoch {epoch}")
            else:
                pbar = None

            self.optimizer.zero_grad()
            accum_batches = 0
            step_start_time = None
            step_samples = 0

            def iter_case_schedule():
                steps_left = {case_idx: loader_lengths[case_idx] for case_idx in active_cases}
                while True:
                    did_yield = False
                    for case_idx in active_cases:
                        remaining = steps_left[case_idx]
                        if remaining <= 0:
                            continue
                        take = mix_every if remaining > mix_every else remaining
                        for _ in range(take):
                            yield case_idx
                        steps_left[case_idx] = remaining - take
                        did_yield = True
                    if not did_yield:
                        break

            for batch_idx, case_idx in enumerate(iter_case_schedule()):
                batch = next(iterators[case_idx])
                should_break = run_batch(
                    case_idx,
                    batch,
                    batch_idx,
                    total_steps,
                    pbar,
                    advance_pbar=True,
                    include_case_name=True,
                )
                if should_break:
                    break

            if pbar is not None:
                pbar.close()

        if num_batches == 0:
            return 0.0, 0.0, {}

        avg_loss = total_loss / num_batches
        avg_task_loss = total_task_loss / num_batches

        loss_tensor = torch.tensor([avg_loss, avg_task_loss], device=self.device)
        dist.all_reduce(loss_tensor, op=dist.ReduceOp.AVG)

        metric_sums, metric_counts = self._reduce_metrics(metric_sums, metric_counts)
        metric_avgs = self._compute_metric_avgs(metric_sums, metric_counts)

        return loss_tensor[0].item(), loss_tensor[1].item(), metric_avgs

    def validate(self):
        """Run validation across all cases with validation loaders.

        Iterates through each case's validation loader, evaluating
        constraint violations.  Results are aggregated across cases and
        reduced across all DDP ranks.

        Returns:
            tuple[float | None, float | None, dict | None]:
                ``(avg_loss, avg_task_loss, metric_avgs)``, or
                ``(None, None, None)`` when validation is disabled or no
                batches were evaluated.
        """
        if not self.val_case_indices:
            return None, None, None

        self.model.eval()

        if self.violation_eval_p <= 0.0:
            return None, None, None

        total_loss = 0.0
        total_task_loss = 0.0
        num_batches = 0
        metric_sums, metric_counts = self._init_metric_trackers(self.val_metric_names)
        timed_batches = 0

        with torch.no_grad():
            for case_idx in self.val_case_indices:
                loader = self.val_loaders[case_idx]
                if loader is None:
                    continue
                rng = self._make_violation_eval_rng(case_idx)
                try:
                    total_batches = len(loader)
                except Exception:
                    total_batches = None
                min_eval_batches = self.violation_eval_min_batches
                if total_batches is not None and total_batches > 0:
                    min_eval_batches = min(min_eval_batches, total_batches)
                eval_batches = 0

                for batch_idx, batch in enumerate(loader):
                    do_eval = self._should_eval_violations(
                        rng,
                        batch_idx,
                        total_batches,
                        eval_batches,
                        min_eval_batches,
                    )
                    if not do_eval:
                        continue
                    eval_batches += 1

                    do_timing = self._should_time_validation_batch(batch_idx, timed_batches)
                    if do_timing:
                        self._sync_for_timing()
                        timing_start = time.perf_counter()
                    batch = batch.to(self.device)
                    if do_timing:
                        self._sync_for_timing()
                        timing_after_data = time.perf_counter()

                    predictions = self.forward(batch)
                    if do_timing:
                        self._sync_for_timing()
                        timing_after_forward = time.perf_counter()
                    loss, loss_info = self.loss_managers[case_idx].compute_loss(
                        predictions,
                        batch,
                        return_info=True,
                    )
                    if do_timing:
                        self._sync_for_timing()
                        timing_after_loss = time.perf_counter()
                        self._add_metric(
                            metric_sums,
                            metric_counts,
                            "val/perf/data_ms",
                            (timing_after_data - timing_start) * 1000.0,
                        )
                        self._add_metric(
                            metric_sums,
                            metric_counts,
                            "val/perf/forward_ms",
                            (timing_after_forward - timing_after_data) * 1000.0,
                        )
                        self._add_metric(
                            metric_sums,
                            metric_counts,
                            "val/perf/loss_ms",
                            (timing_after_loss - timing_after_forward) * 1000.0,
                        )
                        self._add_metric(
                            metric_sums,
                            metric_counts,
                            "val/perf/total_ms",
                            (timing_after_loss - timing_start) * 1000.0,
                        )
                        timed_batches += 1

                    loss_value = loss.item()
                    total_loss += loss_value
                    self._add_metric(metric_sums, metric_counts, "val/loss/total", loss_value)
                    if "objective" in loss_info:
                        objective_value = self._as_float(loss_info["objective"])
                    if objective_value is not None:
                        total_task_loss += objective_value
                        self._add_metric(metric_sums, metric_counts, "val/loss/objective", objective_value)
                    for info_key, metric_name in self.val_metric_map:
                        if info_key in loss_info:
                            self._add_metric(metric_sums, metric_counts, metric_name, loss_info[info_key])
                    num_batches += 1

        if num_batches == 0:
            return None, None, None

        avg_loss = total_loss / num_batches
        avg_task_loss = total_task_loss / num_batches

        loss_tensor = torch.tensor([avg_loss, avg_task_loss], device=self.device)
        dist.all_reduce(loss_tensor, op=dist.ReduceOp.AVG)

        metric_sums, metric_counts = self._reduce_metrics(metric_sums, metric_counts)
        metric_avgs = self._compute_metric_avgs(metric_sums, metric_counts)
        eval_batches_tensor = torch.tensor(float(num_batches), device=self.device)
        if dist.is_available() and dist.is_initialized() and self.world_size > 1:
            dist.all_reduce(eval_batches_tensor, op=dist.ReduceOp.SUM)
        metric_avgs["val/perf/eval_batches"] = float(eval_batches_tensor.item())
        self._add_val_score(metric_avgs)

        return loss_tensor[0].item(), loss_tensor[1].item(), metric_avgs
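
A minimal usage sketch (not part of the library source): it assumes a YAML config at a hypothetical path whose schema provides the keys the trainers read (root, loader, split ratios, and so on), and instantiates a single-process run with the documented defaults.

import yaml

# Hypothetical config path; the schema must supply the keys the trainers
# expect, e.g. "root", "loader", "train_split", "val_split".
with open("configs/multi_case.yaml") as f:
    config = yaml.safe_load(f)

trainer = MultiCaseOPFTrainer(
    config=config,
    case_names=["pglib_opf_case14_ieee", "pglib_opf_case30_ieee"],
    group_ids=[0],
    model_type="HeteroGNN",
    loss_type="mse",
)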

train_epoch(epoch)

Execute one training epoch across all cases.

Cases are either iterated sequentially (each case exhausted before the next) or interleaved every case_mix_every_n_steps batches, depending on configuration.

Parameters:

Name Type Description Default
epoch int

Current epoch index.

required

Returns:

Type Description

tuple[float, float, dict]: (avg_loss, avg_task_loss, metric_avgs) aggregated across all cases and DDP ranks.

Source code in lumina/trainer/opf/trainer.py
def train_epoch(self, epoch):
    """Execute one training epoch across all cases.

    Cases are either iterated sequentially (each case exhausted before
    the next) or interleaved every ``case_mix_every_n_steps`` batches,
    depending on configuration.

    Args:
        epoch (int): Current epoch index.

    Returns:
        tuple[float, float, dict]: ``(avg_loss, avg_task_loss, metric_avgs)``
            aggregated across all cases and DDP ranks.
    """
    self.model.train()

    total_loss = 0.0
    total_task_loss = 0.0
    num_batches = 0
    metric_sums, metric_counts = self._init_metric_trackers(self.train_metric_names)
    tracker = self.throughput_tracker
    accum_batches = 0
    step_start_time = None
    step_samples = 0

    def run_batch(case_idx, batch, batch_idx, total_steps, pbar, advance_pbar, include_case_name):
        nonlocal accum_batches, step_start_time, step_samples, total_loss, total_task_loss, num_batches
        is_step_start = accum_batches == 0
        if is_step_start and tracker:
            tracker.maybe_start_measurement()
            if tracker.measure_active():
                tracker.accelerator_synchronize()
                step_start_time = time.perf_counter()
                step_samples = 0

        batch = batch.to(self.device)
        batch_samples = self._update_global_samples(batch)

        if tracker and tracker.measure_active():
            step_samples += tracker.get_batch_samples(batch)

        predictions = self.forward(batch)
        loss, loss_info = self.loss_managers[case_idx].compute_loss(
            predictions,
            batch,
            return_info=True,
        )

        loss_value = loss.item()
        self._add_metric(metric_sums, metric_counts, "train/loss/total", loss_value)
        if self.log_every_n_samples and self.log_every_n_samples > 0:
            self._log_wandb_step(loss_value, loss_info)
        loss = loss / self.accumulate_grad_batches
        case_name = self.case_names[case_idx] if include_case_name else None
        if self._handle_nonfinite_loss(loss, batch_idx, case_name=case_name):
            accum_batches = 0
            self._add_metric(metric_sums, metric_counts, "train/perf/nonfinite_loss_skips", 1.0)
            return
        loss.backward()

        should_step = ((batch_idx + 1) % self.accumulate_grad_batches == 0) or ((batch_idx + 1) == total_steps)

        if should_step:
            if tracker and tracker.measure_active():
                tracker.accelerator_synchronize()
            if not self._ensure_finite_gradients(batch_idx, case_name=case_name):
                accum_batches = 0
                self._add_metric(metric_sums, metric_counts, "train/perf/nonfinite_grad_skips", 1.0)
                return
            self.optimizer.step()
            if self.scheduler is not None:
                self.scheduler.step()
            self.optimizer.zero_grad()

            self.global_step += 1
            if not self.log_every_n_samples or self.log_every_n_samples <= 0:
                self._log_wandb_step(loss_value, loss_info)
            if tracker:
                step_metrics = None
                if tracker.measure_active() and step_start_time is not None:
                    tracker.accelerator_synchronize()
                    step_time = time.perf_counter() - step_start_time
                    total_samples = step_samples * self.world_size
                    samples_per_sec = total_samples / step_time if step_time > 0 else 0.0
                    step_metrics = {"throughput/samples_per_sec": samples_per_sec}
                tracker.on_step_end(step_metrics)
            accum_batches = 0
        else:
            accum_batches += 1

        total_loss += loss_value
        if "objective" in loss_info:
            objective_value = self._as_float(loss_info["objective"])
            if objective_value is not None:
                total_task_loss += objective_value
                self._add_metric(metric_sums, metric_counts, "train/loss/objective", objective_value)
        for info_key, metric_name in self.train_metric_map:
            if info_key in loss_info:
                self._add_metric(metric_sums, metric_counts, metric_name, loss_info[info_key])
        num_batches += 1

        if pbar is not None and self.global_rank == 0 and not self.wandb_enabled:
            postfix = {"loss": loss_value}
            if include_case_name:
                postfix["case"] = self.case_names[case_idx]
            if self.log_every_n_steps and self.log_every_n_steps > 0:
                if self.global_step % self.log_every_n_steps == 0:
                    pbar.set_postfix(postfix)
            else:
                pbar.set_postfix(postfix)
            if advance_pbar:
                pbar.update(1)

        if self._maybe_run_validation() and self.stop_training:
            return True
        self._maybe_save_periodic_checkpoint()
        if self._maybe_stop_by_samples():
            return True
        return False

    mix_every = self.case_mix_every_n_steps
    if mix_every <= 0 or len(self.train_case_indices) <= 1:
        for case_idx in self.train_case_indices:
            loader = self.train_loaders[case_idx]
            sampler = self.train_samplers.get(case_idx)
            if sampler is not None:
                sampler.set_epoch(epoch)
            elif hasattr(loader.dataset, "set_epoch"):
                loader.dataset.set_epoch(epoch)

            total_steps = len(loader)
            if total_steps == 0:
                continue

            if self.global_rank == 0 and not self.wandb_enabled:
                case_name = self.case_names[case_idx]
                pbar = tqdm(loader, desc=f"Epoch {epoch} {case_name}")
            else:
                pbar = loader

            self.optimizer.zero_grad()
            accum_batches = 0
            step_start_time = None
            step_samples = 0

            for batch_idx, batch in enumerate(pbar):
                should_break = run_batch(
                    case_idx,
                    batch,
                    batch_idx,
                    total_steps,
                    pbar,
                    advance_pbar=False,
                    include_case_name=False,
                )
                if should_break:
                    break

            if self.stop_training:
                break
    else:
        for case_idx in self.train_case_indices:
            loader = self.train_loaders[case_idx]
            sampler = self.train_samplers.get(case_idx)
            if sampler is not None:
                sampler.set_epoch(epoch)
            elif hasattr(loader.dataset, "set_epoch"):
                loader.dataset.set_epoch(epoch)

        loader_lengths = {}
        active_cases = []
        total_steps = 0
        for case_idx in self.train_case_indices:
            steps = len(self.train_loaders[case_idx])
            loader_lengths[case_idx] = steps
            if steps > 0:
                active_cases.append(case_idx)
                total_steps += steps

        if total_steps == 0:
            return 0.0, 0.0, {}

        iterators = {case_idx: iter(self.train_loaders[case_idx]) for case_idx in active_cases}
        if self.global_rank == 0 and not self.wandb_enabled:
            pbar = tqdm(total=total_steps, desc=f"Epoch {epoch}")
        else:
            pbar = None

        self.optimizer.zero_grad()
        accum_batches = 0
        step_start_time = None
        step_samples = 0

        def iter_case_schedule():
            steps_left = {case_idx: loader_lengths[case_idx] for case_idx in active_cases}
            while True:
                did_yield = False
                for case_idx in active_cases:
                    remaining = steps_left[case_idx]
                    if remaining <= 0:
                        continue
                    take = mix_every if remaining > mix_every else remaining
                    for _ in range(take):
                        yield case_idx
                    steps_left[case_idx] = remaining - take
                    did_yield = True
                if not did_yield:
                    break

        for batch_idx, case_idx in enumerate(iter_case_schedule()):
            batch = next(iterators[case_idx])
            should_break = run_batch(
                case_idx,
                batch,
                batch_idx,
                total_steps,
                pbar,
                advance_pbar=True,
                include_case_name=True,
            )
            if should_break:
                break

        if pbar is not None:
            pbar.close()

    if num_batches == 0:
        return 0.0, 0.0, {}

    avg_loss = total_loss / num_batches
    avg_task_loss = total_task_loss / num_batches

    loss_tensor = torch.tensor([avg_loss, avg_task_loss], device=self.device)
    dist.all_reduce(loss_tensor, op=dist.ReduceOp.AVG)

    metric_sums, metric_counts = self._reduce_metrics(metric_sums, metric_counts)
    metric_avgs = self._compute_metric_avgs(metric_sums, metric_counts)

    return loss_tensor[0].item(), loss_tensor[1].item(), metric_avgs
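
For intuition, here is a standalone sketch of the chunked round-robin schedule that iter_case_schedule implements above, run with hypothetical per-case loader lengths; it is illustration only, not trainer API.

# With mix_every = 2 and loader lengths {0: 3, 1: 5}, the schedule visits
# each case in chunks of up to two batches until every loader is exhausted.
def iter_case_schedule(loader_lengths, mix_every):
    steps_left = dict(loader_lengths)
    while True:
        did_yield = False
        for case_idx, remaining in steps_left.items():
            if remaining <= 0:
                continue
            take = min(mix_every, remaining)
            for _ in range(take):
                yield case_idx
            steps_left[case_idx] = remaining - take
            did_yield = True
        if not did_yield:
            break

print(list(iter_case_schedule({0: 3, 1: 5}, mix_every=2)))
# [0, 0, 1, 1, 0, 1, 1, 1]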

validate()

Run validation across all cases with validation loaders.

Iterates through each case's validation loader, evaluating constraint violations. Results are aggregated across cases and reduced across all DDP ranks.

Returns:

Type Description

tuple[float | None, float | None, dict | None]: (avg_loss, avg_task_loss, metric_avgs), or (None, None, None) when validation is disabled or no batches were evaluated.

Source code in lumina/trainer/opf/trainer.py
def validate(self):
    """Run validation across all cases with validation loaders.

    Iterates through each case's validation loader, evaluating
    constraint violations.  Results are aggregated across cases and
    reduced across all DDP ranks.

    Returns:
        tuple[float | None, float | None, dict | None]:
            ``(avg_loss, avg_task_loss, metric_avgs)``, or
            ``(None, None, None)`` when validation is disabled or no
            batches were evaluated.
    """
    if not self.val_case_indices:
        return None, None, None

    self.model.eval()

    if self.violation_eval_p <= 0.0:
        return None, None, None

    total_loss = 0.0
    total_task_loss = 0.0
    num_batches = 0
    metric_sums, metric_counts = self._init_metric_trackers(self.val_metric_names)
    timed_batches = 0

    with torch.no_grad():
        for case_idx in self.val_case_indices:
            loader = self.val_loaders[case_idx]
            if loader is None:
                continue
            rng = self._make_violation_eval_rng(case_idx)
            try:
                total_batches = len(loader)
            except Exception:
                total_batches = None
            min_eval_batches = self.violation_eval_min_batches
            if total_batches is not None and total_batches > 0:
                min_eval_batches = min(min_eval_batches, total_batches)
            eval_batches = 0

            for batch_idx, batch in enumerate(loader):
                do_eval = self._should_eval_violations(
                    rng,
                    batch_idx,
                    total_batches,
                    eval_batches,
                    min_eval_batches,
                )
                if not do_eval:
                    continue
                eval_batches += 1

                do_timing = self._should_time_validation_batch(batch_idx, timed_batches)
                if do_timing:
                    self._sync_for_timing()
                    timing_start = time.perf_counter()
                batch = batch.to(self.device)
                if do_timing:
                    self._sync_for_timing()
                    timing_after_data = time.perf_counter()

                predictions = self.forward(batch)
                if do_timing:
                    self._sync_for_timing()
                    timing_after_forward = time.perf_counter()
                loss, loss_info = self.loss_managers[case_idx].compute_loss(
                    predictions,
                    batch,
                    return_info=True,
                )
                if do_timing:
                    self._sync_for_timing()
                    timing_after_loss = time.perf_counter()
                    self._add_metric(
                        metric_sums,
                        metric_counts,
                        "val/perf/data_ms",
                        (timing_after_data - timing_start) * 1000.0,
                    )
                    self._add_metric(
                        metric_sums,
                        metric_counts,
                        "val/perf/forward_ms",
                        (timing_after_forward - timing_after_data) * 1000.0,
                    )
                    self._add_metric(
                        metric_sums,
                        metric_counts,
                        "val/perf/loss_ms",
                        (timing_after_loss - timing_after_forward) * 1000.0,
                    )
                    self._add_metric(
                        metric_sums,
                        metric_counts,
                        "val/perf/total_ms",
                        (timing_after_loss - timing_start) * 1000.0,
                    )
                    timed_batches += 1

                loss_value = loss.item()
                total_loss += loss_value
                self._add_metric(metric_sums, metric_counts, "val/loss/total", loss_value)
                if "objective" in loss_info:
                    objective_value = self._as_float(loss_info["objective"])
                if objective_value is not None:
                    total_task_loss += objective_value
                    self._add_metric(metric_sums, metric_counts, "val/loss/objective", objective_value)
                for info_key, metric_name in self.val_metric_map:
                    if info_key in loss_info:
                        self._add_metric(metric_sums, metric_counts, metric_name, loss_info[info_key])
                num_batches += 1

    if num_batches == 0:
        return None, None, None

    avg_loss = total_loss / num_batches
    avg_task_loss = total_task_loss / num_batches

    loss_tensor = torch.tensor([avg_loss, avg_task_loss], device=self.device)
    dist.all_reduce(loss_tensor, op=dist.ReduceOp.AVG)

    metric_sums, metric_counts = self._reduce_metrics(metric_sums, metric_counts)
    metric_avgs = self._compute_metric_avgs(metric_sums, metric_counts)
    eval_batches_tensor = torch.tensor(float(num_batches), device=self.device)
    if dist.is_available() and dist.is_initialized() and self.world_size > 1:
        dist.all_reduce(eval_batches_tensor, op=dist.ReduceOp.SUM)
    metric_avgs["val/perf/eval_batches"] = float(eval_batches_tensor.item())
    self._add_val_score(metric_avgs)

    return loss_tensor[0].item(), loss_tensor[1].item(), metric_avgs
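
_should_eval_violations and _make_violation_eval_rng belong to the base trainer and are not reproduced on this page. As an assumed sketch of the sampling policy the loop relies on: evaluate each batch with probability violation_eval_p, but force evaluation once the remaining batches are exactly what is needed to reach the per-loader minimum.

import random

# Illustrative sketch only; the real helper lives on BaseOPFTrainer and may
# differ in detail. Evaluate with probability p, but never finish a loader
# with fewer than min_batches evaluated batches.
def should_eval(rng, batch_idx, total_batches, done, min_batches, p):
    if total_batches is not None:
        remaining = total_batches - batch_idx
        if min_batches - done >= remaining:
            return True  # must take every remaining batch to hit the minimum
    return rng.random() < p

rng = random.Random(0)
done = 0
for batch_idx in range(20):
    if should_eval(rng, batch_idx, 20, done, min_batches=2, p=0.25):
        done += 1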

Utilities

parse_case_name(case_input: str) -> str

Resolve a user-provided case identifier to its full PGLib name.

Accepts short names ("case14"), numeric-only strings ("14"), or already-qualified PGLib names ("pglib_opf_case14_ieee").

Parameters:

Name Type Description Default
case_input str

Case identifier to resolve.

required

Returns:

Name Type Description
str str

Fully-qualified PGLib case name.

Raises:

Type Description
ValueError

If case_input cannot be mapped to a known case.

Source code in lumina/trainer/opf/utils.py
def parse_case_name(case_input: str) -> str:
    """Resolve a user-provided case identifier to its full PGLib name.

    Accepts short names (``"case14"``), numeric-only strings (``"14"``), or
    already-qualified PGLib names (``"pglib_opf_case14_ieee"``).

    Args:
        case_input (str): Case identifier to resolve.

    Returns:
        str: Fully-qualified PGLib case name.

    Raises:
        ValueError: If *case_input* cannot be mapped to a known case.
    """
    case_mapping = get_case_name_mapping()

    if case_input.startswith("pglib_opf_"):
        return case_input

    if case_input in case_mapping:
        return case_mapping[case_input]

    if not case_input.startswith("case"):
        case_input = "case" + case_input
        if case_input in case_mapping:
            return case_mapping[case_input]

    available_short = list(case_mapping.keys())
    available_full = list(case_mapping.values())
    raise ValueError(
        f"Invalid case name '{case_input}'. Available short names: {available_short}, "
        f"or use full names: {available_full}"
    )
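
For example (assuming case14 is present in the short-name mapping):

parse_case_name("14")                     # -> "pglib_opf_case14_ieee"
parse_case_name("case14")                 # -> "pglib_opf_case14_ieee"
parse_case_name("pglib_opf_case14_ieee")  # already qualified, returned as-is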

parse_cases_arg(cases_arg)

Expand a list of case arguments into individual case name strings.

Handles JSON-encoded lists ('["case14", "case30"]'), comma-separated entries ("case14,case30"), and plain strings.

Parameters:

Name Type Description Default
cases_arg list[str]

Raw case arguments, typically from CLI.

required

Returns:

Type Description

list[str]: Flat list of individual case name strings.

Source code in lumina/trainer/opf/utils.py
def parse_cases_arg(cases_arg):
    """Expand a list of case arguments into individual case name strings.

    Handles JSON-encoded lists (``'["case14", "case30"]'``), comma-separated
    entries (``"case14,case30"``), and plain strings.

    Args:
        cases_arg (list[str]): Raw case arguments, typically from CLI.

    Returns:
        list[str]: Flat list of individual case name strings.
    """
    expanded = []
    for entry in cases_arg:
        entry = entry.strip()
        if not entry:
            continue
        if entry.startswith("["):
            expanded.extend(json.loads(entry))
        elif "," in entry:
            expanded.extend(x.strip() for x in entry.split(",") if x.strip())
        else:
            expanded.append(entry)
    return expanded
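
All three accepted forms expand to the same flat list:

parse_cases_arg(['["case14", "case30"]'])  # JSON-encoded list
parse_cases_arg(["case14,case30"])         # comma-separated entry
parse_cases_arg(["case14", "case30"])      # plain strings
# each returns ["case14", "case30"]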

get_case_name_mapping()

Return a copy of the short-name to full PGLib case name mapping.

Returns:

Name Type Description
dict

Mapping from short names (e.g. "case14") to full PGLib names (e.g. "pglib_opf_case14_ieee").

Source code in lumina/trainer/opf/utils.py
def get_case_name_mapping():
    """Return a copy of the short-name to full PGLib case name mapping.

    Returns:
        dict: Mapping from short names (e.g. ``"case14"``) to full PGLib names
            (e.g. ``"pglib_opf_case14_ieee"``).
    """
    return dict(_CASE_NAME_MAPPING)

resolve_hetero_model_type(model_type=None, model_class_path=None, default='HeteroGNN')

Resolve a heterogeneous model type string to its canonical form.

Accepts a model_type name, a fully-qualified model_class_path, or falls back to default. Resolution is case-insensitive.

Parameters:

Name Type Description Default
model_type str

Short model type name (e.g. "HeteroGNN").

None
model_class_path str

Dotted class path (e.g. "lumina.model.opf.hetero_model.OPFHeteroGNN").

None
default str

Fallback model type when both arguments are absent.

'HeteroGNN'

Returns:

Name Type Description
str

Canonical model type string (one of HETERO_MODEL_TYPES).

Raises:

Type Description
ValueError

If the provided identifiers cannot be mapped to a supported model type.

Source code in lumina/trainer/opf/utils.py
def resolve_hetero_model_type(model_type=None, model_class_path=None, default="HeteroGNN"):
    """Resolve a heterogeneous model type string to its canonical form.

    Accepts a ``model_type`` name, a fully-qualified ``model_class_path``,
    or falls back to *default*. Resolution is case-insensitive.

    Args:
        model_type (str, optional): Short model type name (e.g. ``"HeteroGNN"``).
        model_class_path (str, optional): Dotted class path
            (e.g. ``"lumina.model.opf.hetero_model.OPFHeteroGNN"``).
        default (str): Fallback model type when both arguments are absent.

    Returns:
        str: Canonical model type string (one of ``HETERO_MODEL_TYPES``).

    Raises:
        ValueError: If the provided identifiers cannot be mapped to a
            supported model type.
    """
    if isinstance(model_class_path, str) and model_class_path.strip():
        class_name = model_class_path.rsplit(".", 1)[-1]
        normalized = _canonical_hetero_model_type(class_name)
        if normalized is not None:
            return normalized
        raise ValueError(
            f"Unsupported hetero model class path '{model_class_path}'. "
            f"Supported classes: {sorted(HETERO_MODEL_CLASSES.keys()) + ['OPFHeteroGNN']}"
        )

    if isinstance(model_type, str) and model_type.strip():
        normalized = _canonical_hetero_model_type(model_type)
        if normalized is not None:
            return normalized
        raise ValueError(
            f"Unsupported hetero model type '{model_type}'. Supported types: {sorted(HETERO_MODEL_TYPES)}"
        )

    normalized_default = _canonical_hetero_model_type(default)
    if normalized_default is not None:
        return normalized_default

    supported = ", ".join(sorted(HETERO_MODEL_TYPES))
    raise ValueError(
        f"Unable to resolve hetero model type from model_type='{model_type}' "
        f"and model_class_path='{model_class_path}'. Supported types: {supported}"
    )
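
For example (the last call assumes the canonical mapping resolves the OPFHeteroGNN class name to "HeteroGNN"):

resolve_hetero_model_type("heterognn")  # -> "HeteroGNN" (case-insensitive)
resolve_hetero_model_type()             # no arguments -> the default, "HeteroGNN"
resolve_hetero_model_type(
    model_class_path="lumina.model.opf.hetero_model.OPFHeteroGNN"
)                                       # -> "HeteroGNN" (assumed mapping)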

build_hetero_model_spec(model_type, metadata, input_channels, models_config, out_channels=2)

Build the class, keyword arguments, and config for a heterogeneous GNN.

Looks up architecture-specific hyper-parameters from models_config, falling back to the HeteroGNN section when the requested type has no dedicated entry. Sensible defaults are applied for hidden_channels, num_layers, backend, and attention heads.

Parameters:

Name Type Description Default
model_type str

Canonical model type (e.g. "HeteroGNN", "HGT").

required
metadata tuple

Graph metadata (node_types, edge_types).

required
input_channels dict

Per-node-type input feature dimensions.

required
models_config dict

Model hyper-parameter sections from the YAML config.

required
out_channels int

Per-node output dimension.

2

Returns:

Name Type Description
tuple

(model_class, model_kwargs, model_config, used_fallback) where model_class is the nn.Module subclass, model_kwargs are ready-to-pass constructor arguments, model_config is the raw config dict used, and used_fallback indicates whether the HeteroGNN config was used as a substitute.

Raises:

Type Description
ValueError

If model_type is not a supported hetero model.

Source code in lumina/trainer/opf/utils.py
def build_hetero_model_spec(
    model_type,
    metadata,
    input_channels,
    models_config,
    out_channels=2,
):
    """Build the class, keyword arguments, and config for a heterogeneous GNN.

    Looks up architecture-specific hyper-parameters from *models_config*,
    falling back to the ``HeteroGNN`` section when the requested type has no
    dedicated entry.  Sensible defaults are applied for ``hidden_channels``,
    ``num_layers``, ``backend``, and attention heads.

    Args:
        model_type (str): Canonical model type (e.g. ``"HeteroGNN"``, ``"HGT"``).
        metadata (tuple): Graph metadata ``(node_types, edge_types)``.
        input_channels (dict): Per-node-type input feature dimensions.
        models_config (dict): Model hyper-parameter sections from the YAML config.
        out_channels (int): Per-node output dimension.

    Returns:
        tuple: ``(model_class, model_kwargs, model_config, used_fallback)`` where
            *model_class* is the ``nn.Module`` subclass, *model_kwargs* are
            ready-to-pass constructor arguments, *model_config* is the raw
            config dict used, and *used_fallback* indicates whether the
            ``HeteroGNN`` config was used as a substitute.

    Raises:
        ValueError: If *model_type* is not a supported hetero model.
    """
    normalized_type = resolve_hetero_model_type(model_type=model_type, default=None)
    if normalized_type not in HETERO_MODEL_CLASSES:
        supported = ", ".join(sorted(HETERO_MODEL_CLASSES.keys()))
        raise ValueError(f"Unsupported hetero model type '{normalized_type}'. Supported types: {supported}")

    if not isinstance(models_config, dict):
        models_config = {}

    model_config = models_config.get(normalized_type)
    used_fallback = False
    if not isinstance(model_config, dict):
        fallback = models_config.get("HeteroGNN")
        if isinstance(fallback, dict):
            model_config = fallback
            used_fallback = normalized_type != "HeteroGNN"
        else:
            model_config = {}

    model_kwargs = {
        "metadata": metadata,
        "input_channels": input_channels,
    }
    model_kwargs.update(model_config)
    model_kwargs["out_channels"] = int(out_channels)

    model_kwargs.setdefault("hidden_channels", 64)
    model_kwargs.setdefault("num_layers", 3)
    model_kwargs.setdefault("backend", "sage")

    if normalized_type in {"RGAT", "HGT"}:
        model_kwargs.setdefault("num_heads", 1)
    if normalized_type == "HEAT":
        model_kwargs.setdefault("attention_heads", 1)

    return HETERO_MODEL_CLASSES[normalized_type], model_kwargs, model_config, used_fallback
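
A sketch with hypothetical node types and feature sizes; only a HeteroGNN section exists in models_config, so requesting "HGT" reuses it as the fallback:

# Hypothetical two-node-type schema; models_config mirrors the YAML
# "models" section of the training config.
metadata = (
    ["bus", "gen"],
    [("gen", "feeds", "bus"), ("bus", "connects", "bus")],
)
input_channels = {"bus": 4, "gen": 6}
models_config = {"HeteroGNN": {"hidden_channels": 128, "num_layers": 4}}

model_cls, kwargs, cfg, used_fallback = build_hetero_model_spec(
    "HGT", metadata, input_channels, models_config, out_channels=2
)
# used_fallback is True; kwargs carries hidden_channels=128 and num_layers=4
# from the fallback section, plus the defaults backend="sage" and num_heads=1.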

initialize_model(model, sample_data, device)

Perform a lazy-initialization forward pass on model.

Moves the model and a sample data point to device, then runs a torch.no_grad() forward pass so that any lazily-initialized parameters are materialized.

Parameters:

Name Type Description Default
model Module

Model to initialize.

required
sample_data

A single graph sample (hetero or homo) from the dataset.

required
device device

Target device.

required

Returns:

Type Description

torch.nn.Module: The initialized model (same object, moved to device).

Source code in lumina/trainer/opf/utils.py
def initialize_model(model, sample_data, device):
    """Perform a lazy-initialization forward pass on *model*.

    Moves the model and a sample data point to *device*, then runs a
    ``torch.no_grad()`` forward pass so that any lazily-initialized
    parameters are materialized.

    Args:
        model (torch.nn.Module): Model to initialize.
        sample_data: A single graph sample (hetero or homo) from the dataset.
        device (torch.device): Target device.

    Returns:
        torch.nn.Module: The initialized model (same object, moved to *device*).
    """
    if _is_main_process():
        print("Initializing model parameters...")

    model = model.to(device)
    sample_data = sample_data.to(device)

    model.eval()
    with torch.no_grad():
        try:
            if isinstance(sample_data, (dict, torch.nn.ParameterDict)) or hasattr(sample_data, "x_dict"):
                x_dict = {k: v.float() for k, v in sample_data.x_dict.items()}
                _ = model(x_dict, sample_data.edge_index_dict)
            else:
                if hasattr(sample_data, "x"):
                    sample_data.x = sample_data.x.float()
                _ = model(sample_data)
            if _is_main_process():
                print("Model parameters initialized successfully!")
        except Exception as exc:
            if _is_main_process():
                print(f"Warning: Model initialization failed: {exc}")
                print("Model may still work during training...")

    return model
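
Typical use, assuming a map-style dataset whose first element is a representative graph sample:

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = initialize_model(model, dataset[0], device)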

apply_nested(target_dict, dotted_key, value)

Set a value in a nested dictionary using a dot-separated key path.

Intermediate dictionaries are created automatically when they do not already exist.

Parameters:

Name Type Description Default
target_dict dict

Dictionary to update in place.

required
dotted_key str

Dot-separated key path (e.g. "training.max_epochs").

required
value

Value to assign at the leaf key.

required
Source code in lumina/trainer/opf/utils.py
def apply_nested(target_dict, dotted_key, value):
    """Set a value in a nested dictionary using a dot-separated key path.

    Intermediate dictionaries are created automatically when they do not
    already exist.

    Args:
        target_dict (dict): Dictionary to update in place.
        dotted_key (str): Dot-separated key path (e.g. ``"training.max_epochs"``).
        value: Value to assign at the leaf key.
    """
    if not isinstance(target_dict, dict):
        return
    if not isinstance(dotted_key, str):
        return

    keys = dotted_key.split(".")
    current = target_dict
    for key in keys[:-1]:
        if key not in current or not isinstance(current[key], dict):
            current[key] = {}
        current = current[key]
    current[keys[-1]] = value
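
For example, overrides from CLI flags can be applied directly to a parsed config:

config = {"training": {"max_epochs": 5}}
apply_nested(config, "training.max_epochs", 10)
apply_nested(config, "loader.batch_size", 32)  # creates the "loader" dict
# config == {"training": {"max_epochs": 10}, "loader": {"batch_size": 32}}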