API Reference

Trainer

class accmt.Trainer(hps_config: str | dict | HyperParameters, model_path: str, track_name: str | None = None, enable_checkpointing: bool = True, multiple_checkpoints: bool = False, max_checkpoints: int | None = None, resume: bool | int | None = None, disable_model_saving: bool = False, patience: int | dict[str, Any] | None = None, evaluate_every_n_steps: int | None = None, checkpoint_every: str | None = 'epoch', logging_dir: str = 'logs', log_with: str | None = None, log_every: int | None = -1, grad_accumulation_steps: int | None = None, gradient_checkpointing: bool = False, gradient_checkpointing_kwargs: dict[str, Any] | None = None, clip_grad: float | None = 1.0, set_to_none: bool = True, shuffle_train: bool = True, sampler: Any | list | None = None, collate_fn: Callable | None = None, collate_fn_train: Callable | None = None, collate_fn_val: Callable | None = None, max_shard_size: str = '10GB', safe_serialization: bool = False, compile: bool = False, compile_kwargs: dict[str, Any] | None = None, safe_mode: bool = True, train_loss_metric_name: str = 'train_loss', val_loss_metric_name: str = 'val_loss', dataloader_pin_memory: bool = True, dataloader_num_workers: int | None = None, dataloader_drop_last: bool = False, eval_when_finish: bool = True, eval_when_start: bool = False, monitor: Monitor | None = None, metrics: Metric | list[Metric] | dict[Any, Metric | list[Metric]] | None = None, cleanup_cache_every_n_steps: int | None = None, callback: Callback | list[Callback] | None = None, additional_tracker_config: dict[str, Any] | None = None, batch_device_placement: bool = True, prepare_batch: bool = True, safe_steps: bool = True, destroy_after_training: bool = True, enable_prepare_logging: bool = False, **kwargs: Any | None)[source]

Class to implement full training process.

__init__(hps_config: str | dict | HyperParameters, model_path: str, track_name: str | None = None, enable_checkpointing: bool = True, multiple_checkpoints: bool = False, max_checkpoints: int | None = None, resume: bool | int | None = None, disable_model_saving: bool = False, patience: int | dict[str, Any] | None = None, evaluate_every_n_steps: int | None = None, checkpoint_every: str | None = 'epoch', logging_dir: str = 'logs', log_with: str | None = None, log_every: int | None = -1, grad_accumulation_steps: int | None = None, gradient_checkpointing: bool = False, gradient_checkpointing_kwargs: dict[str, Any] | None = None, clip_grad: float | None = 1.0, set_to_none: bool = True, shuffle_train: bool = True, sampler: Any | list | None = None, collate_fn: Callable | None = None, collate_fn_train: Callable | None = None, collate_fn_val: Callable | None = None, max_shard_size: str = '10GB', safe_serialization: bool = False, compile: bool = False, compile_kwargs: dict[str, Any] | None = None, safe_mode: bool = True, train_loss_metric_name: str = 'train_loss', val_loss_metric_name: str = 'val_loss', dataloader_pin_memory: bool = True, dataloader_num_workers: int | None = None, dataloader_drop_last: bool = False, eval_when_finish: bool = True, eval_when_start: bool = False, monitor: Monitor | None = None, metrics: Metric | list[Metric] | dict[Any, Metric | list[Metric]] | None = None, cleanup_cache_every_n_steps: int | None = None, callback: Callback | list[Callback] | None = None, additional_tracker_config: dict[str, Any] | None = None, batch_device_placement: bool = True, prepare_batch: bool = True, safe_steps: bool = True, destroy_after_training: bool = True, enable_prepare_logging: bool = False, **kwargs: Any | None)[source]

Trainer constructor to set configuration.

Parameters:
  • hps_config (str, dict, or HyperParameters) – YAML hyperparameters file path, dictionary or HyperParameters.

  • model_path (str) – Path to save model.

  • track_name (str, optional, defaults to None) – Track name for trackers. If set to None (default), the track name will be the model’s folder name.

  • enable_checkpointing (bool, optional, defaults to True) – Enable checkpointing.

  • multiple_checkpoints (bool, optional, defaults to False) – Enable multiple checkpoints.

  • max_checkpoints (int, optional, defaults to None) – Maximum number of checkpoints to keep. If set to None, all checkpoints will be kept.

  • resume (bool or int, optional, defaults to None) – Whether to resume from checkpoint. Default option is None, which means resuming from checkpoint will be handled automatically, whether the checkpoint directory exists or not. If set to True, the latest checkpoint will be loaded. If set to an integer, the checkpoint will be loaded from the given index (if multiple_checkpoints is True). If set to -1, the latest checkpoint will be loaded (if multiple_checkpoints is True).

  • disable_model_saving (bool, optional, defaults to False) – Disable any model saving registered (by default, “best_valid_loss” is registered, or “best_train_loss” if there are no evaluations to run).

  • patience (int or dict, optional, defaults to None) – Set up a patience parameter for model savings. If set, every model saving checks whether its metric improved on the previous value. If the metric has not improved over N model savings (the patience), the training process will stop. Patience can also be set per model saving with a dictionary.

  • evaluate_every_n_steps (int, optional, defaults to None) – Evaluate model in validation dataset (if implemented) every N steps. If this is set to None (default option), evaluation will happen at the end of every epoch.

  • checkpoint_every (str, optional, defaults to epoch) –

    Checkpoint every N epochs, steps or evaluations. Requires a number and a unit in a string. The following examples are valid:

    • “epoch”, “ep”, “1epoch”, “1ep”, “1 epoch”, “1 ep”: 1 Epoch

    • “step”, “st”, “1step”, “1st”, “1 step”, “1 st”: 1 Step

    • “evaluation”, “eval”, “1evaluation”, “1eval”, “1 evaluation”, “1 eval”: 1 Evaluation

    (a character s at the end of the string is also valid)

    If set to None, checkpointing will be disabled.

  • logging_dir (str, optional, defaults to logs) – Path where to save logs to show progress. It can be an IP address (local or remote), HTTP or HTTPS link, or simply a directory.

  • log_with (str, optional, defaults to None) –

    Logger to log metrics. It can be one of the following:
    • mlflow

    NOTE: MLFlow is the only one supported right now. Other trackers are not currently available.

  • log_every (int, optional, defaults to -1) – Log train loss every N steps. If set to -1, training loss will be logged at the end of every epoch (or if gradient accumulation is enabled, the value will be the length of the training dataloader divided by the number of accumulation steps). If gradient accumulation is enabled and the value is not -1, this value will be multiplied by the number of accumulation steps.

  • grad_accumulation_steps (int, optional, defaults to None) – Accumulate gradients for N steps. Useful for training large models and simulating large batches when memory is limited. If set to None or 1, no accumulation will be performed.

  • gradient_checkpointing (bool, optional, defaults to False) – Use gradient checkpointing. It requires a gradient_checkpointing_enable method in the model (models from HuggingFace’s transformers library have this method already implemented) with a single argument gradient_checkpointing_kwargs (can be a dictionary or None).

  • gradient_checkpointing_kwargs (dict, optional, defaults to None) – Keyword arguments for gradient_checkpointing_enable method.

  • clip_grad (float, optional, defaults to 1.0) – Performs gradient clipping in between backpropagation and optimizer’s step function.

  • set_to_none (bool, optional, defaults to True) – From PyTorch documentation: “instead of setting to zero, set the grads to None. This will in general have lower memory footprint, and can modestly improve performance.” Some optimizers have a different behaviour if the gradient is 0 or None. See PyTorch docs for more information: https://pytorch.org/docs/stable/generated/torch.optim.Optimizer.zero_grad.html

  • shuffle_train (bool, optional, defaults to True) – Whether to shuffle train DataLoader.

  • sampler (list or Any, optional, defaults to None) – Sampler (or list of samplers) for train DataLoader.

  • collate_fn (Callable, optional, defaults to None) – Collate function to be implemented in both train and validation dataloaders.

  • collate_fn_train (Callable, optional, defaults to None) – Collate function to be implemented in train dataloader. Cannot be implemented if collate_fn was already declared.

  • collate_fn_val (Callable, optional, defaults to None) – Collate function to be implemented in validation dataloader. Cannot be implemented if collate_fn was already declared.

  • max_shard_size (str, optional, defaults to 10GB) – Max model shard size to be used.

  • safe_serialization (bool, optional, defaults to False) – Whether to save model using safe tensors or the traditional PyTorch way. If True, some tensors will be lost.

  • compile (bool, optional, defaults to False) – Whether to call torch.compile on model (and teacher, if implemented).

  • compile_kwargs (dict, optional, defaults to None) – torch.compile kwargs for additional customization.

  • safe_mode (bool, optional, defaults to True) –

    Run forward passes of the model in safe mode. This means that the forward pass of the model will run through the corresponding wrapper (DDP, FSDP or DeepSpeedEngine). If not running in safe mode, the forward pass will skip the wrapper and run directly on the module (instance of nn.Module). Running with safe mode disabled will slightly improve throughput, although gradient consistency and mixed precision could be affected because skipping the wrapper’s forward pass might skip internal parallel functionality.

    NOTE: This parameter has no effect when running with FSDP since forward passes are already done through this wrapper.

  • train_loss_metric_name (str, optional, defaults to train_loss) – Metric name for train loss in logs.

  • val_loss_metric_name (str, optional, defaults to val_loss) – Metric name for validation loss in logs.

  • dataloader_pin_memory (bool, optional, defaults to True) – Enables pin memory option in DataLoader (only if GPU is enabled).

  • dataloader_num_workers (int, optional, defaults to None) – Number of processes for DataLoader. This defaults to None, meaning the number of workers will be equal to the number of processes set for training.

  • dataloader_drop_last (bool, optional, defaults to False) – Whether to drop last batch on DataLoader or not.

  • eval_when_finish (bool, optional, defaults to True) – At the end of training, evaluate model on validation dataset (if available). This option is only valid when evaluate_every_n_steps is not None.

  • eval_when_start (bool, optional, defaults to False) – Start training with evaluation (if available).

  • monitor (Monitor or dict, optional, defaults to None) –

    Monitor arguments to keep track of variables during training. If not specified, ‘train_loss’ and ‘validation_loss’ will be set to True by default.

    NOTE: Learning rate, GPU and CPU monitoring will only be reported during training, not evaluation. Also, GPU and CPU monitoring will only be reported on main process (index 0).

  • metrics (Metric, list or dict, optional, defaults to None) – List of additional metrics of type ‘Metric’ to track. When doing multiple evaluations, this should be a dictionary of metrics (or list of metrics), where each key corresponds to the dataset to evaluate (specified in val_dataset in fit function) and the value corresponds to a Metric or list of metrics. If metrics are given as only Metric or list of metrics, these metrics will apply for all evaluations. If you want specific metrics for specific evaluations, consider dividing your metrics per validation dataset in a dictionary.

  • cleanup_cache_every_n_steps (int, optional, defaults to None) –

    Cleanup CPU and CUDA caches every N steps. Default is no cleanup.

    NOTE: The cache is also cleaned up on every epoch and evaluation call.

  • callback (Callback or list, optional, defaults to None) – Callback or callbacks to implement.

  • additional_tracker_config (dict, optional, defaults to None) – Additional configuration specification for tracker (e.g. hyper-parameters).

  • batch_device_placement (bool, optional, defaults to True) – Move batches to correct device automatically. If False, batches will be in CPU.

  • prepare_batch (bool, optional, defaults to True) – Prepares a batch dynamically when using Mixed Precision. When using DeepSpeed, the floating point tensors need to be scaled down to be able to do calculations with the model. If not using DeepSpeed, this argument has no effect.

  • safe_steps (bool, optional, defaults to True) – Run safe training and validation steps to avoid OOMs (Out Of Memory errors) and retry steps. If a retry does not solve the problem, a list of users using GPUs will pop up and the OOM error will raise.

  • destroy_after_training (bool, optional, defaults to True) – Destroy the process group after training. Set to False if you’re running multiple trainings in the same script.

  • enable_prepare_logging (bool, optional, defaults to False) – Enable internal model preparation logging. When using DeepSpeed, there are many messages that appear in the terminal that can be annoying.

  • kwargs (Any, optional) – Extra arguments for specific init function in Tracker, e.g. run_name, tags, etc.
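
The string format accepted by checkpoint_every can be illustrated with a small parser. This is only a sketch of the documented format: parse_checkpoint_every is a hypothetical helper written for this page, not a function exposed by accmt, and the library's own parsing may differ in details such as whitespace handling.

```python
import re

# Hypothetical helper (not part of accmt): mirrors the documented string format.
_UNITS = {
    "epoch": "epoch", "ep": "epoch",
    "step": "step", "st": "step",
    "evaluation": "evaluation", "eval": "evaluation",
}
# Optional number, optional space, a unit, and an optional trailing "s".
_PATTERN = re.compile(r"(\d+)?\s*(epoch|ep|step|st|evaluation|eval)s?")


def parse_checkpoint_every(value):
    """Return (count, unit), or None when checkpointing is disabled."""
    if value is None:
        return None
    match = _PATTERN.fullmatch(value.strip().lower())
    if match is None:
        raise ValueError(f"invalid checkpoint_every value: {value!r}")
    count = int(match.group(1)) if match.group(1) else 1  # a missing number means 1
    return count, _UNITS[match.group(2)]


print(parse_checkpoint_every("epoch"))
print(parse_checkpoint_every("500 steps"))
print(parse_checkpoint_every("2eval"))
```

Under these assumptions, "epoch" and "1 ep" both resolve to one epoch, and None is passed through to signal that checkpointing is disabled.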

fit(module: AcceleratorModule | str | tuple[str, str] | tuple[str, Any], train_dataset: Dataset | None = None, val_dataset: Dataset | list[Dataset] | dict[str, Dataset] | None = None, **kwargs: Any)[source]

Function to train a given AcceleratorModule.

Parameters:
  • module (AcceleratorModule, str or tuple) – AcceleratorModule class containing the training logic. This can also be a string specifying a HuggingFace model, or a tuple of type (model, type), where ‘model’ is a string for the HuggingFace model, and ‘type’ is a string or class (from transformers library) for the model type.

  • train_dataset (torch.utils.data.Dataset, optional, defaults to None) – Dataset class from PyTorch containing the train dataset logic. If not provided, then get_train_dataloader from module will be used to get the train DataLoader.

  • val_dataset (torch.utils.data.Dataset, list or dict, optional, defaults to None) –

    Dataset class from PyTorch containing the validation dataset logic. This can also be a list or a dictionary of Dataset, in that case, multiple evaluations will run following the logic of validation_step and specified metrics. Metric names reported for a multiple evaluation setting will add a ‘_’ followed by a key related to the dataset (e.g. ‘accuracy_1’ or ‘accuracy_another_dataset’).

    If this dataset is not specified, then the validation logic of AcceleratorModule (if specified) will be skipped.

  • kwargs (Any) – Keyword arguments for from_pretrained function for model initialization.

log_artifact(path: str)[source]

Logs an artifact to the current run.

Parameters:

path (str) – Path to the file to be logged as an artifact.

log_artifacts(path: str)[source]

Logs multiple artifacts from a directory to the current run.

Parameters:

path (str) – Path to the directory to be logged as an artifact.

register_model_saving(model_saving: str, saving_below: float | None = None, saving_above: float | None = None)[source]

Register a type of model saving.

Parameters:
  • model_saving (str) – Type of model saving. It can be “best_valid_loss” (default), “best_train_loss” or in format of “best_{METRIC}”. NOTE: “best_” is optional. Also, all metrics should relate directly to metrics and validation datasets. This can also be in the form of “best_{METRIC}@{DATASET}” (metric at a specific dataset), “best_{METRIC}@{DATASET1}@{DATASET2}” (metric at dataset1 and dataset2), “best_{METRIC1}@{DATASET1}/{METRIC2}@{DATASET2}” (best metric1 at dataset1 and best metric2 at dataset2), “best_{METRIC1}/{METRIC2}@{DATASET2}” (best metric1 between all datasets containing this metric and best metric2 at dataset2 only), etc.

  • saving_below (float, optional, defaults to None) – Register this model saving to only be saved whenever its values are lower than this.

  • saving_above (float, optional, defaults to None) – Register this model saving to only be saved whenever its values are higher than this.
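
To make the model_saving naming scheme and the two thresholds more concrete, the sketch below splits a saving string into metrics and datasets and applies saving_below and saving_above. Both functions are hypothetical helpers written for illustration; they are not part of accmt, and the strict comparisons are an assumption.

```python
# Hypothetical helpers (not accmt functions): they mirror the documented naming scheme.

def parse_model_saving(spec):
    """Split a model saving string into (metric, datasets) pairs."""
    if spec.startswith("best_"):  # the "best_" prefix is optional
        spec = spec[len("best_"):]
    targets = []
    for part in spec.split("/"):  # "/" separates independent metrics
        metric, *datasets = part.split("@")  # "@" attaches datasets to a metric
        targets.append((metric, datasets))  # empty list: no dataset restriction
    return targets


def passes_thresholds(value, saving_below=None, saving_above=None):
    """Assumed rule: save only when the value is inside the requested bounds."""
    if saving_below is not None and not value < saving_below:
        return False
    if saving_above is not None and not value > saving_above:
        return False
    return True


print(parse_model_saving("best_accuracy@ds1/f1@ds2"))
print(passes_thresholds(0.42, saving_below=0.5))
```

A metric with an empty dataset list corresponds to the documented case of a metric compared across all datasets that contain it.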

AcceleratorModule

class accmt.AcceleratorModule[source]

Super class to define training and validation logic without the need to write a training loop.

The constructor of this class must implement self.model, specifying the model from torch.nn.Module. self.teacher is also a reserved property for teacher-student approaches.

__call__(*args: Any, **kwargs: Any)[source]

Call self as a function.

__len__()[source]
forward(*args: Any, **kwargs: Any) Tensor[source]

Defines the flow of data.

freeze(module: Module)[source]

Freeze all parameters inside a module.

Parameters:

module (nn.Module) – Module where all parameters will have requires_grad set to False.

get_optimizer() Optimizer[source]

Defines a custom PyTorch optimizer logic here.

get_train_dataloader(dataset: Dataset) DataLoader[source]

Defines a custom PyTorch DataLoader class for training.

get_validation_dataloader(dataset: Dataset) DataLoader[source]

Defines a custom PyTorch DataLoader class for validation.

log(values: dict[str, Tensor | float | int], step: int, reduction: Literal['sum', 'mean'] = 'mean')[source]

Log metrics to the tracker every N steps (defined in Trainer). If you want to apply any other logic, consider using self.tracker.log directly. This function will reduce tensors across all processes and only the main process will log the metrics.

Parameters:
  • values (dict) – Dictionary of metrics to log. If values are tensors, they will be reduced across all processes. If values are not tensors, the ones from the main process will be logged.

  • step (int) – Step number to log the metrics. Can access self.state.global_step to log the current step, self.state.train_step or self.state.val_step.

  • reduction (str, optional, defaults to mean) – Reduction method to apply to tensors. Available options are sum and mean. Only applicable if values are tensors.
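
The reduction behaviour described for log can be pictured with plain numbers standing in for tensors. This is a simplified sketch: reduce_logged_values and its tensor_keys argument are hypothetical, and real runs would reduce actual tensors across devices rather than iterate over a list of dictionaries.

```python
# Stand-in for cross-process reduction (hypothetical helper, not part of accmt).

def reduce_logged_values(per_process_values, tensor_keys, reduction="mean"):
    """per_process_values is a list of dicts; index 0 plays the main process."""
    if reduction not in ("sum", "mean"):
        raise ValueError(f"unsupported reduction: {reduction!r}")
    reduced = {}
    for key, main_value in per_process_values[0].items():
        if key in tensor_keys:
            # "Tensor" values are combined across every process.
            total = sum(values[key] for values in per_process_values)
            reduced[key] = total if reduction == "sum" else total / len(per_process_values)
        else:
            # Non-tensor values are taken from the main process only.
            reduced[key] = main_value
    return reduced


values = [{"loss": 1.0, "lr": 0.1}, {"loss": 3.0, "lr": 0.2}]
print(reduce_logged_values(values, tensor_keys={"loss"}))
print(reduce_logged_values(values, tensor_keys={"loss"}, reduction="sum"))
```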

pad(tensor: Tensor | list[Tensor] | tuple[Tensor, ...], value: int | float, padding: Literal['max_length', 'longest'] | None = None, max_length: int | None = None, side: Literal['left', 'right'] = 'right', op: str | Callable | None = None) Tensor | list[Tensor] | tuple[Tensor, ...][source]

Pad last dimension of tensors to a given ‘max_length’ or to the longest tensor in an iterable (tuple or list).

Parameters:
  • tensor (torch.Tensor, list or tuple) – Single tensor or an iterable of tensors to be padded.

  • value (int or float) – Constant value to be added when padding.

  • padding (str, optional, defaults to None) – Padding strategy to apply. longest means that all tensors in an iterable will be padded to the longest tensor, and max_length will pad all tensors to a given max_length. NOTE: A single tensor can only be padded to max_length. If padding is not specified, its value will default to longest for iterables and max_length for single tensors.

  • max_length (int, optional, defaults to None) – Max length for tensors to calculate remaining padding amount. This applies only when padding is set to max_length or tensor is a single tensor.

  • side (str, optional, defaults to right) – Padding side. Available options are right and left.

  • op (str or Callable, optional, defaults to None) – PyTorch operation to apply after tensors are padded. Options are stack, cat or a function. Only applicable to an iterable of tensors.

Returns:

Padded tensors.

Return type:

(torch.Tensor, list or tuple)
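
The two padding strategies and the side option can be shown on plain Python lists. The sketch below is an analogue only: pad_sequences is a hypothetical function that works on lists instead of tensors, it does not cover op, and leaving over-long sequences untouched is an assumption rather than documented behaviour.

```python
# Plain-list analogue of the padding strategies (hypothetical, not the accmt method).

def pad_sequences(sequences, value, padding="longest", max_length=None, side="right"):
    """Pad every sequence in a list to a common target length."""
    if padding == "longest":
        target = max(len(seq) for seq in sequences)
    elif padding == "max_length":
        if max_length is None:
            raise ValueError("max_length is required when padding='max_length'")
        target = max_length
    else:
        raise ValueError(f"unknown padding strategy: {padding!r}")
    if side not in ("left", "right"):
        raise ValueError(f"unknown side: {side!r}")

    padded = []
    for seq in sequences:
        # Sequences already at or beyond the target receive no padding.
        fill = [value] * max(target - len(seq), 0)
        padded.append(fill + list(seq) if side == "left" else list(seq) + fill)
    return padded


print(pad_sequences([[1, 2, 3], [4]], value=0))
print(pad_sequences([[1, 2], [3]], value=0, padding="max_length", max_length=4, side="left"))
```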

training_step(batch: Any) Tensor[source]

Defines the training logic. Must return a loss tensor (scalar).

unfreeze(module: Module)[source]

Unfreeze all parameters inside a module.

Parameters:

module (nn.Module) – Module where all parameters will have requires_grad set to True.

validation_step(key: str, batch: Any) dict | Tensor[source]

Defines the validation logic. Must return a dictionary containing each metric with predictions and targets, and also the loss value in the dictionary.

Example

```
# format is ==> "metric": (predictions, targets, ...)
return {
    "loss": validation_loss_tensor,  # (scalar tensor)
    # with additional metrics:
    "accuracy": (accuracy_predictions, accuracy_targets),
    "bleu": (bleu_predictions, bleu_targets),
}
```

ExtendedAcceleratorModule

class accmt.ExtendedAcceleratorModule[source]

Extended module from AcceleratorModule to enhance training_step function. This means that the backpropagation part must be done manually.

Example

```
class Module(ExtendedAcceleratorModule):
    # other logic remains the same

    def training_step(self, batch):
        loss = ...
        self.backward(loss)
        self.step_optimizer()
        self.step_scheduler()

        return loss  # loss will only be used to log metrics.
```

NOTE: The grad_accumulation_steps argument of Trainer has no effect with this class. If you want to accumulate gradients over several steps before stepping the optimizer, you may want to make use of self.status_dict["epoch_step"].
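
One way to handle accumulation manually is to step the optimizer only every N batches. The sketch below uses a stand-in class so it runs without accmt: ManualAccumulationSketch and its counters are hypothetical, the method names mirror the ones documented for ExtendedAcceleratorModule, and it assumes that status_dict["epoch_step"] is a zero-based batch index maintained by the trainer.

```python
class ManualAccumulationSketch:
    """Stand-in that mimics the documented hooks; it does not import accmt."""

    def __init__(self, accumulation_steps):
        self.accumulation_steps = accumulation_steps
        self.status_dict = {"epoch_step": 0}  # assumed zero-based batch index
        self.backward_calls = 0
        self.optimizer_steps = 0

    def backward(self, loss):
        self.backward_calls += 1  # a real module would backpropagate here

    def step_optimizer(self):
        self.optimizer_steps += 1

    def step_scheduler(self):
        pass

    def zero_grad(self):
        pass

    def training_step(self, batch):
        # Scale the loss so the accumulated gradients form an average.
        loss = batch / self.accumulation_steps
        self.backward(loss)
        # Step only once every `accumulation_steps` batches.
        if (self.status_dict["epoch_step"] + 1) % self.accumulation_steps == 0:
            self.step_optimizer()
            self.step_scheduler()
            self.zero_grad()
        return loss


module = ManualAccumulationSketch(accumulation_steps=4)
for index, batch_loss in enumerate([1.0] * 8):
    module.status_dict["epoch_step"] = index  # the trainer would maintain this
    module.training_step(batch_loss)
print(module.backward_calls, module.optimizer_steps)
```

With eight batches and four accumulation steps, backward runs on every batch while the optimizer steps twice.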

backward(loss: Tensor, **kwargs)[source]

Performs backward operation.

Parameters:
  • loss (torch.Tensor) – Scalar loss tensor to backward.

  • kwargs (Any) – Extra arguments to be passed to ‘accelerator.backward’ function.

step()[source]

Step optimizer and scheduler (in that order). If there is no scheduler, it will be ignored.

step_optimizer()[source]
step_scheduler()[source]
zero_grad(set_to_none: bool = True)[source]

Call optimizer’s ‘zero_grad’ operation to reset gradients.

Parameters:

set_to_none (bool, optional, defaults to True) – Set gradients to None instead of 0.

States

class accmt.states.TrainingState(global_step: int = 0, train_step: int = 0, val_step: int = 0, epoch: int = 0, is_end_of_epoch: bool = False, is_last_training_batch: bool = False, is_last_validation_batch: bool = False, is_last_epoch: bool = False, evaluations_done: int = 0, additional_metrics: dict[str, dict[str, Any]] = <factory>, patience_left: dict[str, int] = <factory>, best_train_loss: float = inf, finished: bool = False, num_checkpoints_made: int = 0)[source]

General training state.

Parameters:
  • global_step (int) – Global step index. This is incremented every time a train step is done.

  • train_step (int) – Training step index inside a training loop (can be considered as batch index).

  • val_step (int) – Validation step index inside an evaluation loop (can be considered as batch index).

  • epoch (int) – Epoch index.

  • is_end_of_epoch (bool) – Flag to check if current state is at the end of an epoch.

  • is_last_training_batch (bool) – Flag to check if current state is processing the last training batch.

  • is_last_validation_batch (bool) – Flag to check if current state is processing the last validation batch.

  • is_last_epoch (bool) – Flag to check if current state is processing the last epoch.

  • evaluations_done (int) – Number of evaluations done.

  • additional_metrics (dict) – Additional metrics (e.g. accuracy, bleu, f1, etc).

  • model_savings (dict) – Best model saving values.

  • patience_left (dict) – Patience left per model saving (in case it’s implemented, otherwise values are set to -1).

  • best_train_loss (float) – Best training loss achieved.

  • best_valid_loss (float) – Best validation loss achieved.

  • finished (bool, optional, defaults to False) – Flag to identify if the process has already finished.

  • num_checkpoints_made (int, optional, defaults to 0) – Number of checkpoints made.
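
The patience_left counter can be read as a simple early-stopping rule. The following sketch is an assumption about how such a counter might evolve, not the library's implementation: update_patience is a hypothetical function, and resetting the counter on every improvement is a guess.

```python
# Hypothetical early-stopping rule; accmt's own bookkeeping may differ.

def update_patience(patience_left, best_value, new_value, patience, greater_is_better=False):
    """Return (patience_left, best_value, should_stop) after one model saving check."""
    improved = new_value > best_value if greater_is_better else new_value < best_value
    if improved:
        # Assumption: an improvement restores the full patience budget.
        return patience, new_value, False
    patience_left -= 1
    return patience_left, best_value, patience_left <= 0


patience = 2
patience_left, best, stop = patience, float("inf"), False
history = []
for loss in [0.9, 0.8, 0.85, 0.82]:
    patience_left, best, stop = update_patience(patience_left, best, loss, patience)
    history.append((patience_left, stop))
print(history)
```

In this run the loss stops improving after 0.8, so the counter drops on the next two checks and the stop flag is raised on the last one.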

additional_metrics: dict[str, dict[str, Any]]
best_train_loss: float = inf
epoch: int = 0
evaluations_done: int = 0
finished: bool = False
global_step: int = 0
is_end_of_epoch: bool = False
is_last_epoch: bool = False
is_last_training_batch: bool = False
is_last_validation_batch: bool = False
num_checkpoints_made: int = 0
patience_left: dict[str, int]
train_step: int = 0
val_step: int = 0

Callbacks

class accmt.callbacks.Callback(module: AcceleratorModule = None, state: TrainingState = None)[source]

Callback module containing different callback functions for different stages of the training process.

NOTE: Every callback function will run on every process. If you want your callback functions to only run on a single process, make sure to import accmt.decorators for different function decorators.

Variables:
  • module (AcceleratorModule) – Training module.

  • trainer (Trainer) – Defined Trainer class.

  • state (TrainingState) – Reference to TrainingState class.

on_fit_start(*optional*)[source]

Callback when training process starts.

on_fit_end(*optional*)[source]

Callback when training process ends.

on_before_backward(*optional*)[source]

Callback before engine’s backward.

on_after_backward(*optional*)[source]

Callback after engine’s backward.

on_before_optimizer_step(*optional*)[source]

Callback before optimizer steps.

on_after_optimizer_step(*optional*)[source]

Callback after optimizer steps.

on_before_scheduler_step(*optional*)[source]

Callback before scheduler steps.

on_after_scheduler_step(*optional*)[source]

Callback after scheduler steps.

on_before_zero_grad(*optional*)[source]

Callback before optimizer resets gradients.

on_after_zero_grad(*optional*)[source]

Callback after optimizer resets gradients.

on_runtime_error(*optional*)[source]

Callback when process raises a RuntimeError exception.

on_cuda_out_of_memory(*optional*)[source]

Callback when process raises a RuntimeError exception with CUDA Out Of Memory.

on_keyboard_interrupt(*optional*)[source]

Callback when process raises a KeyboardInterrupt exception.

on_exception(*optional*)[source]

Callback when process raises any Exception other than RuntimeError and KeyboardInterrupt.

on_resume(*optional*)[source]

Callback when resuming training process.

on_save_checkpoint(*optional*)[source]

Callback when saving checkpoint.

on_before_training_step(*optional*)[source]

Callback before training_step function.

on_after_training_step(*optional*)[source]

Callback after training_step function.

on_before_validation_step(*optional*)[source]

Callback before validation_step function.

on_after_validation_step(*optional*)[source]

Callback after validation_step function.

on_epoch_start(*optional*)[source]

Callback when an epoch starts.

on_epoch_end(*optional*)[source]

Callback when an epoch ends.

on_evaluation_start(*optional*)[source]

Callback when evaluation starts.

on_evaluation_end(*optional*)[source]

Callback when evaluation ends.

module: AcceleratorModule = None
on_after_backward()[source]

Callback after engine’s backward.

on_after_optimizer_step(optimizer: Optimizer)[source]

Callback after optimizer steps.

Parameters:

optimizer (Optimizer) – Wrapped optimizer.

on_after_scheduler_step(scheduler: LRScheduler)[source]

Callback after scheduler steps.

Parameters:

scheduler (LRScheduler) – Wrapped scheduler.

on_after_training_step()[source]

Callback after training_step function.

on_after_validation_step()[source]

Callback after validation_step function.

on_after_zero_grad(optimizer: Optimizer)[source]

Callback after optimizer resets gradients.

Parameters:

optimizer (Optimizer) – Wrapped optimizer.

on_before_backward(loss: Tensor)[source]

Callback before engine’s backward.

Parameters:

loss (torch.Tensor) – Scalar loss tensor.

on_before_optimizer_step(optimizer: Optimizer)[source]

Callback before optimizer steps.

Parameters:

optimizer (Optimizer) – Wrapped optimizer.

on_before_scheduler_step(scheduler: LRScheduler)[source]

Callback before scheduler steps.

Parameters:

scheduler (LRScheduler) – Wrapped scheduler.

on_before_training_step(batch: Any)[source]

Callback before training_step function.

Parameters:

batch (Any) – Dataloader’s batch.

on_before_validation_step(batch: Any)[source]

Callback before validation_step function.

Parameters:

batch (Any) – Dataloader’s batch.

on_before_zero_grad(optimizer: Optimizer)[source]

Callback before optimizer resets gradients.

Parameters:

optimizer (Optimizer) – Wrapped optimizer.

on_cuda_out_of_memory(exception: Exception)[source]

Callback when process raises a RuntimeError exception with CUDA Out Of Memory.

Parameters:

exception (Exception) – Raised exception.

on_epoch_end()[source]

Callback when an epoch ends.

on_epoch_start()[source]

Callback when an epoch starts.

on_evaluation_end()[source]

Callback when evaluation ends.

on_evaluation_start()[source]

Callback when evaluation starts.

on_exception(exception: Exception)[source]

Callback when process raises any Exception other than RuntimeError and KeyboardInterrupt.

Parameters:

exception (Exception) – Raised exception.

on_fit_end()[source]

Callback when training process ends.

on_fit_start()[source]

Callback when training process starts.

on_keyboard_interrupt(exception: Exception)[source]

Callback when process raises a KeyboardInterrupt exception.

Parameters:

exception (Exception) – Raised exception.

on_resume()[source]

Callback when resuming training process.

on_runtime_error(exception: Exception)[source]

Callback when process raises a RuntimeError exception.

Parameters:

exception (Exception) – Raised exception.

on_save_checkpoint()[source]

Callback when saving checkpoint.

state: TrainingState = None
trainer = None

Metrics

class accmt.metrics.Metric(name: str, greater_is_better: bool = True, main_metric: str | None = None, do_checks: bool = True, cast: dtype | str | None = torch.float32)[source]

Compute metrics on main process.

__init__(name: str, greater_is_better: bool = True, main_metric: str | None = None, do_checks: bool = True, cast: dtype | str | None = torch.float32)[source]

Set a module to compute metrics. All computations are done in main process.

Parameters:
  • name (str) – Metric’s module name.

  • greater_is_better (bool, optional, defaults to True) – Specify whether a greater value of the main metric is better.

  • main_metric (str, optional, defaults to None) – Determine which is the main metric key in your compute output. By default, main metric key will be equal to the ‘name’ parameter.

  • do_checks (bool, optional, defaults to True) – Enable shape checks when appending metrics. This can be disabled for small speed improvements.

  • cast (dtype or str, optional, defaults to torch.float32) – Cast all floating point tensors to the desired dtype. If None, no casting will be done.

compute(*args: Tensor | dict[Any, Tensor]) dict[source]

Compute metrics with the given arguments. This function returns a dictionary containing the main metric value and others.

Example

```
def compute(self, predictions, references):
    # logic of how to calculate metrics here…
    return {
        "accuracy": 0.85,  # <-- this one is the main value
        "f1": 0.89,
    }
```

NOTE: In the previous example, the main metric is ‘accuracy’, and its value will be used along with ‘comparator’ to decide whether the metric is the best so far. By default, the main metric is set to the name of the metric itself. You can change this behaviour with ‘main_metric’ on class initialization.
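
How the main metric is picked from the compute output and compared can be sketched with two small functions. Both are hypothetical helpers for illustration: main_value and is_new_best are not accmt functions, and the comparison shown is only an assumption about what ‘comparator’ does with greater_is_better.

```python
# Hypothetical helpers (not part of accmt) illustrating main metric selection.

def main_value(compute_output, name, main_metric=None):
    """Pick the main metric; it defaults to the metric's own name."""
    key = main_metric if main_metric is not None else name
    return compute_output[key]


def is_new_best(candidate, best, greater_is_better=True):
    """Assumed comparison: direction depends on greater_is_better."""
    if best is None:  # nothing recorded yet
        return True
    return candidate > best if greater_is_better else candidate < best


output = {"accuracy": 0.85, "f1": 0.89}
print(main_value(output, name="accuracy"))
print(main_value(output, name="accuracy", main_metric="f1"))
print(is_new_best(0.85, best=0.80))
```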

class accmt.metrics.MetricParallel(name: str, greater_is_better: bool = True, main_metric: str | None = None, do_checks: bool = True)[source]

Compute metrics in parallel.

__init__(name: str, greater_is_better: bool = True, main_metric: str | None = None, do_checks: bool = True)[source]

Set a module to compute metrics. All computations are done in parallel. When reporting values, these are averaged between all the processes.

Parameters:
  • name (str) – Metric’s module name.

  • greater_is_better (bool, optional, defaults to True) – Specify whether a greater value of the main metric is better.

  • main_metric (str, optional, defaults to None) – Determine which is the main metric key in your compute output. By default, main metric key will be equal to the ‘name’ parameter.

  • do_checks (bool, optional, defaults to True) – Enable shape checks when appending metrics. This can be disabled for small speed improvements.

compute(*args: Tensor | dict[Any, Tensor]) dict

Compute metrics with the given arguments. This function returns a dictionary containing the main metric value and others.

Example

```
def compute(self, predictions, references):
    # logic of how to calculate metrics here…
    return {
        "accuracy": 0.85,  # <-- this one is the main value
        "f1": 0.89,
    }
```

NOTE: In the previous example, the main metric is ‘accuracy’; its value is used along with ‘comparator’ to decide whether the metric is the best so far. By default, the main metric is set to the name of the metric itself. You can change this behaviour with ‘main_metric’ on class initialization.
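The averaging that MetricParallel applies when reporting values can be sketched as follows. This is only an illustration under assumptions: `average_across_processes` is a hypothetical helper, and plain dictionaries stand in for the per-process compute outputs that the library would gather itself.

```python
def average_across_processes(per_process_results):
    """Average each metric key over the results produced by every process."""
    n = len(per_process_results)
    keys = per_process_results[0].keys()
    return {k: sum(result[k] for result in per_process_results) / n for k in keys}


# One compute() output per process (two processes in this toy case).
results = [
    {"accuracy": 0.5, "f1": 0.25},
    {"accuracy": 1.0, "f1": 0.75},
]
averaged = average_across_processes(results)
```

Note that a plain mean of per-process values equals the global metric only when every process sees the same number of samples and the metric is itself a mean over samples.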

DataCollators

class accmt.collate_fns.DataCollatorForSeq2Seq(tokenizer: Any, label_pad_token_id: int = -100)[source]

Automatically adds efficient padding for ‘inputs’, ‘attention_mask’ and ‘labels’. This works for multiple inputs from the dataset logic. If any of the objects does not correspond to a dictionary-like structure of a decoded tokenizer’s output, it will apply the default collate function derived from PyTorch.

For a dictionary-like object containing the key ‘input_ids’, the output will have the following keys:
  • input_ids

  • attention_mask (if found)

  • labels (if found)

This implementation derives from transformers library: https://github.com/huggingface/transformers/blob/main/src/transformers/data/data_collator.py#L543

Parameters:
  • tokenizer (Any) – Tokenizer using HuggingFace standard.

  • label_pad_token_id (int, optional, defaults to -100) – Label pad token id. Labels with this value will be ignored in the training process.
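The padding behaviour described above can be sketched without the library. In this sketch, plain Python lists stand in for tensors, `pad_seq2seq` is a hypothetical name, and right padding with a pad token id of 0 is an assumption; the real collator takes these details from the tokenizer.

```python
def pad_seq2seq(features, pad_token_id=0, label_pad_token_id=-100):
    """Pad inputs and labels to the longest sequence in the batch (right padding)."""
    max_inputs = max(len(f["input_ids"]) for f in features)
    max_labels = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in = max_inputs - len(f["input_ids"])
        n_lab = max_labels - len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_in)
        # 1 marks real tokens, 0 marks padding
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_in)
        # padded label positions carry label_pad_token_id so the loss skips them
        batch["labels"].append(f["labels"] + [label_pad_token_id] * n_lab)
    return batch


batch = pad_seq2seq([
    {"input_ids": [5, 6, 7], "labels": [1]},
    {"input_ids": [8], "labels": [2, 3]},
])
```

Padding only up to the longest sequence in each batch, rather than to a fixed maximum length, is what keeps the padding efficient.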

__init__(tokenizer: Any, label_pad_token_id: int = -100)[source]
class accmt.collate_fns.DataCollatorForLanguageModeling(tokenizer: Any, mlm: bool = True, mlm_probability: float = 0.15, ignore_index: int = -100, masked_to_mask: float = 0.8, apply_random_words: bool = True, force_one_output: bool = False)[source]

Collator function to implement automatic language modeling, such as Masked Language Modeling.

Parameters:
  • tokenizer (Any) – Tokenizer using HuggingFace standard.

  • mlm (bool, optional, defaults to True) – Implements Masked Language Modeling.

  • mlm_probability (float, optional, defaults to 0.15) – Probability of selecting each input token for masking in Masked Language Modeling.

  • ignore_index (int, optional, defaults to -100) – Label pad token id. Labels with this value will be ignored in the training process.

  • masked_to_mask (float, optional, defaults to 0.8) – Probability of replacing a masked input token with the mask token. Of the remaining probability, one half replaces the masked input token with a random word and the other half keeps it unchanged. If apply_random_words is set to False, the entire remainder keeps the token unchanged.

  • apply_random_words (bool, optional, defaults to True) – Whether to apply random words during Masked Language Modeling.

  • force_one_output (bool, optional, defaults to False) – Whether to force a single output. If the Dataset object’s __getitem__ function returns a tuple, only the first element will be considered and extra targets will be dropped.
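How mlm_probability, masked_to_mask and apply_random_words interact can be illustrated with a standalone sketch. `mask_tokens`, `mask_token_id`, `vocab_size` and `rng` are hypothetical names introduced here; the sketch works on plain lists and, unlike typical real implementations, does not exclude special tokens.

```python
import random


def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_probability=0.15,
                masked_to_mask=0.8, apply_random_words=True,
                ignore_index=-100, rng=None):
    """Return (masked inputs, labels) for one sequence of token ids."""
    rng = rng or random.Random()
    inputs, labels = list(input_ids), []
    for i, token in enumerate(input_ids):
        if rng.random() >= mlm_probability:
            labels.append(ignore_index)  # not selected: ignored by the loss
            continue
        labels.append(token)  # selected: the model must predict the original
        r = rng.random()
        if r < masked_to_mask:
            inputs[i] = mask_token_id  # e.g. 80%: replace with the mask token
        elif apply_random_words and r < masked_to_mask + (1 - masked_to_mask) / 2:
            inputs[i] = rng.randrange(vocab_size)  # half the rest: random word
        # otherwise the selected token is left unchanged
    return inputs, labels


masked_inputs, labels = mask_tokens([11, 12, 13, 14], mask_token_id=99,
                                    vocab_size=100, rng=random.Random(0))
```

With the defaults this gives the familiar 80% mask, 10% random, 10% unchanged split among the selected tokens.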

__init__(tokenizer: Any, mlm: bool = True, mlm_probability: float = 0.15, ignore_index: int = -100, masked_to_mask: float = 0.8, apply_random_words: bool = True, force_one_output: bool = False) dict | tuple[dict, Tensor][source]
class accmt.collate_fns.DataCollatorForLongestSequence(tokenizer: Any, torch_stack: bool = True)[source]

Automatically adds efficient padding for inputs, while preserving static labels.

If output of __getitem__ Dataset logic looks like:

return x, y (x being a dictionary containing keys input_ids and attention_mask)

then the output of the collator function will be (x, y), x being the padded inputs with the same keys and y the stacked labels.

If output of __getitem__ Dataset logic looks like:

return x (x being a dictionary containing keys input_ids and attention_mask)

then the output of the collator function will be x, being the padded inputs with the same keys.

NOTE: This collator should be used when labels on your dataset logic are not sequences. If that’s the case, see DataCollatorForSeq2Seq.
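The two cases above can be sketched as follows. Lists stand in for tensors, `collate_longest` is a hypothetical name, and right padding with a pad token id of 0 is an assumption; the labels are simply gathered here, where the collator would stack them.

```python
def collate_longest(samples, pad_token_id=0):
    """Pad inputs to the longest sequence; pass static labels through untouched."""
    has_labels = isinstance(samples[0], tuple)
    xs = [s[0] for s in samples] if has_labels else samples
    longest = max(len(x["input_ids"]) for x in xs)
    padded = {"input_ids": [], "attention_mask": []}
    for x in xs:
        pad = longest - len(x["input_ids"])
        padded["input_ids"].append(x["input_ids"] + [pad_token_id] * pad)
        padded["attention_mask"].append(x["attention_mask"] + [0] * pad)
    if has_labels:
        return padded, [s[1] for s in samples]  # (x, y) case
    return padded  # x-only case


x, y = collate_longest([
    ({"input_ids": [5, 6], "attention_mask": [1, 1]}, 0),
    ({"input_ids": [7], "attention_mask": [1]}, 1),
])
```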

Parameters:

tokenizer (Any) – Tokenizer using HuggingFace standard.

__init__(tokenizer: Any, torch_stack: bool = True)[source]

Monitor

class accmt.monitor.Monitor(learning_rate: bool = False, epoch: bool = True, train_loss: bool = True, validation_loss: bool = True, additional_metrics: bool = True, grad_norm: bool = False, gpu_utilization: bool = False, cpu_utilization: bool = False, checkpoint: bool = False)[source]

Class to set metrics to monitor during training using a tracker (if implemented).

Parameters:
  • learning_rate (bool, optional, defaults to False) – Monitor learning rate.

  • epoch (bool, optional, defaults to True) – Monitor current epoch.

  • train_loss (bool, optional, defaults to True) – Monitor training loss.

  • validation_loss (bool, optional, defaults to True) – Monitor validation loss.

  • additional_metrics (bool, optional, defaults to True) – Monitor additional metrics (such as accuracy) if implemented.

  • grad_norm (bool, optional, defaults to False) – Enable monitoring of the gradient norm. This feature is not yet supported when running with DeepSpeed.

  • gpu_utilization (bool, optional, defaults to False) – Monitor GPU utilization in GB. It only reports GPU from main process (for now).

  • cpu_utilization (bool, optional, defaults to False) – Monitor CPU utilization in GB. It only reports CPU from main process (for now).

  • checkpoint (bool, optional, defaults to False) – Monitor checkpoint.

__init__(learning_rate: bool = False, epoch: bool = True, train_loss: bool = True, validation_loss: bool = True, additional_metrics: bool = True, grad_norm: bool = False, gpu_utilization: bool = False, cpu_utilization: bool = False, checkpoint: bool = False)[source]

HyperParameters

class accmt.hyperparameters.HyperParameters(epochs: int = 1, max_steps: int | None = None, batch_size: int | tuple[int] = 1, optimizer: str | Optimizer = 'SGD', optim_kwargs: dict | None = None, scheduler: str | Scheduler | None = None, scheduler_kwargs: dict | None = None, step_scheduler_per_epoch: bool = False)[source]

Class to set hyperparameters for training.

Parameters:
  • epochs (int, optional, defaults to 1) – Number of epochs (how many times we run the model over the dataset).

  • max_steps (int, optional, defaults to None) – Maximum number of steps to train for. If set, overrides epochs.

  • batch_size (int or tuple, optional, defaults to 1) –

    Batch size (how many samples are passed to the model at the same time). This can also be a tuple, the first element indicating batch size during training, and the second element indicating batch size during evaluation.

    NOTE: This is not the effective batch size. The effective batch size is calculated by multiplying this value by the number of processes.

  • optimizer (str or Optimizer, optional, defaults to SGD) – Optimization algorithm. See documentation to check the available ones.

  • optim_kwargs (dict, optional, defaults to None) – Specific optimizer keyword arguments.

  • scheduler (str or Scheduler, optional, defaults to None) – Learning rate scheduler to implement.

  • scheduler_kwargs (dict, optional, defaults to None) – Specific scheduler keyword arguments.

  • step_scheduler_per_epoch (bool, optional, defaults to False) – Step scheduler per epoch instead of per step.
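The effective batch size mentioned in the batch_size note is simple arithmetic, sketched below. `effective_batch_size` is a hypothetical helper, not part of the library, and applying the same multiplication to the evaluation element of a tuple is an assumption.

```python
def effective_batch_size(batch_size, num_processes, training=True):
    """Per-process batch size multiplied by the number of processes."""
    if isinstance(batch_size, tuple):
        train_bs, eval_bs = batch_size  # (training, evaluation)
        per_process = train_bs if training else eval_bs
    else:
        per_process = batch_size
    return per_process * num_processes


train_effective = effective_batch_size(8, num_processes=4)
eval_effective = effective_batch_size((8, 16), num_processes=2, training=False)
```

For example, a batch_size of 8 on 4 processes means 32 samples contribute to each optimization step.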

__init__(epochs: int = 1, max_steps: int | None = None, batch_size: int | tuple[int] = 1, optimizer: str | Optimizer = 'SGD', optim_kwargs: dict | None = None, scheduler: str | Scheduler | None = None, scheduler_kwargs: dict | None = None, step_scheduler_per_epoch: bool = False)[source]

Optimizers

class accmt.hyperparameters.Optimizer[source]
class ASGD(params: Iterable[Tensor] | Iterable[dict[str, Any]] | Iterable[tuple[str, Tensor]], lr: float | Tensor = 0.01, lambd: float = 0.0001, alpha: float = 0.75, t0: float = 1000000.0, weight_decay: float = 0, foreach: bool | None = None, maximize: bool = False, differentiable: bool = False, capturable: bool = False)

Implements Averaged Stochastic Gradient Descent.

It has been proposed in Acceleration of stochastic approximation by averaging.

Parameters:
  • params (iterable) – iterable of parameters or named_parameters to optimize or iterable of dicts defining parameter groups. When using named_parameters, all parameters in all groups should be named

  • lr (float, Tensor, optional) – learning rate (default: 1e-2)

  • lambd (float, optional) – decay term (default: 1e-4)

  • alpha (float, optional) – power for eta update (default: 0.75)

  • t0 (float, optional) – point at which to start averaging (default: 1e6)

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

  • foreach (bool, optional) – whether foreach implementation of optimizer is used. If unspecified by the user (so foreach is None), we will try to use foreach over the for-loop implementation on CUDA, since it is usually significantly more performant. Note that the foreach implementation uses ~ sizeof(params) more peak memory than the for-loop version due to the intermediates being a tensorlist vs just one tensor. If memory is prohibitive, batch fewer parameters through the optimizer at a time or switch this flag to False (default: None)

  • maximize (bool, optional) – maximize the objective with respect to the params, instead of minimizing (default: False)

  • differentiable (bool, optional) – whether autograd should occur through the optimizer step in training. Otherwise, the step() function runs in a torch.no_grad() context. Setting to True can impair performance, so leave it False if you don’t intend to run autograd through this instance (default: False)

  • capturable (bool, optional) – whether this instance is safe to capture in a CUDA graph. Passing True can impair ungraphed performance, so if you don’t intend to graph capture this instance, leave it False (default: False)

step(closure=None)

Perform a single optimization step.

Parameters:

closure (Callable, optional) – A closure that reevaluates the model and returns the loss.

class Adadelta(params: Iterable[Tensor] | Iterable[dict[str, Any]] | Iterable[tuple[str, Tensor]], lr: float | Tensor = 1.0, rho: float = 0.9, eps: float = 1e-06, weight_decay: float = 0, foreach: bool | None = None, *, capturable: bool = False, maximize: bool = False, differentiable: bool = False)

Implements Adadelta algorithm.

\[\begin{split}\begin{aligned} &\rule{110mm}{0.4pt} \\ &\textbf{input} : \gamma \text{ (lr)}, \: \theta_0 \text{ (params)}, \: f(\theta) \text{ (objective)}, \: \rho \text{ (decay)}, \: \lambda \text{ (weight decay)} \\ &\textbf{initialize} : v_0 \leftarrow 0 \: \text{ (square avg)}, \: u_0 \leftarrow 0 \: \text{ (accumulate variables)} \\[-1.ex] &\rule{110mm}{0.4pt} \\ &\textbf{for} \: t=1 \: \textbf{to} \: \ldots \: \textbf{do} \\ &\hspace{5mm}g_t \leftarrow \nabla_{\theta} f_t (\theta_{t-1}) \\ &\hspace{5mm}if \: \lambda \neq 0 \\ &\hspace{10mm} g_t \leftarrow g_t + \lambda \theta_{t-1} \\ &\hspace{5mm} v_t \leftarrow v_{t-1} \rho + g^2_t (1 - \rho) \\ &\hspace{5mm}\Delta x_t \leftarrow \frac{\sqrt{u_{t-1} + \epsilon }}{ \sqrt{v_t + \epsilon} }g_t \hspace{21mm} \\ &\hspace{5mm} u_t \leftarrow u_{t-1} \rho + \Delta x^2_t (1 - \rho) \\ &\hspace{5mm}\theta_t \leftarrow \theta_{t-1} - \gamma \Delta x_t \\ &\rule{110mm}{0.4pt} \\[-1.ex] &\bf{return} \: \theta_t \\[-1.ex] &\rule{110mm}{0.4pt} \\[-1.ex] \end{aligned}\end{split}\]

For further details regarding the algorithm we refer to ADADELTA: An Adaptive Learning Rate Method.

Parameters:
  • params (iterable) – iterable of parameters or named_parameters to optimize or iterable of dicts defining parameter groups. When using named_parameters, all parameters in all groups should be named

  • lr (float, Tensor, optional) – coefficient that scales delta before it is applied to the parameters (default: 1.0)

  • rho (float, optional) – coefficient used for computing a running average of squared gradients (default: 0.9). A higher value of rho will result in a slower average, which can be helpful for preventing oscillations in the learning process.

  • eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-6).

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

  • foreach (bool, optional) – whether foreach implementation of optimizer is used. If unspecified by the user (so foreach is None), we will try to use foreach over the for-loop implementation on CUDA, since it is usually significantly more performant. Note that the foreach implementation uses ~ sizeof(params) more peak memory than the for-loop version due to the intermediates being a tensorlist vs just one tensor. If memory is prohibitive, batch fewer parameters through the optimizer at a time or switch this flag to False (default: None)

  • capturable (bool, optional) – whether this instance is safe to capture in a CUDA graph. Passing True can impair ungraphed performance, so if you don’t intend to graph capture this instance, leave it False (default: False)

  • maximize (bool, optional) – maximize the objective with respect to the params, instead of minimizing (default: False)

  • differentiable (bool, optional) – whether autograd should occur through the optimizer step in training. Otherwise, the step() function runs in a torch.no_grad() context. Setting to True can impair performance, so leave it False if you don’t intend to run autograd through this instance (default: False)

step(closure=None)

Perform a single optimization step.

Parameters:

closure (Callable, optional) – A closure that reevaluates the model and returns the loss.
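The Adadelta pseudocode above can be traced on a single scalar parameter. The following is a minimal sketch of those update rules, not the PyTorch implementation; `adadelta_step` and the quadratic objective are illustrative choices.

```python
import math


def adadelta_step(theta, grad, v, u, lr=1.0, rho=0.9, eps=1e-6, weight_decay=0.0):
    """One Adadelta update for a scalar parameter, following the pseudocode."""
    if weight_decay != 0:
        grad = grad + weight_decay * theta
    v = rho * v + (1 - rho) * grad * grad  # running average of squared gradients
    delta = math.sqrt(u + eps) / math.sqrt(v + eps) * grad
    u = rho * u + (1 - rho) * delta * delta  # running average of squared updates
    theta = theta - lr * delta
    return theta, v, u


# Minimise f(theta) = theta**2, whose gradient is 2 * theta.
theta, v, u = 1.0, 0.0, 0.0
for _ in range(100):
    theta, v, u = adadelta_step(theta, 2 * theta, v, u)
```

Because u starts at zero, the first updates are of the order of sqrt(eps), which is why Adadelta tends to move slowly at the beginning.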

class Adafactor(params, lr=None, eps=(1e-30, 0.001), clip_threshold=1.0, decay_rate=-0.8, beta1=None, weight_decay=0.0, scale_parameter=True, relative_step=True, warmup_init=False)

AdaFactor PyTorch implementation that can be used as a drop-in replacement for Adam. Original fairseq code: https://github.com/pytorch/fairseq/blob/master/fairseq/optim/adafactor.py

Paper: Adafactor: Adaptive Learning Rates with Sublinear Memory Cost https://arxiv.org/abs/1804.04235

Note that this optimizer internally adjusts the learning rate depending on the scale_parameter, relative_step and warmup_init options. To use a manual (external) learning rate schedule you should set scale_parameter=False and relative_step=False.

Parameters:
  • params (Iterable[nn.parameter.Parameter]) – Iterable of parameters to optimize or dictionaries defining parameter groups.

  • lr (float, optional) – The external learning rate.

  • eps (Tuple[float, float], optional, defaults to (1e-30, 0.001)) – Regularization constants for square gradient and parameter scale respectively

  • clip_threshold (float, optional, defaults to 1.0) – Threshold of root mean square of final gradient update

  • decay_rate (float, optional, defaults to -0.8) – Coefficient used to compute running averages of the square gradient

  • beta1 (float, optional) – Coefficient used for computing running averages of gradient

  • weight_decay (float, optional, defaults to 0.0) – Weight decay (L2 penalty)

  • scale_parameter (bool, optional, defaults to True) – If True, learning rate is scaled by root mean square

  • relative_step (bool, optional, defaults to True) – If True, time-dependent learning rate is computed instead of external learning rate

  • warmup_init (bool, optional, defaults to False) – Time-dependent learning rate computation depends on whether warm-up initialization is being used

This implementation handles low-precision (FP16, bfloat) values, but it has not been thoroughly tested.

Recommended T5 finetuning settings (https://discuss.huggingface.co/t/t5-finetuning-tips/684/3):

  • Training without LR warmup or clip_threshold is not recommended.

  • Disable relative updates

  • Use scale_parameter=False

  • Additional optimizer operations like gradient clipping should not be used alongside Adafactor

Example:

`Adafactor(model.parameters(), scale_parameter=False, relative_step=False, warmup_init=False, lr=1e-3)`

Others reported the following combination to work well:

`Adafactor(model.parameters(), scale_parameter=True, relative_step=True, warmup_init=True, lr=None)`

When using lr=None with the transformers Trainer, you will most likely need to use the AdafactorSchedule scheduler as follows:

```python
from transformers.optimization import Adafactor, AdafactorSchedule

optimizer = Adafactor(model.parameters(), scale_parameter=True, relative_step=True, warmup_init=True, lr=None)
lr_scheduler = AdafactorSchedule(optimizer)
trainer = Trainer(…, optimizers=(optimizer, lr_scheduler))
```

Usage:

```python
# replace AdamW with Adafactor
optimizer = Adafactor(
    model.parameters(), lr=1e-3, eps=(1e-30, 1e-3), clip_threshold=1.0, decay_rate=-0.8,
    beta1=None, weight_decay=0.0, relative_step=False, scale_parameter=False, warmup_init=False,
)
```

step(closure=None)

Performs a single optimization step

Parameters:

closure (callable, optional) – A closure that reevaluates the model and returns the loss.

class Adagrad(params: Iterable[Tensor] | Iterable[dict[str, Any]] | Iterable[tuple[str, Tensor]], lr: float | Tensor = 0.01, lr_decay: float = 0, weight_decay: float = 0, initial_accumulator_value: float = 0, eps: float = 1e-10, foreach: bool | None = None, *, maximize: bool = False, differentiable: bool = False, fused: bool | None = None)

Implements Adagrad algorithm.

\[\begin{split}\begin{aligned} &\rule{110mm}{0.4pt} \\ &\textbf{input} : \gamma \text{ (lr)}, \: \theta_0 \text{ (params)}, \: f(\theta) \text{ (objective)}, \: \lambda \text{ (weight decay)}, \\ &\hspace{12mm} \tau \text{ (initial accumulator value)}, \: \eta\text{ (lr decay)}\\ &\textbf{initialize} : state\_sum_0 \leftarrow \tau \\[-1.ex] &\rule{110mm}{0.4pt} \\ &\textbf{for} \: t=1 \: \textbf{to} \: \ldots \: \textbf{do} \\ &\hspace{5mm}g_t \leftarrow \nabla_{\theta} f_t (\theta_{t-1}) \\ &\hspace{5mm} \tilde{\gamma} \leftarrow \gamma / (1 +(t-1) \eta) \\ &\hspace{5mm} \textbf{if} \: \lambda \neq 0 \\ &\hspace{10mm} g_t \leftarrow g_t + \lambda \theta_{t-1} \\ &\hspace{5mm}state\_sum_t \leftarrow state\_sum_{t-1} + g^2_t \\ &\hspace{5mm}\theta_t \leftarrow \theta_{t-1}- \tilde{\gamma} \frac{g_t}{\sqrt{state\_sum_t}+\epsilon} \\ &\rule{110mm}{0.4pt} \\[-1.ex] &\bf{return} \: \theta_t \\[-1.ex] &\rule{110mm}{0.4pt} \\[-1.ex] \end{aligned}\end{split}\]

For further details regarding the algorithm we refer to Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.

Parameters:
  • params (iterable) – iterable of parameters or named_parameters to optimize or iterable of dicts defining parameter groups. When using named_parameters, all parameters in all groups should be named

  • lr (float, Tensor, optional) – learning rate (default: 1e-2)

  • lr_decay (float, optional) – learning rate decay (default: 0)

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

  • initial_accumulator_value (float, optional) – initial value of the sum of squares of gradients (default: 0)

  • eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-10)

  • foreach (bool, optional) – whether foreach implementation of optimizer is used. If unspecified by the user (so foreach is None), we will try to use foreach over the for-loop implementation on CUDA, since it is usually significantly more performant. Note that the foreach implementation uses ~ sizeof(params) more peak memory than the for-loop version due to the intermediates being a tensorlist vs just one tensor. If memory is prohibitive, batch fewer parameters through the optimizer at a time or switch this flag to False (default: None)

  • maximize (bool, optional) – maximize the objective with respect to the params, instead of minimizing (default: False)

  • differentiable (bool, optional) – whether autograd should occur through the optimizer step in training. Otherwise, the step() function runs in a torch.no_grad() context. Setting to True can impair performance, so leave it False if you don’t intend to run autograd through this instance (default: False)

  • fused (bool, optional) – whether the fused implementation (CPU only) is used. Currently, torch.float64, torch.float32, torch.float16, and torch.bfloat16 are supported. (default: None). Please note that the fused implementation does not support sparse or complex gradients.

share_memory()
step(closure=None)

Perform a single optimization step.

Parameters:

closure (Callable, optional) – A closure that reevaluates the model and returns the loss.
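The Adagrad pseudocode above can be followed on a scalar parameter. This is a minimal sketch of those update rules rather than the PyTorch implementation; `adagrad_step` is an illustrative name and `t` is the 1-based step counter from the pseudocode.

```python
import math


def adagrad_step(theta, grad, state_sum, t, lr=0.01, lr_decay=0.0,
                 weight_decay=0.0, eps=1e-10):
    """One Adagrad update for a scalar parameter at step t (t starts at 1)."""
    clr = lr / (1 + (t - 1) * lr_decay)  # decayed learning rate
    if weight_decay != 0:
        grad = grad + weight_decay * theta
    state_sum = state_sum + grad * grad  # accumulated squared gradients
    theta = theta - clr * grad / (math.sqrt(state_sum) + eps)
    return theta, state_sum


# Minimise f(theta) = theta**2, whose gradient is 2 * theta.
theta, state_sum = 1.0, 0.0
for t in range(1, 11):
    theta, state_sum = adagrad_step(theta, 2 * theta, state_sum, t)
```

Since state_sum only grows, the per-step update shrinks over time even when lr_decay is 0.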

class Adam(params: Iterable[Tensor] | Iterable[dict[str, Any]] | Iterable[tuple[str, Tensor]], lr: float | Tensor = 0.001, betas: tuple[float | Tensor, float | Tensor] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0, amsgrad: bool = False, *, foreach: bool | None = None, maximize: bool = False, capturable: bool = False, differentiable: bool = False, fused: bool | None = None, decoupled_weight_decay: bool = False)

Implements Adam algorithm.

\[ \begin{align}\begin{aligned}\begin{split}\begin{aligned} &\rule{110mm}{0.4pt} \\ &\textbf{input} : \gamma \text{ (lr)}, \beta_1, \beta_2 \text{ (betas)},\theta_0 \text{ (params)},f(\theta) \text{ (objective)} \\ &\hspace{13mm} \lambda \text{ (weight decay)}, \: \textit{amsgrad}, \:\textit{maximize}, \: \epsilon \text{ (epsilon)} \\ &\textbf{initialize} : m_0 \leftarrow 0 \text{ ( first moment)}, v_0\leftarrow 0 \text{ (second moment)},\: v_0^{max}\leftarrow 0 \\[-1.ex] &\rule{110mm}{0.4pt} \\ &\textbf{for} \: t=1 \: \textbf{to} \: \ldots \: \textbf{do} \\\end{split}\\\begin{split} &\hspace{5mm}\textbf{if} \: \textit{maximize}: \\ &\hspace{10mm}g_t \leftarrow -\nabla_{\theta} f_t (\theta_{t-1}) \\ &\hspace{5mm}\textbf{else} \\ &\hspace{10mm}g_t \leftarrow \nabla_{\theta} f_t (\theta_{t-1}) \\ &\hspace{5mm}\textbf{if} \: \lambda \neq 0 \\ &\hspace{10mm} g_t \leftarrow g_t + \lambda \theta_{t-1} \\ &\hspace{5mm}m_t \leftarrow \beta_1 m_{t-1} + (1 - \beta_1) g_t \\ &\hspace{5mm}v_t \leftarrow \beta_2 v_{t-1} + (1-\beta_2) g^2_t \\ &\hspace{5mm}\widehat{m_t} \leftarrow m_t/\big(1-\beta_1^t \big) \\ &\hspace{5mm}\textbf{if} \: amsgrad \\ &\hspace{10mm} v_t^{max} \leftarrow \mathrm{max}(v_{t-1}^{max},v_t) \\ &\hspace{10mm}\widehat{v_t} \leftarrow v_t^{max}/\big(1-\beta_2^t \big) \\ &\hspace{5mm}\textbf{else} \\ &\hspace{10mm}\widehat{v_t} \leftarrow v_t/\big(1-\beta_2^t \big) \\ &\hspace{5mm}\theta_t \leftarrow \theta_{t-1} - \gamma \widehat{m_t}/ \big(\sqrt{\widehat{v_t}} + \epsilon \big) \\ &\rule{110mm}{0.4pt} \\[-1.ex] &\bf{return} \: \theta_t \\[-1.ex] &\rule{110mm}{0.4pt} \\[-1.ex] \end{aligned}\end{split}\end{aligned}\end{align} \]

For further details regarding the algorithm we refer to Adam: A Method for Stochastic Optimization.

Parameters:
  • params (iterable) – iterable of parameters or named_parameters to optimize or iterable of dicts defining parameter groups. When using named_parameters, all parameters in all groups should be named

  • lr (float, Tensor, optional) – learning rate (default: 1e-3). A tensor LR is not yet supported for all our implementations. Please use a float LR if you are not also specifying fused=True or capturable=True.

  • betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))

  • eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

  • decoupled_weight_decay (bool, optional) – if True, this optimizer is equivalent to AdamW and the algorithm will not accumulate weight decay in the momentum nor variance. (default: False)

  • amsgrad (bool, optional) – whether to use the AMSGrad variant of this algorithm from the paper On the Convergence of Adam and Beyond (default: False)

  • foreach (bool, optional) – whether foreach implementation of optimizer is used. If unspecified by the user (so foreach is None), we will try to use foreach over the for-loop implementation on CUDA, since it is usually significantly more performant. Note that the foreach implementation uses ~ sizeof(params) more peak memory than the for-loop version due to the intermediates being a tensorlist vs just one tensor. If memory is prohibitive, batch fewer parameters through the optimizer at a time or switch this flag to False (default: None)

  • maximize (bool, optional) – maximize the objective with respect to the params, instead of minimizing (default: False)

  • capturable (bool, optional) – whether this instance is safe to capture in a CUDA graph. Passing True can impair ungraphed performance, so if you don’t intend to graph capture this instance, leave it False (default: False)

  • differentiable (bool, optional) – whether autograd should occur through the optimizer step in training. Otherwise, the step() function runs in a torch.no_grad() context. Setting to True can impair performance, so leave it False if you don’t intend to run autograd through this instance (default: False)

  • fused (bool, optional) – whether the fused implementation is used. Currently, torch.float64, torch.float32, torch.float16, and torch.bfloat16 are supported. (default: None)

Note

The foreach and fused implementations are typically faster than the for-loop, single-tensor implementation, with fused being theoretically fastest with both vertical and horizontal fusion. As such, if the user has not specified either flag (i.e., when foreach = fused = None), we will attempt defaulting to the foreach implementation when the tensors are all on CUDA. Why not fused? Since the fused implementation is relatively new, we want to give it sufficient bake-in time. To specify fused, pass True for fused. To force running the for-loop implementation, pass False for either foreach or fused.

Note

A prototype implementation of Adam and AdamW for MPS supports torch.float32 and torch.float16.

step(closure=None)

Perform a single optimization step.

Parameters:

closure (Callable, optional) – A closure that reevaluates the model and returns the loss.
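The Adam pseudocode above, including the amsgrad and maximize branches, can be written out for a scalar parameter. This is a minimal sketch of those update rules, not the PyTorch implementation; `adam_step` is an illustrative name.

```python
import math


def adam_step(theta, grad, m, v, v_max, t, lr=1e-3, betas=(0.9, 0.999),
              eps=1e-8, weight_decay=0.0, amsgrad=False, maximize=False):
    """One Adam update for a scalar parameter at step t (t starts at 1)."""
    beta1, beta2 = betas
    if maximize:
        grad = -grad
    if weight_decay != 0:
        grad = grad + weight_decay * theta  # L2 penalty folded into the gradient
    m = beta1 * m + (1 - beta1) * grad  # first moment
    v = beta2 * v + (1 - beta2) * grad * grad  # second moment
    m_hat = m / (1 - beta1 ** t)  # bias correction
    if amsgrad:
        v_max = max(v_max, v)
        v_hat = v_max / (1 - beta2 ** t)
    else:
        v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v, v_max


# Minimise f(theta) = theta**2, whose gradient is 2 * theta.
theta, m, v, v_max = 1.0, 0.0, 0.0, 0.0
for t in range(1, 11):
    theta, m, v, v_max = adam_step(theta, 2 * theta, m, v, v_max, t)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly lr regardless of the gradient's magnitude.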

class AdamW(params: Iterable[Tensor] | Iterable[dict[str, Any]] | Iterable[tuple[str, Tensor]], lr: float | Tensor = 0.001, betas: tuple[float | Tensor, float | Tensor] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0.01, amsgrad: bool = False, *, maximize: bool = False, foreach: bool | None = None, capturable: bool = False, differentiable: bool = False, fused: bool | None = None)

Implements AdamW algorithm, where weight decay does not accumulate in the momentum nor variance.

\[ \begin{align}\begin{aligned}\begin{split}\begin{aligned} &\rule{110mm}{0.4pt} \\ &\textbf{input} : \gamma \text{(lr)}, \: \beta_1, \beta_2 \text{(betas)}, \: \theta_0 \text{(params)}, \: f(\theta) \text{(objective)}, \: \epsilon \text{ (epsilon)} \\ &\hspace{13mm} \lambda \text{(weight decay)}, \: \textit{amsgrad}, \: \textit{maximize} \\ &\textbf{initialize} : m_0 \leftarrow 0 \text{ (first moment)}, v_0 \leftarrow 0 \text{ ( second moment)}, \: v_0^{max}\leftarrow 0 \\[-1.ex] &\rule{110mm}{0.4pt} \\ &\textbf{for} \: t=1 \: \textbf{to} \: \ldots \: \textbf{do} \\\end{split}\\\begin{split} &\hspace{5mm}\textbf{if} \: \textit{maximize}: \\ &\hspace{10mm}g_t \leftarrow -\nabla_{\theta} f_t (\theta_{t-1}) \\ &\hspace{5mm}\textbf{else} \\ &\hspace{10mm}g_t \leftarrow \nabla_{\theta} f_t (\theta_{t-1}) \\ &\hspace{5mm} \theta_t \leftarrow \theta_{t-1} - \gamma \lambda \theta_{t-1} \\ &\hspace{5mm}m_t \leftarrow \beta_1 m_{t-1} + (1 - \beta_1) g_t \\ &\hspace{5mm}v_t \leftarrow \beta_2 v_{t-1} + (1-\beta_2) g^2_t \\ &\hspace{5mm}\widehat{m_t} \leftarrow m_t/\big(1-\beta_1^t \big) \\ &\hspace{5mm}\textbf{if} \: amsgrad \\ &\hspace{10mm} v_t^{max} \leftarrow \mathrm{max}(v_{t-1}^{max},v_t) \\ &\hspace{10mm}\widehat{v_t} \leftarrow v_t^{max}/\big(1-\beta_2^t \big) \\ &\hspace{5mm}\textbf{else} \\ &\hspace{10mm}\widehat{v_t} \leftarrow v_t/\big(1-\beta_2^t \big) \\ &\hspace{5mm}\theta_t \leftarrow \theta_t - \gamma \widehat{m_t}/ \big(\sqrt{\widehat{v_t}} + \epsilon \big) \\ &\rule{110mm}{0.4pt} \\[-1.ex] &\bf{return} \: \theta_t \\[-1.ex] &\rule{110mm}{0.4pt} \\[-1.ex] \end{aligned}\end{split}\end{aligned}\end{align} \]

For further details regarding the algorithm we refer to Decoupled Weight Decay Regularization.

Parameters:
  • params (iterable) – iterable of parameters or named_parameters to optimize or iterable of dicts defining parameter groups. When using named_parameters, all parameters in all groups should be named

  • lr (float, Tensor, optional) – learning rate (default: 1e-3). A tensor LR is not yet supported for all our implementations. Please use a float LR if you are not also specifying fused=True or capturable=True.

  • betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))

  • eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)

  • weight_decay (float, optional) – weight decay coefficient (default: 1e-2)

  • amsgrad (bool, optional) – whether to use the AMSGrad variant of this algorithm from the paper On the Convergence of Adam and Beyond (default: False)

  • maximize (bool, optional) – maximize the objective with respect to the params, instead of minimizing (default: False)

  • foreach (bool, optional) – whether foreach implementation of optimizer is used. If unspecified by the user (so foreach is None), we will try to use foreach over the for-loop implementation on CUDA, since it is usually significantly more performant. Note that the foreach implementation uses ~ sizeof(params) more peak memory than the for-loop version due to the intermediates being a tensorlist vs just one tensor. If memory is prohibitive, batch fewer parameters through the optimizer at a time or switch this flag to False (default: None)

  • capturable (bool, optional) – whether this instance is safe to capture in a CUDA graph. Passing True can impair ungraphed performance, so if you don’t intend to graph capture this instance, leave it False (default: False)

  • differentiable (bool, optional) – whether autograd should occur through the optimizer step in training. Otherwise, the step() function runs in a torch.no_grad() context. Setting to True can impair performance, so leave it False if you don’t intend to run autograd through this instance (default: False)

  • fused (bool, optional) – whether the fused implementation is used. Currently, torch.float64, torch.float32, torch.float16, and torch.bfloat16 are supported. (default: None)

Note

The foreach and fused implementations are typically faster than the for-loop, single-tensor implementation, with fused being theoretically fastest with both vertical and horizontal fusion. As such, if the user has not specified either flag (i.e., when foreach = fused = None), we will attempt defaulting to the foreach implementation when the tensors are all on CUDA. Why not fused? Since the fused implementation is relatively new, we want to give it sufficient bake-in time. To specify fused, pass True for fused. To force running the for-loop implementation, pass False for either foreach or fused.

Note

A prototype implementation of Adam and AdamW for MPS supports torch.float32 and torch.float16.
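The decoupled weight decay in the AdamW pseudocode can be seen in a scalar sketch. This follows those update rules but omits the amsgrad and maximize branches for brevity; `adamw_step` is an illustrative name, not the PyTorch implementation.

```python
import math


def adamw_step(theta, grad, m, v, t, lr=1e-3, betas=(0.9, 0.999),
               eps=1e-8, weight_decay=1e-2):
    """One AdamW update for a scalar parameter at step t (t starts at 1)."""
    beta1, beta2 = betas
    # Decoupled decay: applied to the parameter directly, never added to grad,
    # so it does not accumulate in the moment estimates m and v.
    theta = theta - lr * weight_decay * theta
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# With a zero gradient, only the decay term moves the parameter.
theta, m, v = adamw_step(1.0, 0.0, 0.0, 0.0, t=1, lr=0.1, weight_decay=0.5)
```

In the zero-gradient call the parameter shrinks by the factor (1 - lr * weight_decay) while m and v stay at zero, which is the difference from Adam's L2 penalty.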

class Adamax(params: Iterable[Tensor] | Iterable[dict[str, Any]] | Iterable[tuple[str, Tensor]], lr: float | Tensor = 0.002, betas: tuple[float, float] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0, foreach: bool | None = None, *, maximize: bool = False, differentiable: bool = False, capturable: bool = False)

Implements Adamax algorithm (a variant of Adam based on infinity norm).

\[\begin{split}\begin{aligned} &\rule{110mm}{0.4pt} \\ &\textbf{input} : \gamma \text{ (lr)}, \beta_1, \beta_2 \text{ (betas)},\theta_0 \text{ (params)},f(\theta) \text{ (objective)}, \: \lambda \text{ (weight decay)}, \\ &\hspace{13mm} \epsilon \text{ (epsilon)} \\ &\textbf{initialize} : m_0 \leftarrow 0 \text{ ( first moment)}, u_0 \leftarrow 0 \text{ ( infinity norm)} \\[-1.ex] &\rule{110mm}{0.4pt} \\ &\textbf{for} \: t=1 \: \textbf{to} \: \ldots \: \textbf{do} \\ &\hspace{5mm}g_t \leftarrow \nabla_{\theta} f_t (\theta_{t-1}) \\ &\hspace{5mm}if \: \lambda \neq 0 \\ &\hspace{10mm} g_t \leftarrow g_t + \lambda \theta_{t-1} \\ &\hspace{5mm}m_t \leftarrow \beta_1 m_{t-1} + (1 - \beta_1) g_t \\ &\hspace{5mm}u_t \leftarrow \mathrm{max}(\beta_2 u_{t-1}, |g_{t}|+\epsilon) \\ &\hspace{5mm}\theta_t \leftarrow \theta_{t-1} - \frac{\gamma m_t}{(1-\beta^t_1) u_t} \\ &\rule{110mm}{0.4pt} \\[-1.ex] &\bf{return} \: \theta_t \\[-1.ex] &\rule{110mm}{0.4pt} \\[-1.ex] \end{aligned}\end{split}\]

For further details regarding the algorithm we refer to Adam: A Method for Stochastic Optimization.
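To make the pseudocode above concrete, here is a minimal scalar sketch of one Adamax update in plain Python. It follows the listed steps for a single parameter; `adamax_step` is a hypothetical helper written for illustration and is not part of the library.

```python
def adamax_step(theta, grad, m, u, t, lr=2e-3, betas=(0.9, 0.999),
                eps=1e-8, weight_decay=0.0):
    """One Adamax update for a single scalar parameter (illustrative only)."""
    beta1, beta2 = betas
    if weight_decay != 0:
        grad = grad + weight_decay * theta      # L2 penalty added to the gradient
    m = beta1 * m + (1 - beta1) * grad          # first moment
    u = max(beta2 * u, abs(grad) + eps)         # exponentially weighted infinity norm
    theta = theta - lr * m / ((1 - beta1 ** t) * u)  # bias-corrected update
    return theta, m, u


# First step from theta = 0 with gradient 1: the update is close to -lr.
theta, m, u = adamax_step(0.0, 1.0, m=0.0, u=0.0, t=1)
```

Because the step divides by the infinity norm rather than by a root-mean-square estimate, the first update has magnitude close to lr regardless of the gradient scale.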

Parameters:
  • params (iterable) – iterable of parameters or named_parameters to optimize or iterable of dicts defining parameter groups. When using named_parameters, all parameters in all groups should be named

  • lr (float, Tensor, optional) – learning rate (default: 2e-3)

  • betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))

  • eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

  • foreach (bool, optional) – whether foreach implementation of optimizer is used. If unspecified by the user (so foreach is None), we will try to use foreach over the for-loop implementation on CUDA, since it is usually significantly more performant. Note that the foreach implementation uses ~ sizeof(params) more peak memory than the for-loop version due to the intermediates being a tensorlist vs just one tensor. If memory is prohibitive, batch fewer parameters through the optimizer at a time or switch this flag to False (default: None)

  • maximize (bool, optional) – maximize the objective with respect to the params, instead of minimizing (default: False)

  • differentiable (bool, optional) – whether autograd should occur through the optimizer step in training. Otherwise, the step() function runs in a torch.no_grad() context. Setting to True can impair performance, so leave it False if you don’t intend to run autograd through this instance (default: False)

  • capturable (bool, optional) – whether this instance is safe to capture in a CUDA graph. Passing True can impair ungraphed performance, so if you don’t intend to graph capture this instance, leave it False (default: False)

step(closure=None)

Perform a single optimization step.

Parameters:

closure (Callable, optional) – A closure that reevaluates the model and returns the loss.

class NAdam(params: Iterable[Tensor] | Iterable[dict[str, Any]] | Iterable[tuple[str, Tensor]], lr: float | Tensor = 0.002, betas: tuple[float, float] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0, momentum_decay: float = 0.004, decoupled_weight_decay: bool = False, *, foreach: bool | None = None, maximize: bool = False, capturable: bool = False, differentiable: bool = False)

Implements NAdam algorithm.

\[\begin{split}\begin{aligned} &\rule{110mm}{0.4pt} \\ &\textbf{input} : \gamma_t \text{ (lr)}, \: \beta_1,\beta_2 \text{ (betas)}, \: \theta_0 \text{ (params)}, \: f(\theta) \text{ (objective)} \\ &\hspace{13mm} \: \lambda \text{ (weight decay)}, \:\psi \text{ (momentum decay)} \\ &\hspace{13mm} \: \textit{decoupled\_weight\_decay}, \:\textit{maximize} \\ &\textbf{initialize} : m_0 \leftarrow 0 \text{ ( first moment)}, v_0 \leftarrow 0 \text{ ( second moment)} \\[-1.ex] &\rule{110mm}{0.4pt} \\ &\textbf{for} \: t=1 \: \textbf{to} \: \ldots \: \textbf{do} \\ &\hspace{5mm}\textbf{if} \: \textit{maximize}: \\ &\hspace{10mm}g_t \leftarrow -\nabla_{\theta} f_t (\theta_{t-1}) \\ &\hspace{5mm}\textbf{else} \\ &\hspace{10mm}g_t \leftarrow \nabla_{\theta} f_t (\theta_{t-1}) \\ &\hspace{5mm} \theta_t \leftarrow \theta_{t-1} \\ &\hspace{5mm} \textbf{if} \: \lambda \neq 0 \\ &\hspace{10mm}\textbf{if} \: \textit{decoupled\_weight\_decay} \\ &\hspace{15mm} \theta_t \leftarrow \theta_{t-1} - \gamma \lambda \theta_{t-1} \\ &\hspace{10mm}\textbf{else} \\ &\hspace{15mm} g_t \leftarrow g_t + \lambda \theta_{t-1} \\ &\hspace{5mm} \mu_t \leftarrow \beta_1 \big(1 - \frac{1}{2} 0.96^{t \psi} \big) \\ &\hspace{5mm} \mu_{t+1} \leftarrow \beta_1 \big(1 - \frac{1}{2} 0.96^{(t+1)\psi}\big)\\ &\hspace{5mm}m_t \leftarrow \beta_1 m_{t-1} + (1 - \beta_1) g_t \\ &\hspace{5mm}v_t \leftarrow \beta_2 v_{t-1} + (1-\beta_2) g^2_t \\ &\hspace{5mm}\widehat{m_t} \leftarrow \mu_{t+1} m_t/(1-\prod_{i=1}^{t+1}\mu_i)\\[-1.ex] & \hspace{11mm} + (1-\mu_t) g_t /(1-\prod_{i=1}^{t} \mu_{i}) \\ &\hspace{5mm}\widehat{v_t} \leftarrow v_t/\big(1-\beta_2^t \big) \\ &\hspace{5mm}\theta_t \leftarrow \theta_t - \gamma \widehat{m_t}/ \big(\sqrt{\widehat{v_t}} + \epsilon \big) \\ &\rule{110mm}{0.4pt} \\[-1.ex] &\bf{return} \: \theta_t \\[-1.ex] &\rule{110mm}{0.4pt} \\[-1.ex] \end{aligned}\end{split}\]

For further details regarding the algorithm we refer to Incorporating Nesterov Momentum into Adam.
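The momentum schedule \(\mu_t\) in the pseudocode is the part that distinguishes NAdam from Adam. The following plain-Python sketch evaluates it with the default hyperparameters; `nadam_mu` is a hypothetical helper name used only for illustration.

```python
def nadam_mu(t, beta1=0.9, momentum_decay=4e-3):
    """Momentum coefficient mu_t = beta1 * (1 - 0.5 * 0.96 ** (t * psi))."""
    return beta1 * (1.0 - 0.5 * 0.96 ** (t * momentum_decay))


# The coefficient starts near beta1 / 2 and rises towards beta1 as t grows.
early = nadam_mu(1)
late = nadam_mu(10 ** 6)
```

In other words, momentum_decay controls how quickly the effective momentum warms up from roughly half of beta1 to its full value.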

Parameters:
  • params (iterable) – iterable of parameters or named_parameters to optimize or iterable of dicts defining parameter groups. When using named_parameters, all parameters in all groups should be named

  • lr (float, Tensor, optional) – learning rate (default: 2e-3)

  • betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))

  • eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

  • momentum_decay (float, optional) – momentum decay (default: 4e-3)

  • decoupled_weight_decay (bool, optional) – whether to decouple the weight decay as in AdamW to obtain NAdamW. If True, the algorithm does not accumulate weight decay in the momentum nor variance. (default: False)

  • foreach (bool, optional) – whether foreach implementation of optimizer is used. If unspecified by the user (so foreach is None), we will try to use foreach over the for-loop implementation on CUDA, since it is usually significantly more performant. Note that the foreach implementation uses ~ sizeof(params) more peak memory than the for-loop version due to the intermediates being a tensorlist vs just one tensor. If memory is prohibitive, batch fewer parameters through the optimizer at a time or switch this flag to False (default: None)

  • maximize (bool, optional) – maximize the objective with respect to the params, instead of minimizing (default: False)

  • capturable (bool, optional) – whether this instance is safe to capture in a CUDA graph. Passing True can impair ungraphed performance, so if you don’t intend to graph capture this instance, leave it False (default: False)

  • differentiable (bool, optional) – whether autograd should occur through the optimizer step in training. Otherwise, the step() function runs in a torch.no_grad() context. Setting to True can impair performance, so leave it False if you don’t intend to run autograd through this instance (default: False)

step(closure=None)

Perform a single optimization step.

Parameters:

closure (Callable, optional) – A closure that reevaluates the model and returns the loss.

class RAdam(params: Iterable[Tensor] | Iterable[dict[str, Any]] | Iterable[tuple[str, Tensor]], lr: float | Tensor = 0.001, betas: tuple[float, float] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0, decoupled_weight_decay: bool = False, *, foreach: bool | None = None, maximize: bool = False, capturable: bool = False, differentiable: bool = False)

Implements RAdam algorithm.

\[\begin{split}\begin{aligned} &\rule{110mm}{0.4pt} \\ &\textbf{input} : \gamma \text{ (lr)}, \: \beta_1, \beta_2 \text{ (betas)}, \: \theta_0 \text{ (params)}, \:f(\theta) \text{ (objective)}, \: \lambda \text{ (weight decay)}, \:\textit{maximize} \\ &\hspace{13mm} \epsilon \text{ (epsilon)}, \textit{decoupled\_weight\_decay} \\ &\textbf{initialize} : m_0 \leftarrow 0 \text{ ( first moment)}, v_0 \leftarrow 0 \text{ ( second moment)}, \\ &\hspace{18mm} \rho_{\infty} \leftarrow 2/(1-\beta_2) -1 \\[-1.ex] &\rule{110mm}{0.4pt} \\ &\textbf{for} \: t=1 \: \textbf{to} \: \ldots \: \textbf{do} \\ &\hspace{6mm}\textbf{if} \: \textit{maximize}: \\ &\hspace{12mm}g_t \leftarrow -\nabla_{\theta} f_t (\theta_{t-1}) \\ &\hspace{6mm}\textbf{else} \\ &\hspace{12mm}g_t \leftarrow \nabla_{\theta} f_t (\theta_{t-1}) \\ &\hspace{6mm} \theta_t \leftarrow \theta_{t-1} \\ &\hspace{6mm} \textbf{if} \: \lambda \neq 0 \\ &\hspace{12mm}\textbf{if} \: \textit{decoupled\_weight\_decay} \\ &\hspace{18mm} \theta_t \leftarrow \theta_{t} - \gamma \lambda \theta_{t} \\ &\hspace{12mm}\textbf{else} \\ &\hspace{18mm} g_t \leftarrow g_t + \lambda \theta_{t} \\ &\hspace{6mm}m_t \leftarrow \beta_1 m_{t-1} + (1 - \beta_1) g_t \\ &\hspace{6mm}v_t \leftarrow \beta_2 v_{t-1} + (1-\beta_2) g^2_t \\ &\hspace{6mm}\widehat{m_t} \leftarrow m_t/\big(1-\beta_1^t \big) \\ &\hspace{6mm}\rho_t \leftarrow \rho_{\infty} - 2 t \beta^t_2 /\big(1-\beta_2^t \big) \\[0.1ex] &\hspace{6mm}\textbf{if} \: \rho_t > 5 \\ &\hspace{12mm} l_t \leftarrow \frac{\sqrt{ (1-\beta^t_2) }}{ \sqrt{v_t} +\epsilon } \\ &\hspace{12mm} r_t \leftarrow \sqrt{\frac{(\rho_t-4)(\rho_t-2)\rho_{\infty}}{(\rho_{\infty}-4)(\rho_{\infty}-2) \rho_t}} \\ &\hspace{12mm}\theta_t \leftarrow \theta_t - \gamma \widehat{m_t} r_t l_t \\ &\hspace{6mm}\textbf{else} \\ &\hspace{12mm}\theta_t \leftarrow \theta_t - \gamma \widehat{m_t} \\ &\rule{110mm}{0.4pt} \\[-1.ex] &\bf{return} \: \theta_t \\[-1.ex] &\rule{110mm}{0.4pt} \\[-1.ex] \end{aligned}\end{split}\]

For further details regarding the algorithm we refer to On the variance of the adaptive learning rate and beyond.

This implementation provides an option to use either the original weight_decay implementation as in Adam (where the weight_decay is applied to the gradient) or the one from AdamW (where weight_decay is applied to the weight) through the decoupled_weight_decay option. When decoupled_weight_decay is set to False (default), it uses the original Adam style weight decay, otherwise, it uses the AdamW style which corresponds more closely to the author’s implementation in the RAdam paper. Further information about decoupled weight decay can be found in Decoupled Weight Decay Regularization.
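The branch on \(\rho_t > 5\) in the pseudocode decides when the variance rectification term \(r_t\) is applied. The sketch below computes \(\rho_t\) and \(r_t\) in plain Python from the formulas above; the helper names `radam_rho` and `radam_rectification` are hypothetical and exist only for this illustration.

```python
import math


def radam_rho(t, beta2=0.999):
    """Length of the approximated SMA, rho_t, as defined in the pseudocode."""
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    return rho_inf - 2.0 * t * beta2 ** t / (1.0 - beta2 ** t)


def radam_rectification(t, beta2=0.999):
    """Return r_t when rho_t > 5, otherwise None (the unrectified branch)."""
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    rho_t = radam_rho(t, beta2)
    if rho_t <= 5:
        return None  # the update falls back to theta - lr * m_hat
    return math.sqrt(
        ((rho_t - 4) * (rho_t - 2) * rho_inf)
        / ((rho_inf - 4) * (rho_inf - 2) * rho_t)
    )
```

With the default beta2 = 0.999, \(\rho_t\) stays slightly below t for small t, so in this sketch the rectified branch is first taken at t = 6, and \(r_t\) approaches 1 as training proceeds.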

Parameters:
  • params (iterable) – iterable of parameters or named_parameters to optimize or iterable of dicts defining parameter groups. When using named_parameters, all parameters in all groups should be named

  • lr (float, Tensor, optional) – learning rate (default: 1e-3)

  • betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))

  • eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

  • decoupled_weight_decay (bool, optional) – whether to decouple the weight decay as in AdamW to obtain RAdamW. If True, the algorithm does not accumulate weight decay in the momentum nor variance. (default: False)

  • foreach (bool, optional) – whether foreach implementation of optimizer is used. If unspecified by the user (so foreach is None), we will try to use foreach over the for-loop implementation on CUDA, since it is usually significantly more performant. Note that the foreach implementation uses ~ sizeof(params) more peak memory than the for-loop version due to the intermediates being a tensorlist vs just one tensor. If memory is prohibitive, batch fewer parameters through the optimizer at a time or switch this flag to False (default: None)

  • maximize (bool, optional) – maximize the objective with respect to the params, instead of minimizing (default: False)

  • capturable (bool, optional) – whether this instance is safe to capture in a CUDA graph. Passing True can impair ungraphed performance, so if you don’t intend to graph capture this instance, leave it False (default: False)

  • differentiable (bool, optional) – whether autograd should occur through the optimizer step in training. Otherwise, the step() function runs in a torch.no_grad() context. Setting to True can impair performance, so leave it False if you don’t intend to run autograd through this instance (default: False)

step(closure=None)

Perform a single optimization step.

Parameters:

closure (Callable, optional) – A closure that reevaluates the model and returns the loss.

class RMSprop(params: Iterable[Tensor] | Iterable[dict[str, Any]] | Iterable[tuple[str, Tensor]], lr: float | Tensor = 0.01, alpha: float = 0.99, eps: float = 1e-08, weight_decay: float = 0, momentum: float = 0, centered: bool = False, capturable: bool = False, foreach: bool | None = None, maximize: bool = False, differentiable: bool = False)

Implements RMSprop algorithm.

\[\begin{split}\begin{aligned} &\rule{110mm}{0.4pt} \\ &\textbf{input} : \alpha \text{ (alpha)}, \: \gamma \text{ (lr)}, \: \theta_0 \text{ (params)}, \: f(\theta) \text{ (objective)} \\ &\hspace{13mm} \lambda \text{ (weight decay)},\: \mu \text{ (momentum)}, \: centered, \: \epsilon \text{ (epsilon)} \\ &\textbf{initialize} : v_0 \leftarrow 0 \text{ (square average)}, \: \textbf{b}_0 \leftarrow 0 \text{ (buffer)}, \: g^{ave}_0 \leftarrow 0 \\[-1.ex] &\rule{110mm}{0.4pt} \\ &\textbf{for} \: t=1 \: \textbf{to} \: \ldots \: \textbf{do} \\ &\hspace{5mm}g_t \leftarrow \nabla_{\theta} f_t (\theta_{t-1}) \\ &\hspace{5mm}if \: \lambda \neq 0 \\ &\hspace{10mm} g_t \leftarrow g_t + \lambda \theta_{t-1} \\ &\hspace{5mm}v_t \leftarrow \alpha v_{t-1} + (1 - \alpha) g^2_t \hspace{8mm} \\ &\hspace{5mm} \tilde{v_t} \leftarrow v_t \\ &\hspace{5mm}if \: centered \\ &\hspace{10mm} g^{ave}_t \leftarrow g^{ave}_{t-1} \alpha + (1-\alpha) g_t \\ &\hspace{10mm} \tilde{v_t} \leftarrow \tilde{v_t} - \big(g^{ave}_{t} \big)^2 \\ &\hspace{5mm}if \: \mu > 0 \\ &\hspace{10mm} \textbf{b}_t\leftarrow \mu \textbf{b}_{t-1} + g_t/ \big(\sqrt{\tilde{v_t}} + \epsilon \big) \\ &\hspace{10mm} \theta_t \leftarrow \theta_{t-1} - \gamma \textbf{b}_t \\ &\hspace{5mm} else \\ &\hspace{10mm}\theta_t \leftarrow \theta_{t-1} - \gamma g_t/ \big(\sqrt{\tilde{v_t}} + \epsilon \big) \hspace{3mm} \\ &\rule{110mm}{0.4pt} \\[-1.ex] &\bf{return} \: \theta_t \\[-1.ex] &\rule{110mm}{0.4pt} \\[-1.ex] \end{aligned}\end{split}\]

For further details regarding the algorithm we refer to the lecture notes by G. Hinton, and for the centered version to Generating Sequences With Recurrent Neural Networks. The implementation here takes the square root of the gradient average before adding epsilon (note that TensorFlow interchanges these two operations). The effective learning rate is thus \(\gamma/(\sqrt{v} + \epsilon)\) where \(\gamma\) is the scheduled learning rate and \(v\) is the weighted moving average of the squared gradient.
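As a rough illustration of the update, and in particular of epsilon being added after the square root, here is a scalar sketch of one RMSprop step in plain Python. `rmsprop_step` is a hypothetical helper, and the `max(..., 0.0)` guard is an addition of this sketch to protect against floating-point rounding in the centered case; it is not part of the pseudocode.

```python
import math


def rmsprop_step(theta, grad, v, buf=0.0, g_avg=0.0, lr=1e-2, alpha=0.99,
                 eps=1e-8, weight_decay=0.0, momentum=0.0, centered=False):
    """One RMSprop update for a single scalar parameter (illustrative only)."""
    if weight_decay != 0:
        grad = grad + weight_decay * theta
    v = alpha * v + (1 - alpha) * grad ** 2        # running average of g^2
    v_tilde = v
    if centered:
        g_avg = alpha * g_avg + (1 - alpha) * grad
        v_tilde = max(v_tilde - g_avg ** 2, 0.0)   # rounding guard (sketch only)
    denom = math.sqrt(v_tilde) + eps               # sqrt first, then add eps
    if momentum > 0:
        buf = momentum * buf + grad / denom
        theta = theta - lr * buf
    else:
        theta = theta - lr * grad / denom
    return theta, v, buf, g_avg


# First step from theta = 0 with gradient 2 and the default alpha = 0.99.
theta, v, buf, g_avg = rmsprop_step(0.0, 2.0, v=0.0)
```

On the first step v is only (1 - alpha) times the squared gradient, so the effective step is about lr / sqrt(1 - alpha), ten times lr with the defaults.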

Parameters:
  • params (iterable) – iterable of parameters or named_parameters to optimize or iterable of dicts defining parameter groups. When using named_parameters, all parameters in all groups should be named

  • lr (float, Tensor, optional) – learning rate (default: 1e-2)

  • alpha (float, optional) – smoothing constant (default: 0.99)

  • eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

  • momentum (float, optional) – momentum factor (default: 0)

  • centered (bool, optional) – if True, compute the centered RMSProp, in which the gradient is normalized by an estimate of its variance (default: False)

  • capturable (bool, optional) – whether this instance is safe to capture in a CUDA graph. Passing True can impair ungraphed performance, so if you don’t intend to graph capture this instance, leave it False (default: False)

  • foreach (bool, optional) – whether foreach implementation of optimizer is used. If unspecified by the user (so foreach is None), we will try to use foreach over the for-loop implementation on CUDA, since it is usually significantly more performant. Note that the foreach implementation uses ~ sizeof(params) more peak memory than the for-loop version due to the intermediates being a tensorlist vs just one tensor. If memory is prohibitive, batch fewer parameters through the optimizer at a time or switch this flag to False (default: None)

  • maximize (bool, optional) – maximize the objective with respect to the params, instead of minimizing (default: False)

  • differentiable (bool, optional) – whether autograd should occur through the optimizer step in training. Otherwise, the step() function runs in a torch.no_grad() context. Setting to True can impair performance, so leave it False if you don’t intend to run autograd through this instance (default: False)

step(closure=None)

Perform a single optimization step.

Parameters:

closure (Callable, optional) – A closure that reevaluates the model and returns the loss.

class Rprop(params: Iterable[Tensor] | Iterable[dict[str, Any]] | Iterable[tuple[str, Tensor]], lr: float | Tensor = 0.01, etas: tuple[float, float] = (0.5, 1.2), step_sizes: tuple[float, float] = (1e-06, 50), *, capturable: bool = False, foreach: bool | None = None, maximize: bool = False, differentiable: bool = False)

Implements the resilient backpropagation algorithm.

\[\begin{split}\begin{aligned} &\rule{110mm}{0.4pt} \\ &\textbf{input} : \theta_0 \in \mathbf{R}^d \text{ (params)},f(\theta) \text{ (objective)}, \\ &\hspace{13mm} \eta_{+/-} \text{ (etaplus, etaminus)}, \Gamma_{max/min} \text{ (step sizes)} \\ &\textbf{initialize} : g^0_{prev} \leftarrow 0, \: \eta_0 \leftarrow \text{lr (learning rate)} \\ &\rule{110mm}{0.4pt} \\ &\textbf{for} \: t=1 \: \textbf{to} \: \ldots \: \textbf{do} \\ &\hspace{5mm}g_t \leftarrow \nabla_{\theta} f_t (\theta_{t-1}) \\ &\hspace{5mm} \textbf{for} \text{ } i = 0, 1, \ldots, d-1 \: \mathbf{do} \\ &\hspace{10mm} \textbf{if} \: g^i_{prev} g^i_t > 0 \\ &\hspace{15mm} \eta^i_t \leftarrow \mathrm{min}(\eta^i_{t-1} \eta_{+}, \Gamma_{max}) \\ &\hspace{10mm} \textbf{else if} \: g^i_{prev} g^i_t < 0 \\ &\hspace{15mm} \eta^i_t \leftarrow \mathrm{max}(\eta^i_{t-1} \eta_{-}, \Gamma_{min}) \\ &\hspace{15mm} g^i_t \leftarrow 0 \\ &\hspace{10mm} \textbf{else} \: \\ &\hspace{15mm} \eta^i_t \leftarrow \eta^i_{t-1} \\ &\hspace{5mm}\theta_t \leftarrow \theta_{t-1}- \eta_t \mathrm{sign}(g_t) \\ &\hspace{5mm}g_{prev} \leftarrow g_t \\ &\rule{110mm}{0.4pt} \\[-1.ex] &\bf{return} \: \theta_t \\[-1.ex] &\rule{110mm}{0.4pt} \\[-1.ex] \end{aligned}\end{split}\]

For further details regarding the algorithm we refer to the paper A Direct Adaptive Method for Faster Backpropagation Learning: The RPROP Algorithm.
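The per-element rule in the pseudocode can be sketched in plain Python over lists. `rprop_step` is a hypothetical helper for illustration; it treats `etas` as (etaminus, etaplus) and `step_sizes` as (minimum, maximum), matching the defaults in the signature.

```python
def rprop_step(theta, grad, prev_grad, step, etas=(0.5, 1.2),
               step_sizes=(1e-6, 50.0)):
    """One Rprop update over lists of scalars (illustrative only)."""
    eta_minus, eta_plus = etas
    step_min, step_max = step_sizes
    new_theta, new_grad, new_step = [], [], []
    for p, g, g_prev, s in zip(theta, grad, prev_grad, step):
        prod = g_prev * g
        if prod > 0:                       # same sign: grow the step size
            s = min(s * eta_plus, step_max)
        elif prod < 0:                     # sign flip: shrink and skip this update
            s = max(s * eta_minus, step_min)
            g = 0.0
        sign = (g > 0) - (g < 0)
        new_theta.append(p - s * sign)     # only the sign of the gradient is used
        new_grad.append(g)
        new_step.append(s)
    return new_theta, new_grad, new_step


theta, prev, step = rprop_step(
    theta=[0.0, 0.0], grad=[1.0, -1.0], prev_grad=[1.0, 1.0], step=[0.01, 0.01]
)
```

In this call the first element keeps its gradient sign, so its step size grows by etaplus; the second flips sign, so its step size shrinks by etaminus and the parameter is left unchanged for this step.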

Parameters:
  • params (iterable) – iterable of parameters or named_parameters to optimize or iterable of dicts defining parameter groups. When using named_parameters, all parameters in all groups should be named

  • lr (float, Tensor, optional) – learning rate (default: 1e-2)

  • etas (Tuple[float, float], optional) – pair of (etaminus, etaplus), the multiplicative decrease and increase factors (default: (0.5, 1.2))

  • step_sizes (Tuple[float, float], optional) – a pair of minimal and maximal allowed step sizes (default: (1e-6, 50))

  • capturable (bool, optional) – whether this instance is safe to capture in a CUDA graph. Passing True can impair ungraphed performance, so if you don’t intend to graph capture this instance, leave it False (default: False)

  • foreach (bool, optional) – whether foreach implementation of optimizer is used. If unspecified by the user (so foreach is None), we will try to use foreach over the for-loop implementation on CUDA, since it is usually significantly more performant. Note that the foreach implementation uses ~ sizeof(params) more peak memory than the for-loop version due to the intermediates being a tensorlist vs just one tensor. If memory is prohibitive, batch fewer parameters through the optimizer at a time or switch this flag to False (default: None)

  • maximize (bool, optional) – maximize the objective with respect to the params, instead of minimizing (default: False)

  • differentiable (bool, optional) – whether autograd should occur through the optimizer step in training. Otherwise, the step() function runs in a torch.no_grad() context. Setting to True can impair performance, so leave it False if you don’t intend to run autograd through this instance (default: False)

step(closure=None)

Perform a single optimization step.

Parameters:

closure (Callable, optional) – A closure that reevaluates the model and returns the loss.

class SGD(params: Iterable[Tensor] | Iterable[dict[str, Any]] | Iterable[tuple[str, Tensor]], lr: float | Tensor = 0.001, momentum: float = 0, dampening: float = 0, weight_decay: float | Tensor = 0, nesterov: bool = False, *, maximize: bool = False, foreach: bool | None = None, differentiable: bool = False, fused: bool | None = None)

Implements stochastic gradient descent (optionally with momentum).

\[\begin{split}\begin{aligned} &\rule{110mm}{0.4pt} \\ &\textbf{input} : \gamma \text{ (lr)}, \: \theta_0 \text{ (params)}, \: f(\theta) \text{ (objective)}, \: \lambda \text{ (weight decay)}, \\ &\hspace{13mm} \:\mu \text{ (momentum)}, \:\tau \text{ (dampening)}, \:\textit{ nesterov,}\:\textit{ maximize} \\[-1.ex] &\rule{110mm}{0.4pt} \\ &\textbf{for} \: t=1 \: \textbf{to} \: \ldots \: \textbf{do} \\ &\hspace{5mm}g_t \leftarrow \nabla_{\theta} f_t (\theta_{t-1}) \\ &\hspace{5mm}\textbf{if} \: \lambda \neq 0 \\ &\hspace{10mm} g_t \leftarrow g_t + \lambda \theta_{t-1} \\ &\hspace{5mm}\textbf{if} \: \mu \neq 0 \\ &\hspace{10mm}\textbf{if} \: t > 1 \\ &\hspace{15mm} \textbf{b}_t \leftarrow \mu \textbf{b}_{t-1} + (1-\tau) g_t \\ &\hspace{10mm}\textbf{else} \\ &\hspace{15mm} \textbf{b}_t \leftarrow g_t \\ &\hspace{10mm}\textbf{if} \: \textit{nesterov} \\ &\hspace{15mm} g_t \leftarrow g_{t} + \mu \textbf{b}_t \\ &\hspace{10mm}\textbf{else} \\[-1.ex] &\hspace{15mm} g_t \leftarrow \textbf{b}_t \\ &\hspace{5mm}\textbf{if} \: \textit{maximize} \\ &\hspace{10mm}\theta_t \leftarrow \theta_{t-1} + \gamma g_t \\[-1.ex] &\hspace{5mm}\textbf{else} \\[-1.ex] &\hspace{10mm}\theta_t \leftarrow \theta_{t-1} - \gamma g_t \\[-1.ex] &\rule{110mm}{0.4pt} \\[-1.ex] &\bf{return} \: \theta_t \\[-1.ex] &\rule{110mm}{0.4pt} \\[-1.ex] \end{aligned}\end{split}\]

Nesterov momentum is based on the formula from On the importance of initialization and momentum in deep learning.

Parameters:
  • params (iterable) – iterable of parameters or named_parameters to optimize or iterable of dicts defining parameter groups. When using named_parameters, all parameters in all groups should be named

  • lr (float, Tensor, optional) – learning rate (default: 1e-3)

  • momentum (float, optional) – momentum factor (default: 0)

  • dampening (float, optional) – dampening for momentum (default: 0)

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

  • nesterov (bool, optional) – enables Nesterov momentum. Only applicable when momentum is non-zero. (default: False)

  • maximize (bool, optional) – maximize the objective with respect to the params, instead of minimizing (default: False)

  • foreach (bool, optional) – whether foreach implementation of optimizer is used. If unspecified by the user (so foreach is None), we will try to use foreach over the for-loop implementation on CUDA, since it is usually significantly more performant. Note that the foreach implementation uses ~ sizeof(params) more peak memory than the for-loop version due to the intermediates being a tensorlist vs just one tensor. If memory is prohibitive, batch fewer parameters through the optimizer at a time or switch this flag to False (default: None)

  • differentiable (bool, optional) – whether autograd should occur through the optimizer step in training. Otherwise, the step() function runs in a torch.no_grad() context. Setting to True can impair performance, so leave it False if you don’t intend to run autograd through this instance (default: False)

  • fused (bool, optional) – whether the fused implementation is used. Currently, torch.float64, torch.float32, torch.float16, and torch.bfloat16 are supported. (default: None)

Note

The foreach and fused implementations are typically faster than the for-loop, single-tensor implementation, with fused being theoretically fastest with both vertical and horizontal fusion. As such, if the user has not specified either flag (i.e., when foreach = fused = None), we will attempt defaulting to the foreach implementation when the tensors are all on CUDA. Why not fused? Since the fused implementation is relatively new, we want to give it sufficient bake-in time. To specify fused, pass True for fused. To force running the for-loop implementation, pass False for either foreach or fused.

Example

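>>> # `model`, `input`, `target` and `loss_fn` are assumed to be defined by the caller.
>>> optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
>>> optimizer.zero_grad()  # clear gradients left over from the previous step
>>> loss_fn(model(input), target).backward()  # compute fresh gradients
>>> optimizer.step()  # apply the SGD update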

Note

The implementation of SGD with Momentum/Nesterov subtly differs from Sutskever et al. and implementations in some other frameworks.

Considering the specific case of Momentum, the update can be written as

\[\begin{split}\begin{aligned} v_{t+1} & = \mu * v_{t} + g_{t+1}, \\ p_{t+1} & = p_{t} - \text{lr} * v_{t+1}, \end{aligned}\end{split}\]

where \(p\), \(g\), \(v\) and \(\mu\) denote the parameters, gradient, velocity, and momentum respectively.

This is in contrast to Sutskever et al. and other frameworks which employ an update of the form

\[\begin{split}\begin{aligned} v_{t+1} & = \mu * v_{t} + \text{lr} * g_{t+1}, \\ p_{t+1} & = p_{t} - v_{t+1}. \end{aligned}\end{split}\]

The Nesterov version is analogously modified.

Moreover, the initial value of the momentum buffer is set to the gradient value at the first step. This is in contrast to some other frameworks that initialize it to all zeros.

step(closure=None)

Perform a single optimization step.

Parameters:

closure (Callable, optional) – A closure that reevaluates the model and returns the loss.

class SparseAdam(params: Iterable[Tensor] | Iterable[dict[str, Any]] | Iterable[tuple[str, Tensor]], lr: float | Tensor = 0.001, betas: tuple[float, float] = (0.9, 0.999), eps: float = 1e-08, maximize: bool = False)

SparseAdam implements a masked version of the Adam algorithm suitable for sparse gradients. Currently, due to implementation constraints (explained below), SparseAdam is only intended for a narrow subset of use cases, specifically parameters of a dense layout with gradients of a sparse layout. This occurs in a special case where the module backwards produces grads already in a sparse layout. One example NN module that behaves as such is nn.Embedding(sparse=True).

SparseAdam approximates the Adam algorithm by masking out the parameter and moment updates corresponding to the zero values in the gradients. Whereas the Adam algorithm will update the first moment, the second moment, and the parameters based on all values of the gradients, SparseAdam only updates the moments and parameters corresponding to the non-zero values of the gradients.

A simplified way of thinking about the intended implementation is as such:

  1. Create a mask of the non-zero values in the sparse gradients. For example, if your gradient looks like [0, 5, 0, 0, 9], the mask would be [0, 1, 0, 0, 1].

  2. Apply this mask over the running moments and do computation on only the non-zero values.

  3. Apply this mask over the parameters and only apply an update on non-zero values.

In actuality, we use sparse layout Tensors to optimize this approximation, which means the more gradients that are masked by not being materialized, the more performant the optimization. Since we rely on using sparse layout tensors, we infer that any materialized value in the sparse layout is non-zero and we do NOT actually verify that all values are not zero! It is important to not conflate a semantically sparse tensor (a tensor where many of its values are zeros) with a sparse layout tensor (a tensor where .is_sparse returns True). The SparseAdam approximation is intended for semantically sparse tensors and the sparse layout is only an implementation detail. A clearer implementation would be to use MaskedTensors, but those are experimental.
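The three simplified steps above can be sketched with dense Python lists. This is an illustration of the masking idea only, not the sparse-layout implementation; `sparse_adam_step` is a hypothetical helper, and the Adam-style bias correction with a shared step counter `t` is an assumption of this sketch.

```python
import math


def sparse_adam_step(params, grads, m, v, t, lr=1e-3, betas=(0.9, 0.999),
                     eps=1e-8):
    """Masked Adam update on plain lists; modifies params, m and v in place."""
    beta1, beta2 = betas
    # Step 1: mask of the non-zero gradient entries.
    mask = [1 if g != 0 else 0 for g in grads]
    for i, keep in enumerate(mask):
        if not keep:
            continue  # masked entries: moments and parameter stay untouched
        g = grads[i]
        # Step 2: update the running moments only where the mask is set.
        m[i] = beta1 * m[i] + (1 - beta1) * g
        v[i] = beta2 * v[i] + (1 - beta2) * g * g
        m_hat = m[i] / (1 - beta1 ** t)
        v_hat = v[i] / (1 - beta2 ** t)
        # Step 3: update the parameter only where the mask is set.
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return mask


params = [1.0] * 5
m, v = [0.0] * 5, [0.0] * 5
mask = sparse_adam_step(params, [0, 5, 0, 0, 9], m, v, t=1)
```

Using the gradient from the text, [0, 5, 0, 0, 9], only the second and fifth parameters and their moments change; a dense Adam step would also decay the moments at the other positions.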

Note

If you suspect your gradients are semantically sparse (but do not have sparse layout), this variant may not be the best for you. Ideally, you want to avoid materializing anything that is suspected to be sparse in the first place, since needing to convert all your grads from dense layout to sparse layout may outweigh the performance gain. Here, using Adam may be the best alternative, unless you can easily rig up your module to output sparse grads similar to nn.Embedding(sparse=True). If you insist on converting your grads, you can do so by manually overriding your parameters’ .grad fields with their sparse equivalents before calling .step().

Parameters:
  • params (iterable) – iterable of parameters or named_parameters to optimize or iterable of dicts defining parameter groups. When using named_parameters, all parameters in all groups should be named

  • lr (float, Tensor, optional) – learning rate (default: 1e-3)

  • betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))

  • eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)

  • maximize (bool, optional) – maximize the objective with respect to the params, instead of minimizing (default: False)

step(closure=None)

Perform a single optimization step.

Parameters:

closure (Callable, optional) – A closure that reevaluates the model and returns the loss.

Schedulers

class accmt.hyperparameters.Scheduler
Constant(last_epoch: int = -1)

Create a schedule with a constant learning rate, using the learning rate set in optimizer.

Parameters:
  • optimizer (torch.optim.Optimizer) – The optimizer for which to schedule the learning rate.

  • last_epoch (int, optional, defaults to -1) – The index of the last epoch when resuming training.

Returns:

torch.optim.lr_scheduler.LambdaLR with the appropriate schedule.

ConstantWithWarmup(num_warmup_steps: int, last_epoch: int = -1)

Create a schedule with a constant learning rate preceded by a warmup period during which the learning rate increases linearly between 0 and the initial lr set in the optimizer.

Parameters:
  • optimizer (torch.optim.Optimizer) – The optimizer for which to schedule the learning rate.

  • num_warmup_steps (int) – The number of steps for the warmup phase.

  • last_epoch (int, optional, defaults to -1) – The index of the last epoch when resuming training.

Returns:

torch.optim.lr_scheduler.LambdaLR with the appropriate schedule.
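Since the returned object is a LambdaLR, the schedule amounts to a multiplier applied to the optimizer's initial lr at each step. A minimal sketch of such a multiplier for the warmup case, in plain Python, might look like the following; `warmup_multiplier` is a hypothetical name, and the exact function used internally is not shown in this reference.

```python
def warmup_multiplier(step, num_warmup_steps):
    """Factor applied to the initial lr: linear ramp from 0 to 1, then constant."""
    if step < num_warmup_steps:
        # max(1, ...) avoids division by zero when no warmup is requested.
        return step / max(1, num_warmup_steps)
    return 1.0


# Multipliers for a 10-step warmup at a few selected steps.
factors = [warmup_multiplier(s, 10) for s in (0, 5, 10, 100)]
```

Under this sketch the effective learning rate at step s is the initial lr times `warmup_multiplier(s, num_warmup_steps)`.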

class CosineAnnealingLR(optimizer: Optimizer, T_max: int, eta_min: float = 0.0, last_epoch: int = -1)

Set the learning rate of each parameter group using a cosine annealing schedule.

The \(\eta_{max}\) is set to the initial lr and \(T_{cur}\) is the number of epochs since the last restart in SGDR:

\[\begin{split}\begin{aligned} \eta_t & = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})\left(1 + \cos\left(\frac{T_{cur}}{T_{max}}\pi\right)\right), & T_{cur} \neq (2k+1)T_{max}; \\ \eta_{t+1} & = \eta_{t} + \frac{1}{2}(\eta_{max} - \eta_{min}) \left(1 - \cos\left(\frac{1}{T_{max}}\pi\right)\right), & T_{cur} = (2k+1)T_{max}. \end{aligned}\end{split}\]

When last_epoch=-1, sets initial lr as lr. Notice that because the schedule is defined recursively, the learning rate can be simultaneously modified outside this scheduler by other operators. If the learning rate is set solely by this scheduler, the learning rate at each step becomes:

\[\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})\left(1 + \cos\left(\frac{T_{cur}}{T_{max}}\pi\right)\right)\]

It has been proposed in SGDR: Stochastic Gradient Descent with Warm Restarts. Note that this only implements the cosine annealing part of SGDR, and not the restarts.

Parameters:
  • optimizer (Optimizer) – Wrapped optimizer.

  • T_max (int) – Maximum number of iterations.

  • eta_min (float) – Minimum learning rate. Default: 0.

  • last_epoch (int) – The index of last epoch. Default: -1.
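The closed-form expression above can be evaluated directly. The following is a minimal sketch in plain Python, assuming the learning rate is set solely by this scheduler; cosine_annealing_lr is a hypothetical helper name, not a PyTorch or accmt function.

```python
import math


def cosine_annealing_lr(t_cur: int, t_max: int, eta_max: float, eta_min: float = 0.0) -> float:
    """Closed-form cosine annealing value at step t_cur (hypothetical helper)."""
    cosine = math.cos(math.pi * t_cur / t_max)
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + cosine)


# The value starts at eta_max, passes the midpoint at T_max / 2 and ends at eta_min.
for t in (0, 50, 100):
    print(t, cosine_annealing_lr(t, t_max=100, eta_max=0.1))
```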

get_lr()

Retrieve the learning rate of each parameter group.

class CosineAnnealingWarmRestarts(optimizer: Optimizer, T_0: int, T_mult: int = 1, eta_min: float = 0.0, last_epoch: int = -1)

Set the learning rate of each parameter group using a cosine annealing schedule.

The \(\eta_{max}\) is set to the initial lr, \(T_{cur}\) is the number of epochs since the last restart and \(T_{i}\) is the number of epochs between two warm restarts in SGDR:

\[\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})\left(1 + \cos\left(\frac{T_{cur}}{T_{i}}\pi\right)\right)\]

When \(T_{cur}=T_{i}\), set \(\eta_t = \eta_{min}\). When \(T_{cur}=0\) after restart, set \(\eta_t=\eta_{max}\).

It has been proposed in SGDR: Stochastic Gradient Descent with Warm Restarts.

Parameters:
  • optimizer (Optimizer) – Wrapped optimizer.

  • T_0 (int) – Number of iterations until the first restart.

  • T_mult (int, optional) – A factor by which \(T_{i}\) increases after a restart. Default: 1.

  • eta_min (float, optional) – Minimum learning rate. Default: 0.

  • last_epoch (int, optional) – The index of the last epoch. Default: -1.
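To illustrate how \(T_{cur}\) and \(T_{i}\) evolve across restarts, here is a minimal plain-Python sketch of the formula above. It assumes integer epochs and walks through the cycles one by one; warm_restart_lr is a hypothetical helper name, not the PyTorch implementation.

```python
import math


def warm_restart_lr(epoch: int, eta_max: float, T_0: int, T_mult: int = 1, eta_min: float = 0.0) -> float:
    """Learning rate at an integer epoch under cosine annealing with warm restarts."""
    T_i, T_cur = T_0, epoch
    # Skip over completed cycles; each new cycle is T_mult times longer.
    while T_cur >= T_i:
        T_cur -= T_i
        T_i *= T_mult
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * T_cur / T_i))


# With T_0=10 and T_mult=2 the restarts happen at epochs 10, 30, 70, ...
for epoch in (0, 5, 10, 20, 30):
    print(epoch, warm_restart_lr(epoch, eta_max=0.1, T_0=10, T_mult=2))
```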

get_lr()

Compute the initial learning rate.

step(epoch=None)

Step could be called after every batch update.

Example

>>> scheduler = CosineAnnealingWarmRestarts(optimizer, T_0, T_mult)
>>> iters = len(dataloader)
>>> for epoch in range(20):
...     for i, sample in enumerate(dataloader):
...         inputs, labels = sample['inputs'], sample['labels']
...         optimizer.zero_grad()
...         outputs = net(inputs)
...         loss = criterion(outputs, labels)
...         loss.backward()
...         optimizer.step()
...         scheduler.step(epoch + i / iters)

This function can be called in an interleaved way.

Example

>>> scheduler = CosineAnnealingWarmRestarts(optimizer, T_0, T_mult)
>>> for epoch in range(20):
...     scheduler.step()
>>> scheduler.step(26)
>>> scheduler.step()  # equivalent to scheduler.step(27), not scheduler.step(20)

CosineWithHardRestartsWithWarmup(num_warmup_steps: int, num_training_steps: int, num_cycles: int = 1, last_epoch: int = -1)

Create a schedule with a learning rate that decreases following the values of the cosine function between the initial lr set in the optimizer to 0, with several hard restarts, after a warmup period during which it increases linearly between 0 and the initial lr set in the optimizer.

Parameters:
  • optimizer (torch.optim.Optimizer) – The optimizer for which to schedule the learning rate.

  • num_warmup_steps (int) – The number of steps for the warmup phase.

  • num_training_steps (int) – The total number of training steps.

  • num_cycles (int, optional, defaults to 1) – The number of hard restarts to use.

  • last_epoch (int, optional, defaults to -1) – The index of the last epoch when resuming training.

Returns:

torch.optim.lr_scheduler.LambdaLR with the appropriate schedule.
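A possible form of the multiplier behind this schedule is sketched below in plain Python. It assumes the formulation used by the Hugging Face Transformers helper that this description matches; hard_restarts_factor is a hypothetical name and the code is an illustration, not the accmt source.

```python
import math


def hard_restarts_factor(step: int, num_warmup_steps: int, num_training_steps: int, num_cycles: int = 1) -> float:
    """Multiplier applied to the initial lr at a given step (hypothetical helper)."""
    if step < num_warmup_steps:
        # Linear warmup from 0 to 1.
        return step / max(1, num_warmup_steps)
    progress = (step - num_warmup_steps) / max(1, num_training_steps - num_warmup_steps)
    if progress >= 1.0:
        return 0.0
    # The modulo restarts the cosine decay at the start of every cycle.
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * ((num_cycles * progress) % 1.0))))


for step in (5, 10, 35, 60, 110):
    print(step, hard_restarts_factor(step, 10, 110, num_cycles=2))
```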

CosineWithWarmup(num_warmup_steps: int, num_training_steps: int, num_cycles: float = 0.5, last_epoch: int = -1)

Create a schedule with a learning rate that decreases following the values of the cosine function between the initial lr set in the optimizer to 0, after a warmup period during which it increases linearly between 0 and the initial lr set in the optimizer.

Parameters:
  • optimizer (torch.optim.Optimizer) – The optimizer for which to schedule the learning rate.

  • num_warmup_steps (int) – The number of steps for the warmup phase.

  • num_training_steps (int) – The total number of training steps.

  • num_cycles (float, optional, defaults to 0.5) – The number of waves in the cosine schedule (the default is to just decrease from the max value to 0 following a half-cosine).

  • last_epoch (int, optional, defaults to -1) – The index of the last epoch when resuming training.

Returns:

torch.optim.lr_scheduler.LambdaLR with the appropriate schedule.
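The effect of num_cycles can be seen in a small plain-Python sketch of the multiplier. It assumes the same formulation as the Hugging Face Transformers helper that this description matches; cosine_with_warmup_factor is a hypothetical name.

```python
import math


def cosine_with_warmup_factor(step: int, num_warmup_steps: int, num_training_steps: int, num_cycles: float = 0.5) -> float:
    """Multiplier applied to the initial lr at a given step (hypothetical helper)."""
    if step < num_warmup_steps:
        # Linear warmup from 0 to 1.
        return step / max(1, num_warmup_steps)
    progress = (step - num_warmup_steps) / max(1, num_training_steps - num_warmup_steps)
    # num_cycles=0.5 covers half a cosine wave: from 1 down to 0.
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * num_cycles * 2.0 * progress)))


for step in (5, 10, 60, 110):
    print(step, cosine_with_warmup_factor(step, 10, 110))
```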

class CyclicLR(optimizer: Optimizer, base_lr: float | list[float], max_lr: float | list[float], step_size_up: int = 2000, step_size_down: int | None = None, mode: Literal['triangular', 'triangular2', 'exp_range'] = 'triangular', gamma: float = 1.0, scale_fn: Callable[[float], float] | None = None, scale_mode: Literal['cycle', 'iterations'] = 'cycle', cycle_momentum: bool = True, base_momentum: float = 0.8, max_momentum: float = 0.9, last_epoch: int = -1)

Sets the learning rate of each parameter group according to cyclical learning rate policy (CLR).

The policy cycles the learning rate between two boundaries with a constant frequency, as detailed in the paper Cyclical Learning Rates for Training Neural Networks. The distance between the two boundaries can be scaled on a per-iteration or per-cycle basis.

Cyclical learning rate policy changes the learning rate after every batch. step should be called after a batch has been used for training.

This class has three built-in policies, as put forth in the paper:

  • “triangular”: A basic triangular cycle without amplitude scaling.

  • “triangular2”: A basic triangular cycle that scales initial amplitude by half each cycle.

  • “exp_range”: A cycle that scales initial amplitude by \(\text{gamma}^{\text{cycle iterations}}\) at each cycle iteration.

This implementation was adapted from the github repo: bckenstler/CLR
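The three built-in policies can be sketched in plain Python as follows. This illustration assumes step_size_up equals step_size_down (a symmetric cycle) and follows the formula from the CLR paper; cyclic_lr is a hypothetical helper name, not the PyTorch implementation.

```python
import math


def cyclic_lr(iteration: int, base_lr: float, max_lr: float, step_size: int,
              mode: str = "triangular", gamma: float = 1.0) -> float:
    """Learning rate at a batch index for a symmetric cycle (hypothetical helper)."""
    cycle = math.floor(1 + iteration / (2 * step_size))  # 1-based cycle number
    x = abs(iteration / step_size - 2 * cycle + 1)        # distance from the cycle peak
    if mode == "triangular":
        scale = 1.0
    elif mode == "triangular2":
        scale = 1.0 / (2 ** (cycle - 1))                  # halve the amplitude each cycle
    elif mode == "exp_range":
        scale = gamma ** iteration                        # decay the amplitude per iteration
    else:
        raise ValueError(f"unknown mode: {mode}")
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x) * scale


for it in (0, 2000, 4000, 6000):
    print(it, cyclic_lr(it, 0.01, 0.1, 2000, mode="triangular2"))
```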

Parameters:
  • optimizer (Optimizer) – Wrapped optimizer.

  • base_lr (float or list) – Initial learning rate which is the lower boundary in the cycle for each parameter group.

  • max_lr (float or list) – Upper learning rate boundaries in the cycle for each parameter group. Functionally, it defines the cycle amplitude (max_lr - base_lr). The lr at any cycle is the sum of base_lr and some scaling of the amplitude; therefore max_lr may not actually be reached depending on scaling function.

  • step_size_up (int) – Number of training iterations in the increasing half of a cycle. Default: 2000

  • step_size_down (int) – Number of training iterations in the decreasing half of a cycle. If step_size_down is None, it is set to step_size_up. Default: None

  • mode (str) – One of {triangular, triangular2, exp_range}. Values correspond to policies detailed above. If scale_fn is not None, this argument is ignored. Default: ‘triangular’

  • gamma (float) – Constant in ‘exp_range’ scaling function: gamma**(cycle iterations). Default: 1.0

  • scale_fn (function) – Custom scaling policy defined by a single argument lambda function, where 0 <= scale_fn(x) <= 1 for all x >= 0. If specified, then ‘mode’ is ignored. Default: None

  • scale_mode (str) – {‘cycle’, ‘iterations’}. Defines whether scale_fn is evaluated on cycle number or cycle iterations (training iterations since start of cycle). Default: ‘cycle’

  • cycle_momentum (bool) – If True, momentum is cycled inversely to learning rate between ‘base_momentum’ and ‘max_momentum’. Default: True

  • base_momentum (float or list) – Lower momentum boundaries in the cycle for each parameter group. Note that momentum is cycled inversely to learning rate; at the peak of a cycle, momentum is ‘base_momentum’ and learning rate is ‘max_lr’. Default: 0.8

  • max_momentum (float or list) – Upper momentum boundaries in the cycle for each parameter group. Functionally, it defines the cycle amplitude (max_momentum - base_momentum). The momentum at any cycle is the difference of max_momentum and some scaling of the amplitude; therefore base_momentum may not actually be reached depending on scaling function. Note that momentum is cycled inversely to learning rate; at the start of a cycle, momentum is ‘max_momentum’ and learning rate is ‘base_lr’. Default: 0.9

  • last_epoch (int) – The index of the last batch. This parameter is used when resuming a training job. Since step() should be invoked after each batch instead of after each epoch, this number represents the total number of batches computed, not the total number of epochs computed. When last_epoch=-1, the schedule is started from the beginning. Default: -1

Example

>>> optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
>>> scheduler = torch.optim.lr_scheduler.CyclicLR(optimizer, base_lr=0.01, max_lr=0.1)
>>> data_loader = torch.utils.data.DataLoader(...)
>>> for epoch in range(10):
...     for batch in data_loader:
...         train_batch(...)
...         scheduler.step()

get_lr()

Calculate the learning rate at batch index.

This function treats self.last_epoch as the last batch index.

If self.cycle_momentum is True, this function has a side effect of updating the optimizer’s momentum.

load_state_dict(state_dict)

Load the scheduler’s state.

scale_fn(x) → float

Get the scaling policy.

state_dict()

Return the state of the scheduler as a dict.

It contains an entry for every variable in self.__dict__ which is not the optimizer.

class ExponentialLR(optimizer: Optimizer, gamma: float, last_epoch: int = -1)

Decays the learning rate of each parameter group by gamma every epoch.

When last_epoch=-1, sets initial lr as lr.

Parameters:
  • optimizer (Optimizer) – Wrapped optimizer.

  • gamma (float) – Multiplicative factor of learning rate decay.

  • last_epoch (int) – The index of last epoch. Default: -1.

get_lr()

Compute the learning rate of each parameter group.

InverseSQRT(num_warmup_steps: int, timescale: int | None = None, last_epoch: int = -1)

Create a schedule with an inverse square-root learning rate, from the initial lr set in the optimizer, after a warmup period which increases lr linearly from 0 to the initial lr set in the optimizer.

Parameters:
  • optimizer (torch.optim.Optimizer) – The optimizer for which to schedule the learning rate.

  • num_warmup_steps (int) – The number of steps for the warmup phase.

  • timescale (int, optional, defaults to num_warmup_steps) – Time scale.

  • last_epoch (int, optional, defaults to -1) – The index of the last epoch when resuming training.

Returns:

torch.optim.lr_scheduler.LambdaLR with the appropriate schedule.
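One way to read this schedule is as the following multiplier, shown as a plain-Python sketch. It assumes the formulation used by the Hugging Face Transformers helper that this description matches; inverse_sqrt_factor is a hypothetical name, and the max(1, ...) guard for a zero warmup is a choice made for this sketch only.

```python
import math
from typing import Optional


def inverse_sqrt_factor(step: int, num_warmup_steps: int, timescale: Optional[int] = None) -> float:
    """Multiplier applied to the initial lr at a given step (hypothetical helper)."""
    if timescale is None:
        # Documented default: timescale = num_warmup_steps (guarded against 0 here).
        timescale = max(1, num_warmup_steps)
    if step < num_warmup_steps:
        # Linear warmup from 0 to 1.
        return step / max(1, num_warmup_steps)
    # The shift makes the decay start at exactly 1.0 when warmup ends.
    shift = timescale - num_warmup_steps
    return 1.0 / math.sqrt((step + shift) / timescale)


for step in (50, 100, 400, 900):
    print(step, inverse_sqrt_factor(step, num_warmup_steps=100))
```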

class LinearLR(optimizer: Optimizer, start_factor: float = 0.3333333333333333, end_factor: float = 1.0, total_iters: int = 5, last_epoch: int = -1)

Decays the learning rate of each parameter group by a small multiplicative factor that changes linearly.

The factor is applied until the number of epochs reaches a pre-defined milestone, total_iters. Notice that such decay can happen simultaneously with other changes to the learning rate from outside this scheduler. When last_epoch=-1, the initial lr is set to lr.

Parameters:
  • optimizer (Optimizer) – Wrapped optimizer.

  • start_factor (float) – The factor by which the learning rate is multiplied in the first epoch. The factor moves linearly towards end_factor over the following epochs. Default: 1./3.

  • end_factor (float) – The factor by which the learning rate is multiplied at the end of the linear change. Default: 1.0.

  • total_iters (int) – The number of iterations after which the multiplicative factor reaches end_factor. Default: 5.

  • last_epoch (int) – The index of the last epoch. Default: -1.

Example

>>> # Assuming optimizer uses lr = 0.05 for all groups
>>> # lr = 0.025    if epoch == 0
>>> # lr = 0.03125  if epoch == 1
>>> # lr = 0.0375   if epoch == 2
>>> # lr = 0.04375  if epoch == 3
>>> # lr = 0.05     if epoch >= 4
>>> scheduler = LinearLR(optimizer, start_factor=0.5, total_iters=4)
>>> for epoch in range(100):
...     train(...)
...     validate(...)
...     scheduler.step()

get_lr()

Compute the learning rate.

LinearWithWarmup(num_warmup_steps, num_training_steps, last_epoch=-1)

Create a schedule with a learning rate that decreases linearly from the initial lr set in the optimizer to 0, after a warmup period during which it increases linearly from 0 to the initial lr set in the optimizer.

Parameters:
  • optimizer (torch.optim.Optimizer) – The optimizer for which to schedule the learning rate.

  • num_warmup_steps (int) – The number of steps for the warmup phase.

  • num_training_steps (int) – The total number of training steps.

  • last_epoch (int, optional, defaults to -1) – The index of the last epoch when resuming training.

Returns:

torch.optim.lr_scheduler.LambdaLR with the appropriate schedule.
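The warmup and linear decay phases can be sketched as a single multiplier in plain Python. This assumes the formulation used by the Hugging Face Transformers helper that this description matches; linear_with_warmup_factor is a hypothetical name.

```python
def linear_with_warmup_factor(step: int, num_warmup_steps: int, num_training_steps: int) -> float:
    """Multiplier applied to the initial lr at a given step (hypothetical helper)."""
    if step < num_warmup_steps:
        # Linear warmup from 0 to 1.
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps, clamped at 0.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


for step in (5, 10, 60, 110, 200):
    print(step, linear_with_warmup_factor(step, 10, 110))
```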

class OneCycleLR(optimizer: Optimizer, max_lr: float | list[float], total_steps: int | None = None, epochs: int | None = None, steps_per_epoch: int | None = None, pct_start: float = 0.3, anneal_strategy: Literal['cos', 'linear'] = 'cos', cycle_momentum: bool = True, base_momentum: float | list[float] = 0.85, max_momentum: float | list[float] = 0.95, div_factor: float = 25.0, final_div_factor: float = 10000.0, three_phase: bool = False, last_epoch: int = -1)

Sets the learning rate of each parameter group according to the 1cycle learning rate policy.

The 1cycle policy anneals the learning rate from an initial learning rate to some maximum learning rate and then from that maximum learning rate to some minimum learning rate much lower than the initial learning rate. This policy was initially described in the paper Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates.

The 1cycle learning rate policy changes the learning rate after every batch. step should be called after a batch has been used for training.

This scheduler is not chainable.

Note also that the total number of steps in the cycle can be determined in one of two ways (listed in order of precedence):

  1. A value for total_steps is explicitly provided.

  2. A number of epochs (epochs) and a number of steps per epoch (steps_per_epoch) are provided. In this case, the number of total steps is inferred by total_steps = epochs * steps_per_epoch

You must either provide a value for total_steps or provide a value for both epochs and steps_per_epoch.

The default behaviour of this scheduler follows the fastai implementation of 1cycle, which claims that “unpublished work has shown even better results by using only two phases”. To mimic the behaviour of the original paper instead, set three_phase=True.

Parameters:
  • optimizer (Optimizer) – Wrapped optimizer.

  • max_lr (float or list) – Upper learning rate boundaries in the cycle for each parameter group.

  • total_steps (int) – The total number of steps in the cycle. Note that if a value is not provided here, then it must be inferred by providing a value for epochs and steps_per_epoch. Default: None

  • epochs (int) – The number of epochs to train for. This is used along with steps_per_epoch in order to infer the total number of steps in the cycle if a value for total_steps is not provided. Default: None

  • steps_per_epoch (int) – The number of steps per epoch to train for. This is used along with epochs in order to infer the total number of steps in the cycle if a value for total_steps is not provided. Default: None

  • pct_start (float) – The percentage of the cycle (in number of steps) spent increasing the learning rate. Default: 0.3

  • anneal_strategy (str) – {‘cos’, ‘linear’} Specifies the annealing strategy: “cos” for cosine annealing, “linear” for linear annealing. Default: ‘cos’

  • cycle_momentum (bool) – If True, momentum is cycled inversely to learning rate between ‘base_momentum’ and ‘max_momentum’. Default: True

  • base_momentum (float or list) – Lower momentum boundaries in the cycle for each parameter group. Note that momentum is cycled inversely to learning rate; at the peak of a cycle, momentum is ‘base_momentum’ and learning rate is ‘max_lr’. Default: 0.85

  • max_momentum (float or list) – Upper momentum boundaries in the cycle for each parameter group. Functionally, it defines the cycle amplitude (max_momentum - base_momentum). Note that momentum is cycled inversely to learning rate; at the start of a cycle, momentum is ‘max_momentum’ and learning rate is ‘base_lr’. Default: 0.95

  • div_factor (float) – Determines the initial learning rate via initial_lr = max_lr/div_factor. Default: 25

  • final_div_factor (float) – Determines the minimum learning rate via min_lr = initial_lr/final_div_factor. Default: 1e4

  • three_phase (bool) – If True, use a third phase of the schedule to annihilate the learning rate according to ‘final_div_factor’ instead of modifying the second phase (the first two phases will be symmetrical about the step indicated by ‘pct_start’).

  • last_epoch (int) – The index of the last batch. This parameter is used when resuming a training job. Since step() should be invoked after each batch instead of after each epoch, this number represents the total number of batches computed, not the total number of epochs computed. When last_epoch=-1, the schedule is started from the beginning. Default: -1
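The quantities derived from these parameters can be worked out by hand. The sketch below, in plain Python, applies the two precedence rules for the total number of steps and the formulas for initial_lr and min_lr given above; one_cycle_bounds is a hypothetical helper name, not a PyTorch function.

```python
from typing import Optional, Tuple


def one_cycle_bounds(max_lr: float, total_steps: Optional[int] = None,
                     epochs: Optional[int] = None, steps_per_epoch: Optional[int] = None,
                     div_factor: float = 25.0, final_div_factor: float = 1e4) -> Tuple[int, float, float]:
    """Return (total_steps, initial_lr, min_lr) for a 1cycle schedule (hypothetical helper)."""
    if total_steps is None:
        # Fall back to epochs * steps_per_epoch when total_steps is not given.
        if epochs is None or steps_per_epoch is None:
            raise ValueError("provide total_steps, or both epochs and steps_per_epoch")
        total_steps = epochs * steps_per_epoch
    initial_lr = max_lr / div_factor
    min_lr = initial_lr / final_div_factor
    return total_steps, initial_lr, min_lr


print(one_cycle_bounds(0.01, epochs=10, steps_per_epoch=100))
```

With max_lr=0.01 and the default factors, the schedule starts at 0.01 / 25 = 4e-4 and ends at 4e-4 / 1e4 = 4e-8.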

Example

>>> data_loader = torch.utils.data.DataLoader(...)
>>> optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
>>> scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=0.01, steps_per_epoch=len(data_loader), epochs=10)
>>> for epoch in range(10):
...     for batch in data_loader:
...         train_batch(...)
...         optimizer.step()
...         scheduler.step()

static _annealing_cos(start, end, pct)

Cosine anneal from start to end as pct goes from 0.0 to 1.0.

static _annealing_linear(start, end, pct)

Linearly anneal from start to end as pct goes from 0.0 to 1.0.

get_lr()

Compute the learning rate of each parameter group.

PolynomialDecayWithWarmup(num_warmup_steps, num_training_steps, lr_end=1e-07, power=1.0, last_epoch=-1)

Create a schedule with a learning rate that decreases as a polynomial decay from the initial lr set in the optimizer to end lr defined by lr_end, after a warmup period during which it increases linearly from 0 to the initial lr set in the optimizer.

Parameters:
  • optimizer (torch.optim.Optimizer) – The optimizer for which to schedule the learning rate.

  • num_warmup_steps (int) – The number of steps for the warmup phase.

  • num_training_steps (int) – The total number of training steps.

  • lr_end (float, optional, defaults to 1e-7) – The end LR.

  • power (float, optional, defaults to 1.0) – Power factor.

  • last_epoch (int, optional, defaults to -1) – The index of the last epoch when resuming training.

Note: power defaults to 1.0 as in the fairseq implementation, which in turn is based on the original BERT implementation at https://github.com/google-research/bert/blob/f39e881b169b9d53bea03d2d341b31707a6c052b/optimization.py#L37

Returns:

torch.optim.lr_scheduler.LambdaLR with the appropriate schedule.
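The roles of lr_end and power can be illustrated with a plain-Python sketch of the multiplier. It assumes the formulation used by the Hugging Face Transformers helper that this description matches; polynomial_decay_factor is a hypothetical name, and lr_init stands for the initial lr set in the optimizer.

```python
def polynomial_decay_factor(step: int, lr_init: float, num_warmup_steps: int, num_training_steps: int,
                            lr_end: float = 1e-7, power: float = 1.0) -> float:
    """Multiplier applied to lr_init at a given step (hypothetical helper)."""
    if lr_init <= lr_end:
        raise ValueError("lr_end must be smaller than the initial lr")
    if step < num_warmup_steps:
        # Linear warmup from 0 to 1.
        return step / max(1, num_warmup_steps)
    if step > num_training_steps:
        # After training ends the lr stays at lr_end.
        return lr_end / lr_init
    decay_steps = num_training_steps - num_warmup_steps
    pct_remaining = 1 - (step - num_warmup_steps) / decay_steps
    decayed_lr = (lr_init - lr_end) * pct_remaining ** power + lr_end
    return decayed_lr / lr_init


for step in (5, 10, 60, 110):
    print(step, polynomial_decay_factor(step, 1e-3, 10, 110))
```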

class StepLR(optimizer: Optimizer, step_size: int, gamma: float = 0.1, last_epoch: int = -1)

Decays the learning rate of each parameter group by gamma every step_size epochs.

Notice that such decay can happen simultaneously with other changes to the learning rate from outside this scheduler. When last_epoch=-1, sets initial lr as lr.

Parameters:
  • optimizer (Optimizer) – Wrapped optimizer.

  • step_size (int) – Period of learning rate decay.

  • gamma (float) – Multiplicative factor of learning rate decay. Default: 0.1.

  • last_epoch (int) – The index of last epoch. Default: -1.

Example

>>> # Assuming optimizer uses lr = 0.05 for all groups
>>> # lr = 0.05     if epoch < 30
>>> # lr = 0.005    if 30 <= epoch < 60
>>> # lr = 0.0005   if 60 <= epoch < 90
>>> # ...
>>> scheduler = StepLR(optimizer, step_size=30, gamma=0.1)
>>> for epoch in range(100):
...     train(...)
...     validate(...)
...     scheduler.step()

get_lr()

Compute the learning rate of each parameter group.
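When no other operator modifies the learning rate, the step decay has a simple closed form. The sketch below, in plain Python, reproduces the values listed in the example above; step_lr is a hypothetical helper name, not a PyTorch function.

```python
def step_lr(base_lr: float, epoch: int, step_size: int, gamma: float = 0.1) -> float:
    """Closed-form learning rate after `epoch` scheduler steps (hypothetical helper)."""
    # The lr is multiplied by gamma once for every completed block of step_size epochs.
    return base_lr * gamma ** (epoch // step_size)


for epoch in (0, 29, 30, 60, 90):
    print(epoch, step_lr(0.05, epoch, step_size=30))
```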

HyperParameterSearch

class accmt.hp_search.HyperParameterSearch(get_module_fn: Callable, train_dataset: Dataset, val_dataset: Dataset | list[Dataset] | dict[str, Dataset], metrics: list[Metric] | None = None, **trainer_kwargs)[source]
__init__(get_module_fn: Callable, train_dataset: Dataset, val_dataset: Dataset | list[Dataset] | dict[str, Dataset], metrics: list[Metric] | None = None, **trainer_kwargs)[source]

Initialize the hyperparameter search.

Parameters:
  • get_module_fn (Callable) – A function that returns an AcceleratorModule (basically, the model initialization). It does not take any arguments.

  • train_dataset (Dataset) – The training dataset.

  • val_dataset (Dataset or list[Dataset] or dict[str, Dataset]) – The validation dataset(s).

  • metrics (list, optional, defaults to None) – The metrics modules to evaluate. If not provided, the default metric will be used (valid_loss).

  • **trainer_kwargs – Additional keyword arguments to pass to the Trainer constructor.

optimize(best_metric_fn: Callable | None = None, direction: Literal['maximize', 'minimize'] = 'minimize', n_trials: int = 10)[source]

Optimize an objective function.

Parameters:
  • best_metric_fn (Callable, optional, defaults to None) – A function that receives a single argument, a dictionary of additional metrics, and returns the best metric as a float.

  • direction (Literal['maximize', 'minimize'], optional, defaults to 'minimize') – The direction in which the objective is optimized.

  • n_trials (int, optional, defaults to 10) – The number of trials to run.
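A best_metric_fn could look like the sketch below. The metric names "accuracy" and "f1" are hypothetical; the keys actually present in the dictionary depend on the metrics configured for the search, which this reference does not specify.

```python
def best_metric_fn(metrics: dict) -> float:
    """Collapse a dictionary of metrics into the single value to optimize."""
    # "accuracy" and "f1" are hypothetical keys used only for illustration.
    return 0.5 * (float(metrics["accuracy"]) + float(metrics["f1"]))


print(best_metric_fn({"accuracy": 0.8, "f1": 0.6}))
```

Since higher is better for such a score, a function like this would presumably be combined with direction='maximize'.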

set_parameters(*, train_batch_size: int | list[int] = 1, epochs: int | list[int] = 1, max_steps: int | list[int] = None, learning_rate: float | list[float] = [0.001, 1e-07], beta1: float | list[float] = 0.9, beta2: float | list[float] = 0.999, eps: float | list[float] = 1e-08, weight_decay: float | list[float] = 0.01, scheduler: Scheduler | dict | None = None, warmup_ratio: float | list[float] = 0.1, step_scheduler_per_epoch: bool = False, optimizer: Literal['AdamW', 'Adam'] = 'AdamW', eval_batch_size: int | None = None)[source]

Set the parameters for the hyperparameter search. Fixed values are represented as a single value. A range of parameters is represented as a list of values. A tuple of values means discrete values.

Parameters:
  • train_batch_size (int or list, optional, defaults to 1) – The batch size for the training set. This is a categorical argument, so Optuna will sample from the list of options.

  • epochs (int or list, optional, defaults to 1) – The number of epochs to train the model.

  • max_steps (int or list, optional, defaults to None) – The maximum number of steps to train the model. Drop-in replacement for epochs.

  • learning_rate (float or list, optional, defaults to [1e-3, 1e-7]) – The learning rate to use for the optimizer.

  • beta1 (float or list, optional, defaults to 0.9) – The beta1 parameter for the Adam optimizer.

  • beta2 (float or list, optional, defaults to 0.999) – The beta2 parameter for the Adam optimizer.

  • eps (float or list, optional, defaults to 1e-08) – The epsilon parameter for the Adam optimizer.

  • weight_decay (float or list, optional, defaults to 0.01) – The weight decay to use for the optimizer.

  • scheduler (dict or Scheduler, optional, defaults to None) – The learning rate scheduler to use for the optimizer. This is a dictionary containing the scheduler name as a key and the scheduler object as a value. Do not consider this argument as a range, but a list of options to choose from.

  • warmup_ratio (float or list, optional, defaults to 0.1) – The warmup ratio to use for the learning rate scheduler. Only used if scheduler is not None and is a warmup scheduler.

  • step_scheduler_per_epoch (bool, optional, defaults to False) – Whether to step the scheduler per epoch or per step. This is a fixed value.

  • optimizer (Literal["AdamW", "Adam"], optional, defaults to "AdamW") – The optimizer to use for the training (not a range, just a fixed value).

  • eval_batch_size (int, optional, defaults to None) – The batch size for the evaluation set. If not provided, the training batch size will be used. NOTE: This is not a hyperparameter, so it should be a fixed value.