Optimizers

DeepSpeed offers high-performance implementations of the Adam optimizer on CPU and the Lamb optimizer on GPU.

DeepSpeed CPU Adam

class deepspeed.ops.adam.DeepSpeedCPUAdam(model_params, lr=0.001, bias_correction=True, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False, adamw_mode=True)[source]

Fast vectorized implementation of two variations of the Adam optimizer on CPU: Adam and AdamW.

DeepSpeed CPU Adam(W) provides a 5x to 7x speedup over torch.optim.Adam(W). To use this optimizer, the model's master parameters (in FP32) must reside in CPU memory.
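The two variations differ only in how weight decay is applied: classic Adam folds the L2 penalty into the gradient, while AdamW decays the weights directly (decoupled weight decay). As an illustrative sketch of that difference (scalar math only, not DeepSpeed's vectorized kernel):

```python
import math

def adam_step(param, grad, m, v, t, lr=1e-3, betas=(0.9, 0.999),
              eps=1e-8, weight_decay=0.0, adamw_mode=True):
    """One scalar Adam/AdamW update at step t (illustrative sketch)."""
    beta1, beta2 = betas
    if not adamw_mode and weight_decay:
        grad = grad + weight_decay * param        # classic Adam: L2 term folded into the gradient
    m = beta1 * m + (1 - beta1) * grad            # first-moment running average
    v = beta2 * v + (1 - beta2) * grad * grad     # second-moment running average
    m_hat = m / (1 - beta1 ** t)                  # bias correction
    v_hat = v / (1 - beta2 ** t)
    if adamw_mode and weight_decay:
        param = param - lr * weight_decay * param # AdamW: decoupled weight decay
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v
```

With `weight_decay=0` the two modes coincide; with nonzero decay they diverge because AdamW's decay is not rescaled by the second-moment denominator.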

To train on a heterogeneous system that coordinates CPU and GPU, DeepSpeed offers the ZeRO-Offload technology, which efficiently offloads the optimizer states into CPU memory with minimal impact on training throughput. DeepSpeedCPUAdam plays an important role in minimizing the optimizer's latency overhead on CPU. Please refer to the ZeRO-Offload tutorial (https://www.deepspeed.ai/tutorials/zero-offload/) for more information on how to enable this technology.
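ZeRO-Offload is enabled through the DeepSpeed configuration JSON. A minimal sketch is shown below; the field names follow the ZeRO-Offload tutorial, but exact options vary by DeepSpeed version, so check the tutorial for your release:

```json
{
  "optimizer": {
    "type": "Adam",
    "params": { "lr": 0.001, "betas": [0.9, 0.999], "eps": 1e-8, "weight_decay": 0 }
  },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": { "device": "cpu" }
  }
}
```

With optimizer offload enabled, DeepSpeed substitutes DeepSpeedCPUAdam for the standard Adam(W) implementation so the CPU-side update does not become a bottleneck.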

When calling the step function, two options are available: (1) update the optimizer's states, or (2) update the optimizer's states and copy the parameters back to GPU at the same time. We have seen that the second option can bring 30% higher throughput than doing the copy separately with option one.

Parameters:
  • model_params (iterable) – iterable of parameters to optimize or dicts defining parameter groups.
  • lr (float, optional) – learning rate. (default: 1e-3)
  • bias_correction (bool, optional) – enables bias correction. (default: True)
  • betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square. (default: (0.9, 0.999))
  • eps (float, optional) – term added to the denominator to improve numerical stability. (default: 1e-8)
  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
  • amsgrad (boolean, optional) – whether to use the AMSGrad variant of this algorithm from the paper `On the Convergence of Adam and Beyond`_ (default: False) NOT SUPPORTED in DeepSpeed CPUAdam!
  • adamw_mode – select between Adam and AdamW implementations (default: AdamW)

DeepSpeed Fused Lamb

class deepspeed.ops.lamb.FusedLamb(params, lr=0.001, bias_correction=True, betas=(0.9, 0.999), eps=1e-08, eps_inside_sqrt=False, weight_decay=0.0, max_grad_norm=0.0, max_coeff=10.0, min_coeff=0.01, amsgrad=False)[source]

Implements the LAMB algorithm. Currently GPU-only.

LAMB was proposed in `Large Batch Optimization for Deep Learning: Training BERT in 76 minutes` (https://arxiv.org/abs/1904.00962).

Parameters:
  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups.
  • lr (float, optional) – learning rate. (default: 1e-3)
  • bias_correction (bool, optional) – enables bias correction. (default: True)
  • betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square. (default: (0.9, 0.999))
  • eps (float, optional) – term added to the denominator to improve numerical stability. (default: 1e-8)
  • eps_inside_sqrt (boolean, optional) – adds eps to the bias-corrected second moment estimate before evaluating the square root, instead of adding it to the square root of the second moment estimate as in the original paper. (default: False)
  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
  • max_grad_norm (float, optional) – value used to clip the global gradient norm. (default: 0.0)
  • max_coeff (float, optional) – maximum value of the LAMB coefficient. (default: 10.0)
  • min_coeff (float, optional) – minimum value of the LAMB coefficient. (default: 0.01)
  • amsgrad (boolean, optional) – NOT SUPPORTED in FusedLamb!
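LAMB's distinguishing feature is a layer-wise trust ratio: each layer's Adam-style update is rescaled by the ratio of that layer's weight norm to its update norm, clipped to configurable bounds. A minimal sketch of that coefficient (illustrative only, not DeepSpeed's fused kernel; the default bounds here are this sketch's assumption):

```python
def lamb_coefficient(param_norm, update_norm, max_coeff=10.0, min_coeff=0.01):
    """Layer-wise LAMB trust ratio: ||w|| / ||update||, clipped to [min_coeff, max_coeff].

    When either norm is zero the ratio is undefined, so fall back to 1.0
    (i.e., apply the update without rescaling).
    """
    if param_norm == 0.0 or update_norm == 0.0:
        return 1.0
    return max(min_coeff, min(max_coeff, param_norm / update_norm))
```

Layers with large weights relative to their update take proportionally larger steps, which is what lets LAMB remain stable at very large batch sizes.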