Model Checkpointing

DeepSpeed provides routines for checkpointing model state during training.

Loading Training Checkpoints

deepspeed.DeepSpeedEngine.load_checkpoint(self, load_dir, tag=None, load_module_strict=True, load_optimizer_states=True, load_lr_scheduler_states=True)

Load a training checkpoint.

Parameters:
  • load_dir – Required. Directory to load the checkpoint from.
  • tag – Optional. Checkpoint tag used as a unique identifier for the checkpoint; if not provided, will attempt to load the tag recorded in the ‘latest’ file.
  • load_module_strict – Optional. Boolean to strictly enforce that the keys in the state_dict of the module and the checkpoint match.
  • load_optimizer_states – Optional. Boolean to load the training optimizer states from the checkpoint, e.g., Adam’s momentum and variance.
  • load_lr_scheduler_states – Optional. Boolean to load the learning rate scheduler states from the checkpoint.
Returns:

A tuple of (load_path, client_state).

  • load_path – Path of the loaded checkpoint. None if loading the checkpoint failed.
  • client_state – State dictionary used for loading required training states in the client code.
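A minimal usage sketch for resuming training; the model, optimizer, ds_config, and the "step" key in client_state are illustrative names, not part of the API:

    import deepspeed

    # model and optimizer are assumed to be defined elsewhere; ds_config is a
    # DeepSpeed configuration dict or a path to a JSON config file.
    model_engine, optimizer, _, _ = deepspeed.initialize(
        model=model, optimizer=optimizer, config=ds_config
    )

    # Resume from the checkpoint recorded in the 'latest' file under ./checkpoints.
    load_path, client_state = model_engine.load_checkpoint("./checkpoints")
    if load_path is None:
        print("No checkpoint found; starting from scratch")
    else:
        # Recover any custom state stored via client_state at save time
        # (the "step" key here is hypothetical).
        step = client_state.get("step", 0)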

Saving Training Checkpoints

deepspeed.DeepSpeedEngine.save_checkpoint(self, save_dir, tag=None, client_state={}, save_latest=True)

Save a training checkpoint.

Parameters:
  • save_dir – Required. Directory for saving the checkpoint.
  • tag – Optional. Checkpoint tag used as a unique identifier for the checkpoint, global step is used if not provided. Tag name must be the same across all ranks.
  • client_state – Optional. State dictionary used for saving required training states in the client code.
  • save_latest – Optional. Boolean to save a file ‘latest’ pointing to the latest saved checkpoint.
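A minimal saving sketch continuing the loading example above; the step counter and the tag format are illustrative. Note that all processes should call save_checkpoint, not just rank 0, since each process persists its own portion of the optimizer and scheduler state:

    # Periodically save a checkpoint during the training loop.
    # Every rank calls save_checkpoint; the tag must match across all ranks.
    client_state = {"step": step}  # custom state to restore on load (hypothetical key)
    model_engine.save_checkpoint("./checkpoints",
                                 tag=f"step_{step}",
                                 client_state=client_state)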