Model Checkpointing¶
DeepSpeed provides routines for checkpointing model state during training.
Loading Training Checkpoints¶
deepspeed.DeepSpeedEngine.load_checkpoint(self, load_dir, tag=None, load_module_strict=True, load_optimizer_states=True, load_lr_scheduler_states=True)¶

Load training checkpoint.
Parameters:
- load_dir – Required. Directory to load the checkpoint from.
- tag – Optional. Checkpoint tag used as a unique identifier for the checkpoint; if not provided, will attempt to load the tag recorded in the ‘latest’ file.
- load_module_strict – Optional. Boolean to strictly enforce that the keys in the state_dict of the module and checkpoint match.
- load_optimizer_states – Optional. Boolean to load the training optimizer states from the checkpoint, e.g. ADAM’s momentum and variance.
- load_lr_scheduler_states – Optional. Boolean to load the learning rate scheduler states from the checkpoint.
Returns: A tuple of load_path and client_state.
- load_path: Path of the loaded checkpoint. None if loading the checkpoint failed.
- client_state: State dictionary used for loading required training states in the client code.
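For example, a minimal sketch of resuming training from the most recent checkpoint. The names model, args, and the ./checkpoints directory are placeholders assumed to be defined by the surrounding training script, not part of this API:

```python
import deepspeed

# Assumed to exist in the surrounding script: `model` (a torch.nn.Module)
# and `args` (parsed command-line arguments carrying the DeepSpeed config).
model_engine, optimizer, _, lr_scheduler = deepspeed.initialize(
    args=args, model=model, model_parameters=model.parameters()
)

# With tag=None, the engine looks for the 'latest' file in ./checkpoints
# to locate the most recent checkpoint.
load_path, client_state = model_engine.load_checkpoint("./checkpoints")

if load_path is None:
    start_step = 0  # no checkpoint found; start fresh
else:
    # Recover any custom state passed as client_state when the checkpoint was saved.
    start_step = client_state.get("step", 0)
```

Passing load_module_strict=False can be useful when the current module has extra or renamed parameters relative to the checkpoint, e.g. when fine-tuning with a new task head.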
Saving Training Checkpoints¶
deepspeed.DeepSpeedEngine.save_checkpoint(self, save_dir, tag=None, client_state={}, save_latest=True)¶

Save training checkpoint.
Parameters:
- save_dir – Required. Directory for saving the checkpoint.
- tag – Optional. Checkpoint tag used as a unique identifier for the checkpoint; the global step is used if not provided. The tag name must be the same across all ranks.
- client_state – Optional. State dictionary used for saving required training states in the client code.
- save_latest – Optional. Save a file ‘latest’ pointing to the latest saved checkpoint.
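For example, a minimal sketch of periodic checkpointing inside a training loop. Here model_engine and data_loader are assumed to come from the surrounding script; the save directory, tag format, and checkpoint interval are illustrative choices, not API requirements:

```python
# Illustrative only: model_engine is an engine returned by deepspeed.initialize,
# and data_loader is an ordinary training data loader defined elsewhere.
for step, batch in enumerate(data_loader):
    loss = model_engine(batch)
    model_engine.backward(loss)
    model_engine.step()

    # Checkpoint periodically. All ranks must make this call with the same tag,
    # since each process persists its own portion of the training state.
    if step % 1000 == 0:
        model_engine.save_checkpoint(
            save_dir="./checkpoints",      # assumed output directory
            tag=f"step_{step}",            # same tag on every rank
            client_state={"step": step},   # custom state returned by load_checkpoint
        )
```

Anything placed in client_state here is exactly what load_checkpoint later returns as client_state, which is the intended way to round-trip custom counters such as the global step.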