Public Member Functions
def __init__(self, db_prefix, db_type, metadata_handler=None)
def init(self, nodes, retrieve_from_epoch=None, path_prefix=None, path_type=None)
def load(self, epoch, path_prefix=None, path_type=None)
def load_blobs_locally(self, nodes, blob_names, epoch, session)
def get_ckpt_db_name(self, node_name, epoch)
def report_checkpoint_stats(self, action_name)
def save(self, epoch)
def write_checkpoint_metadata(self, epoch)
def get_resume_from_epoch_id(self, user_epoch=None)
def set_params(self, nodes, path_prefix=None, path_type=None)
def cp_accessible(self, epoch=None)
Coordinates checkpointing across multiple nodes. Each of `init`, `load` and `save` builds TaskGroups that trigger checkpointing on each of the nodes involved in a distributed job.
Args:
    db_prefix: The prefix used to construct the full db name. Since `absolute_path` is set to True, this will be used as db_name in SaveOp.
    db_type: Type of database to use for storing checkpoints.
    metadata_handler: An optional object capable of reading/writing checkpoint info in the storage of choice.
Definition at line 432 of file checkpoint.py.
def caffe2.python.checkpoint.MultiNodeCheckpointManager.cp_accessible(self, epoch=None)
Returns True if checkpoint data is accessible.
Args:
    epoch: An integer. The epoch of the checkpoint. If None, checks whether the checkpoint directory is accessible.
Returns:
    is_cp_accessible: A boolean. True if checkpoint data is accessible.
Definition at line 621 of file checkpoint.py.
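The two cases described above (a directory-level check when `epoch` is None, a per-epoch check otherwise) can be sketched on a local filesystem. This is only an illustration: the function `sketch_cp_accessible`, its parameters, and the `<node_name>.<epoch>` file naming are assumptions, and the real method may delegate the check to the metadata handler instead.

```python
import os


def sketch_cp_accessible(checkpoint_dir, node_names=(), epoch=None):
    """Hypothetical local-filesystem version of an accessibility check."""
    if epoch is None:
        # No epoch given: only verify that the checkpoint directory exists.
        return os.path.isdir(checkpoint_dir)
    # Epoch given: every node's checkpoint file for that epoch must exist.
    # Assumed file naming scheme: "<node_name>.<epoch>".
    return all(
        os.path.exists(os.path.join(checkpoint_dir, "{}.{}".format(name, epoch)))
        for name in node_names
    )
```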
def caffe2.python.checkpoint.MultiNodeCheckpointManager.get_ckpt_db_name(self, node_name, epoch)
Returns the DB name of the given node and the given epoch. The DB name is effectively the checkpoint path of the given node and the given epoch.
Args:
    node_name: A string. The node name of interest.
    epoch: An integer. The epoch of the checkpoint.
Returns:
    checkpoint_db_name: A string. The checkpoint path of the given node and the given epoch.
Definition at line 531 of file checkpoint.py.
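As a rough illustration of how a per-node, per-epoch DB name could be derived, the sketch below joins a prefix with a `<node_name>.<epoch>` file name. The helper `sketch_ckpt_db_name`, its `path_prefix` handling, and the naming scheme are assumptions made for this example; the actual layout is whatever checkpoint.py implements.

```python
import posixpath


def sketch_ckpt_db_name(db_prefix, node_name, epoch, path_prefix=None):
    """Hypothetical helper: build a checkpoint path for one node and epoch."""
    # Assumed file naming scheme: "<node_name>.<epoch>".
    ckpt_filename = "{}.{}".format(node_name, epoch)
    if path_prefix:
        # Assumption: an explicit path prefix is prepended verbatim.
        return path_prefix + ckpt_filename
    # Otherwise fall back to the manager-wide db_prefix.
    return posixpath.join(db_prefix, ckpt_filename)


print(sketch_ckpt_db_name("/tmp/ckpt", "trainer:0", 3))  # /tmp/ckpt/trainer:0.3
```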
def caffe2.python.checkpoint.MultiNodeCheckpointManager.get_resume_from_epoch_id(self, user_epoch=None)
Identifies the epoch-id from which the Job must resume.
Args:
    user_epoch: An integer. Optional parameter that lets the user explicitly identify the epoch-id to load the checkpoint from.
Returns:
    epoch: The epoch-id to load checkpoints from, or None if no checkpoints were written.
Definition at line 583 of file checkpoint.py.
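The resume rule described above (an explicit `user_epoch` takes priority, otherwise the last recorded epoch is used, and None means nothing was written) can be sketched as follows. `InMemoryMetadataHandler` and `sketch_resume_epoch` are hypothetical names for illustration and are not part of caffe2; a real metadata handler would read from persistent storage.

```python
class InMemoryMetadataHandler:
    """Hypothetical stand-in for a metadata handler; not part of caffe2."""

    def __init__(self):
        self._written_epochs = []

    def write(self, epoch):
        # Record that checkpoint metadata exists for this epoch.
        self._written_epochs.append(epoch)

    def last_epoch(self, user_epoch=None):
        # An explicit user choice wins; otherwise use the newest recorded epoch.
        if user_epoch is not None:
            return user_epoch
        return max(self._written_epochs) if self._written_epochs else None


def sketch_resume_epoch(metadata_handler, user_epoch=None):
    """Return the epoch to resume from, or None if no checkpoint is known."""
    if metadata_handler is None:
        # Without a handler, only the user's explicit choice is available.
        return user_epoch
    return metadata_handler.last_epoch(user_epoch=user_epoch)
```

With this sketch, a job that has written epochs 1 and 2 resumes from 2 unless the caller passes `user_epoch` explicitly.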
def caffe2.python.checkpoint.MultiNodeCheckpointManager.load_blobs_locally(self, nodes, blob_names, epoch, session)
Loads the necessary blobs from the checkpoints to the current node.
Args:
    blob_names: A list of strings. Each string is the name of a blob.
    epoch: An integer. The checkpoint epoch to load from.
    session: A Session object to execute the Load ops.
Definition at line 497 of file checkpoint.py.
def caffe2.python.checkpoint.MultiNodeCheckpointManager.report_checkpoint_stats(self, action_name)
Reports the checkpoint stats for all the nodes. The stats of all nodes are aggregated so that it is clear which node's checkpoint operation dominates.
Args:
    action_name: A string. The name of the checkpoint operation.
Definition at line 549 of file checkpoint.py.
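The aggregation idea can be illustrated with plain dictionaries: merge every node's stats into one flat mapping, then pick the largest entry to see which node dominates. The function `sketch_aggregate_stats`, the `node/stat` key format, and the assumption that values are elapsed seconds are all hypothetical and chosen only for this example.

```python
def sketch_aggregate_stats(per_node_stats):
    """Merge per-node stats and report the dominating entry.

    per_node_stats: dict mapping node name -> dict of stat name -> seconds.
    Returns (merged, slowest_key); slowest_key is None when there are no stats.
    """
    merged = {}
    for node_name, stats in per_node_stats.items():
        for stat_name, value in stats.items():
            # Prefix each stat with its node so entries stay distinguishable.
            merged["{}/{}".format(node_name, stat_name)] = value
    slowest_key = max(merged, key=merged.get) if merged else None
    return merged, slowest_key


merged, slowest = sketch_aggregate_stats(
    {"trainer:0": {"save": 1.5}, "trainer:1": {"save": 4.0}}
)
print(slowest)  # trainer:1/save
```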
def caffe2.python.checkpoint.MultiNodeCheckpointManager.save(self, epoch)
Builds a Task that executes Save ops to serialize and persist blobs present in the global workspace.
Definition at line 565 of file checkpoint.py.
def caffe2.python.checkpoint.MultiNodeCheckpointManager.set_params(self, nodes, path_prefix=None, path_type=None)
Sets parameters associated with the checkpoint (CP) manager.
Args:
    nodes: An array of nodes where this checkpoint manager is running.
    path_prefix: Used to construct the db name or path where checkpoint files are stored.
    path_type: Indicates the type of path where checkpoint files are stored.
Definition at line 599 of file checkpoint.py.
def caffe2.python.checkpoint.MultiNodeCheckpointManager.write_checkpoint_metadata(self, epoch)
Writes metadata for the checkpoint.
Args:
    epoch: An integer. The epoch-id for which checkpoint metadata is written.
Definition at line 572 of file checkpoint.py.