Public Member Functions | |
| def | __init__ (self, db_prefix, db_type, metadata_handler=None) |
| def | init (self, nodes, retrieve_from_epoch=None, path_prefix=None, path_type=None) |
| def | load (self, epoch, path_prefix=None, path_type=None) |
| def | load_blobs_locally (self, nodes, blob_names, epoch, session) |
| def | get_ckpt_db_name (self, node_name, epoch) |
| def | report_checkpoint_stats (self, action_name) |
| def | save (self, epoch) |
| def | write_checkpoint_metadata (self, epoch) |
| def | get_resume_from_epoch_id (self, user_epoch=None) |
| def | set_params (self, nodes, path_prefix=None, path_type=None) |
| def | cp_accessible (self, epoch=None) |
Coordinates checkpointing across multiple nodes.
Each of `init`, `load` and `save` will build TaskGroups which will
trigger checkpointing on each of the nodes involved in a distributed job.
Args:
db_prefix: The prefix used to construct full db name. Since `absolute_path`
is set to True, this will be used as db_name in SaveOp.
db_type: Type of database to use for storing checkpoint.
metadata_handler: An optional object capable of reading/writing
checkpoint info in storage of choice.
Definition at line 432 of file checkpoint.py.
def caffe2.python.checkpoint.MultiNodeCheckpointManager.cp_accessible (self, epoch=None)
Returns True if checkpoint data is accessible.
Args:
epoch: An integer. The epoch of the checkpoint. If None, only the
checkpoint directory is checked for accessibility.
Returns:
is_cp_accessible: A boolean. True if the checkpoint data is accessible.
Definition at line 621 of file checkpoint.py.
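The two cases for `epoch` in `cp_accessible` can be illustrated with a small sketch. It assumes, purely for illustration, that checkpoints live in a local directory containing one entry named after each epoch; the helper `sketch_cp_accessible` is hypothetical and is not the implementation in checkpoint.py.

```python
import os
import tempfile


def sketch_cp_accessible(ckpt_dir, epoch=None):
    """Hypothetical check, assuming a local directory with one
    "<epoch>" entry per checkpoint (an assumption, not the real layout)."""
    if epoch is None:
        # No epoch given: only the checkpoint directory itself is checked.
        return os.path.isdir(ckpt_dir)
    # Epoch given: the entry for that epoch must exist.
    return os.path.exists(os.path.join(ckpt_dir, str(epoch)))


with tempfile.TemporaryDirectory() as ckpt_dir:
    # Create an empty entry standing in for the epoch-2 checkpoint.
    open(os.path.join(ckpt_dir, "2"), "w").close()
    dir_ok = sketch_cp_accessible(ckpt_dir)
    epoch_ok = sketch_cp_accessible(ckpt_dir, epoch=2)
    epoch_missing = sketch_cp_accessible(ckpt_dir, epoch=7)
```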
def caffe2.python.checkpoint.MultiNodeCheckpointManager.get_ckpt_db_name (self, node_name, epoch)
Returns the DB name of the given node and the given epoch.
The DB name is effectively the checkpoint path of the given node and
the given epoch.
Args:
node_name: A string. The node name of interest.
epoch: An integer. The epoch of the checkpoint.
Returns:
checkpoint_db_name: A string. The checkpoint path of the given
node and the given epoch.
Definition at line 531 of file checkpoint.py.
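As a rough illustration of the node-and-epoch mapping that `get_ckpt_db_name` describes, the sketch below composes a checkpoint path from a prefix, a node name, and an epoch. The helper name `sketch_ckpt_db_name` and the `<node_name>.<epoch>` file layout are assumptions made for illustration, not confirmed behavior of the method.

```python
import posixpath


def sketch_ckpt_db_name(db_prefix, node_name, epoch):
    """Hypothetical helper: build a per-node, per-epoch checkpoint path.

    Assumption: the file is named "<node_name>.<epoch>" under db_prefix.
    """
    ckpt_filename = "{}.{}".format(node_name, epoch)
    return posixpath.join(db_prefix, ckpt_filename)


# Every (node, epoch) pair maps to a distinct path.
names = [
    sketch_ckpt_db_name("/tmp/ckpt", node, 3)
    for node in ("trainer_0", "trainer_1")
]
```

Keeping the node name in the path means that nodes writing under a shared prefix do not overwrite each other's checkpoints for the same epoch.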
def caffe2.python.checkpoint.MultiNodeCheckpointManager.get_resume_from_epoch_id (self, user_epoch=None)
Identifies the epoch-id from which the Job must resume.
Args:
user_epoch: An integer. Optional parameter that lets the user explicitly
specify the epoch-id to load the checkpoint from.
Returns:
epoch: The epoch-id to load checkpoints from,
or None if no checkpoints were written.
Definition at line 583 of file checkpoint.py.
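The resume rule for `get_resume_from_epoch_id` (an explicit `user_epoch` if given, otherwise the last written epoch, otherwise None) can be sketched as follows. `InMemoryMetadataHandler` and `sketch_resume_epoch` are hypothetical stand-ins written for this example, and the handler's `write` and `last_epoch` method names are assumptions rather than a documented interface.

```python
class InMemoryMetadataHandler:
    """Hypothetical stand-in for a metadata handler; not part of caffe2."""

    def __init__(self):
        self._epochs = []

    def write(self, epoch):
        # Record that checkpoint metadata exists for this epoch.
        self._epochs.append(epoch)

    def last_epoch(self, user_epoch=None):
        # An explicit user choice wins; otherwise use the newest recorded epoch.
        if user_epoch is not None:
            return user_epoch
        return max(self._epochs) if self._epochs else None


def sketch_resume_epoch(handler, user_epoch=None):
    """Return the epoch to resume from, or None if nothing was written."""
    if handler is None:
        # Without a handler there is no record to consult.
        return user_epoch
    return handler.last_epoch(user_epoch=user_epoch)


handler = InMemoryMetadataHandler()
fresh_start = sketch_resume_epoch(handler)            # no checkpoints yet
handler.write(1)
handler.write(2)
latest = sketch_resume_epoch(handler)                 # newest recorded epoch
pinned = sketch_resume_epoch(handler, user_epoch=1)   # explicit override
```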
def caffe2.python.checkpoint.MultiNodeCheckpointManager.load_blobs_locally (self, nodes, blob_names, epoch, session)
Loads the necessary blobs from the checkpoints to the current node.
Args:
blob_names: A list of strings. Each string is the name of a
blob.
epoch: An integer. The checkpoint epoch to load from.
session: A Session object to execute the Load ops.
Definition at line 497 of file checkpoint.py.
def caffe2.python.checkpoint.MultiNodeCheckpointManager.report_checkpoint_stats (self, action_name)
Reports the checkpoint stats for all the nodes. The stats of all nodes are
aggregated so that we know which node's checkpoint operation dominates.
Args:
action_name: A string. The name of the checkpoint operation.
Definition at line 549 of file checkpoint.py.
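To make the aggregation idea behind `report_checkpoint_stats` concrete, the sketch below merges per-node timings into one dictionary and picks the node with the largest value. The function `sketch_aggregate_stats`, the `elapsed_s` key suffix, and the use of elapsed seconds as the stat are all assumptions for illustration; the actual stats collected in checkpoint.py may differ.

```python
def sketch_aggregate_stats(per_node_stats):
    """Hypothetical aggregation: merge per-node stats and find the slowest node.

    per_node_stats maps node name -> elapsed seconds (an assumed stat).
    """
    merged = {}
    for node, seconds in per_node_stats.items():
        # Prefix each stat with its node so entries stay distinguishable.
        merged["{}:elapsed_s".format(node)] = seconds
    # The dominant node is the one with the largest elapsed time.
    dominant = (
        max(per_node_stats, key=per_node_stats.get) if per_node_stats else None
    )
    return merged, dominant


merged, dominant = sketch_aggregate_stats(
    {"trainer_0": 1.5, "trainer_1": 4.0, "reader_0": 0.7}
)
```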
def caffe2.python.checkpoint.MultiNodeCheckpointManager.save (self, epoch)
Builds a Task that executes Save ops to serialize and persist the blobs present in the global workspace.
Definition at line 565 of file checkpoint.py.
def caffe2.python.checkpoint.MultiNodeCheckpointManager.set_params (self, nodes, path_prefix=None, path_type=None)
Sets the parameters associated with the checkpoint (CP) manager.
Args:
nodes: An array of nodes where this checkpoint manager is running.
path_prefix: Used to construct the db name or path where checkpoint files are
stored.
path_type: Indicates the type of path where checkpoint files are stored.
Definition at line 599 of file checkpoint.py.
def caffe2.python.checkpoint.MultiNodeCheckpointManager.write_checkpoint_metadata (self, epoch)
Writes the metadata for a checkpoint.
Args:
epoch: An integer. The epoch-id for which checkpoint metadata is
written.
Definition at line 572 of file checkpoint.py.