Public Member Functions
def __init__ (self, job, checkpoint_manager=None, resume_from_epoch=None, upload_task_group_builder=None)
def train (self, session)
def load_blobs_from_checkpoints (self, blob_names, epoch, session)
def save_checkpoints (self, epoch, session)
Public Attributes
resume_from_epoch
checkpoint_manager
job
upload_task_group_builder
Implements the runtime logic for jobs that checkpoint at epoch granularity. It can be used to run either single-host or distributed jobs. A JobRunner is a callable that is called once from the master, with a session as its argument. The call blocks until the Job has finished executing.

If a checkpoint_manager is passed, a checkpoint is taken after initialization and after each epoch. If, in addition, `resume_from_epoch` is an epoch number, the corresponding checkpoint is loaded and job execution continues from that epoch. In this case, the job's init_group is not run.

Refer to checkpoint_test.py for an example.
Definition at line 655 of file checkpoint.py.
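The control flow described above can be illustrated with a small pure-Python toy model. This is a sketch under stated assumptions, not the caffe2 implementation: `ToyCheckpointManager`, `run_job`, the `init` and `run_epoch` callbacks, the fixed `num_epochs` stop condition, and the epoch numbering (checkpoint 0 is the post-initialization state) are all hypothetical stand-ins chosen for illustration.

```python
# Toy model of epoch-level checkpointing with optional resume.
# None of these names come from caffe2; they are hypothetical stand-ins.


class ToyCheckpointManager:
    """Keeps one in-memory snapshot of the job state per epoch."""

    def __init__(self):
        self.snapshots = {}

    def save(self, epoch, state):
        # Store a shallow copy so later updates do not alter the snapshot.
        self.snapshots[epoch] = dict(state)

    def load(self, epoch):
        return dict(self.snapshots[epoch])


def run_job(init, run_epoch, num_epochs, checkpoint_manager=None,
            resume_from_epoch=None):
    """Run epochs, checkpointing after init and after every epoch."""
    if checkpoint_manager is not None and resume_from_epoch is not None:
        # Resuming: restore the saved state and skip initialization.
        state = checkpoint_manager.load(resume_from_epoch)
        epoch = resume_from_epoch
    else:
        state = init()
        epoch = 0
        if checkpoint_manager is not None:
            checkpoint_manager.save(epoch, state)

    # Assumption: the toy stops after a fixed number of epochs.
    while epoch < num_epochs:
        run_epoch(state)
        epoch += 1
        if checkpoint_manager is not None:
            checkpoint_manager.save(epoch, state)
    return epoch, state
```

In this toy, calling `run_job` a second time with the same manager and `resume_from_epoch=2` restores the state saved after epoch 2 and never calls `init`, which mirrors the statement that the init_group is skipped on resume.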
def caffe2.python.checkpoint.JobRunner.__init__(self, job, checkpoint_manager=None, resume_from_epoch=None, upload_task_group_builder=None)
Initializes the JobRunner.

Args:
    job: A Job object. The job to be executed.
    checkpoint_manager: The manager that initializes, saves, and loads checkpoints. Can be a CheckpointManager for a single machine or a MultiNodeCheckpointManager for multiple machines.
    resume_from_epoch: An integer. The epoch to resume from.
    upload_task_group_builder: A subclass of the UploadTaskGroupBuilder. Creates a task group to upload checkpoints.
Definition at line 671 of file checkpoint.py.
def caffe2.python.checkpoint.JobRunner.load_blobs_from_checkpoints(self, blob_names, epoch, session)
Loads the necessary blobs from the checkpoints.

Checkpoints store snapshots of the workspace on each node. Sometimes only a subset of the blobs in the checkpoints is needed. One common scenario is loading only the model blobs for evaluation. Given the names of the necessary blobs, this function goes over all the checkpoints of all the nodes, but loads only the blobs listed in blob_names into the current workspace.

Args:
    blob_names: A list of strings. Each string is the name of a blob.
    epoch: An integer. The checkpoint epoch to load from.
    session: A Session object to execute the load ops.

Raises:
    ValueError: When the checkpoint manager is invalid.
Definition at line 761 of file checkpoint.py.
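The selective-loading idea can be sketched without caffe2 as a filter over per-node snapshots. Everything here is a hypothetical illustration: the function name `load_selected_blobs`, the nested-dict layout `{node_name: {epoch: {blob_name: value}}}`, and the plain dict standing in for the workspace are assumptions, and treating a missing store as the "invalid checkpoint manager" case is only one possible reading of the ValueError condition.

```python
# Toy sketch: copy only the requested blobs from every node's checkpoint.
# The data layout and all names are hypothetical, not caffe2's API.


def load_selected_blobs(node_checkpoints, blob_names, epoch, workspace):
    """Load the blobs named in `blob_names` for `epoch` into `workspace`.

    node_checkpoints: {node_name: {epoch: {blob_name: value}}}
    workspace: a dict standing in for the current workspace.
    """
    if node_checkpoints is None:
        # Assumption: a missing store models an invalid checkpoint manager.
        raise ValueError("checkpoint manager is invalid")

    wanted = set(blob_names)
    # Visit every node's checkpoint, but keep only the requested blobs.
    for by_epoch in node_checkpoints.values():
        snapshot = by_epoch[epoch]
        for name, value in snapshot.items():
            if name in wanted:
                workspace[name] = value
    return workspace
```

For evaluation, a caller would pass only the model blob names, so per-node bookkeeping such as reader state stays out of the workspace.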
def caffe2.python.checkpoint.JobRunner.save_checkpoints(self, epoch, session)
Triggers the operation to save checkpoints.

This method triggers the Save ops that serialize and persist the blobs present in the global workspace.

Args:
    epoch: An integer. The checkpoint epoch-id that we are saving.
    session: A Session object to execute the save ops.

Raises:
    ValueError: When the checkpoint manager is invalid.
Definition at line 789 of file checkpoint.py.
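As a rough illustration of saving a snapshot keyed by epoch-id, the following toy persists a dict-based workspace into an in-memory store. It is a sketch only: `save_snapshot`, the dict-based `store`, and the use of `copy.deepcopy` in place of real serialization are assumptions made for the example, and raising ValueError for a missing store is a simplified stand-in for the invalid-manager check.

```python
import copy

# Toy sketch: persist a snapshot of the workspace under an epoch-id.
# The store and function name are hypothetical, not caffe2's API.


def save_snapshot(store, epoch, workspace):
    """Save a deep copy of `workspace` into `store` under key `epoch`."""
    if store is None:
        # Assumption: a missing store models an invalid checkpoint manager.
        raise ValueError("checkpoint manager is invalid")
    # Deep copy so that later training updates cannot change the snapshot.
    store[epoch] = copy.deepcopy(workspace)
    return store[epoch]
```

The deep copy matters in this toy because the workspace keeps changing after the save; a snapshot that shared references with it would silently drift.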
def caffe2.python.checkpoint.JobRunner.train(self, session)
Runs the training flow.

Args:
    session: A Session object. Valid choices are: LocalSession, LocalHostScheduler, and DistributedSession. It is used to execute one TaskGroup at a time.
Definition at line 689 of file checkpoint.py.