Public Member Functions
| def | __init__ (self, job, checkpoint_manager=None, resume_from_epoch=None, upload_task_group_builder=None) | 
| def | train (self, session) | 
| def | load_blobs_from_checkpoints (self, blob_names, epoch, session) | 
| def | save_checkpoints (self, epoch, session) | 
Public Attributes
| resume_from_epoch | |
| checkpoint_manager | |
| job | |
| upload_task_group_builder | |
Implements the runtime logic for jobs with checkpointing at the epoch level. Can be used to run either single-host or distributed jobs. The job runner is a callable that is called once from the master, with a session as its argument. The call blocks until the Job execution is complete.

If a checkpoint_manager is passed, checkpoints are taken after initialization and after each epoch execution. If, in addition, `resume_from_epoch` is an epoch number, the corresponding checkpoint is loaded and job execution continues from the given epoch. In this case, the job's init_group is not run.

Refer to checkpoint_test.py for an example.
Definition at line 655 of file checkpoint.py.
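The checkpoint and resume behavior described above can be sketched in plain Python. This is a toy model, not the Caffe2 implementation: `ToyCheckpointStore`, `run_job`, `init_fn`, and `epoch_fn` are hypothetical names, the fixed `num_epochs` stands in for whatever stop condition the real job uses, and numbering the post-initialization checkpoint as epoch 0 is an assumption made for the illustration.

```python
# Toy illustration only: none of these names come from caffe2.
class ToyCheckpointStore:
    """Hypothetical in-memory stand-in for a checkpoint manager."""

    def __init__(self):
        self.saved = {}  # epoch number -> snapshot of the state dict

    def save(self, epoch, state):
        self.saved[epoch] = dict(state)  # copy so later updates do not leak in

    def load(self, epoch):
        return dict(self.saved[epoch])


def run_job(init_fn, epoch_fn, num_epochs, store=None, resume_from_epoch=None):
    """Run epochs up to num_epochs, checkpointing after init and each epoch."""
    if store is not None and resume_from_epoch is not None:
        # Resuming: restore the saved state and skip initialization.
        state = store.load(resume_from_epoch)
        start = resume_from_epoch + 1
    else:
        state = init_fn()
        start = 1
        if store is not None:
            store.save(0, state)  # assumed: epoch 0 is the post-init checkpoint
    for epoch in range(start, num_epochs + 1):
        epoch_fn(state)
        if store is not None:
            store.save(epoch, state)
    return state
```

Calling `run_job` a second time with the same store and `resume_from_epoch=2` would continue from the epoch 2 snapshot without invoking `init_fn` again, which mirrors the statement that the init_group is not run on resume.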
def caffe2.python.checkpoint.JobRunner.__init__ (self, job, checkpoint_manager = None, resume_from_epoch = None, upload_task_group_builder = None)
Initializes the JobRunner.
Args:
    job: A Job object. The job to be executed.
    checkpoint_manager: Can be a CheckpointManager for single machine
        or a MultiNodeCheckpointManager for multi-machine. The manager
        that initializes/saves/loads checkpoints.
    resume_from_epoch: An integer. The epoch to resume from.
    upload_task_group_builder: A subclass of the
        UploadTaskGroupBuilder. Creates a task group to upload
        checkpoints.
 
Definition at line 671 of file checkpoint.py.
def caffe2.python.checkpoint.JobRunner.load_blobs_from_checkpoints (self, blob_names, epoch, session)
Loads the necessary blobs from the checkpoints.
Checkpoints store the snapshots of the workspace in each node.
Sometimes we only need to load a subset of the blobs from the
checkpoints. One common scenario is to load only the model blobs from
the checkpoints for evaluation. Given the names of the
necessary blobs, this function goes over all the checkpoints of all the
nodes, but loads only the blobs specified in blob_names into the
current workspace.
Args:
    blob_names: A list of strings. Each string is the name of a
        blob.
    epoch: An integer. The checkpoint epoch to load from.
    session: A Session object to execute the load ops.
Raises:
    ValueError: When the checkpoint manager is invalid.
 
Definition at line 761 of file checkpoint.py.
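The selective loading described above can be illustrated with plain dictionaries. This is a minimal sketch under stated assumptions, not the library's code: `load_selected_blobs` is a hypothetical helper, each node's checkpoint is modeled as a dict from blob name to value, and the workspace is modeled as an ordinary dict.

```python
# Toy illustration only: dicts stand in for checkpoints and the workspace.
def load_selected_blobs(node_checkpoints, blob_names, workspace):
    """Copy only the requested blobs from every node's snapshot.

    node_checkpoints: dict mapping node name -> {blob name: value}.
    blob_names: list of blob names to load.
    workspace: dict that receives the selected blobs.
    """
    wanted = set(blob_names)
    for snapshot in node_checkpoints.values():
        for name, value in snapshot.items():
            if name in wanted:
                workspace[name] = value
    return workspace
```

Every node's snapshot is visited, but blobs outside `blob_names` (for example reader or optimizer state) never reach the workspace, which is the behavior an evaluation-only load relies on.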
def caffe2.python.checkpoint.JobRunner.save_checkpoints (self, epoch, session)
Triggers the operation to save checkpoints.
This method triggers the Save ops to serialize and persist the
blobs present in the global workspace.
Args:
    epoch: An integer. The checkpoint epoch-id that we are saving.
    session: A Session object to execute the save ops.
Raises:
    ValueError: When the checkpoint manager is invalid.
 
Definition at line 789 of file checkpoint.py.
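As a rough analogy for the save step, the sketch below serializes a dict of blobs to one file per epoch and raises ValueError when no destination is configured. Everything here is an assumption made for illustration: `save_epoch_checkpoint` is a hypothetical helper, JSON stands in for the real serialization, the `epoch_<n>.json` naming is invented, and treating a missing destination as the "invalid" case is a guess at what the Raises clause covers.

```python
# Toy illustration only: JSON files stand in for the real Save ops.
import json
import os
import tempfile


def save_epoch_checkpoint(workspace, epoch, directory):
    """Persist a dict of blobs under a file name keyed by the epoch id."""
    if directory is None:
        # Assumed analogue of an invalid checkpoint manager.
        raise ValueError("no checkpoint destination configured")
    path = os.path.join(directory, "epoch_%d.json" % epoch)
    with open(path, "w") as f:
        json.dump(workspace, f)
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(save_epoch_checkpoint({"fc_w": [1.0, 2.0]}, 3, tmp))
```

Keying the file name by epoch keeps earlier checkpoints intact, so any saved epoch remains available to resume from.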
def caffe2.python.checkpoint.JobRunner.train (self, session)
Runs the training flow.
Args:
    session: A Session object. Valid choices are: LocalSession,
        LocalHostScheduler, and DistributedSession. It is used to
        execute one TaskGroup at a time.
 
Definition at line 689 of file checkpoint.py.