Caffe2 - Python API
A deep learning, cross platform ML framework
caffe2.python.checkpoint.JobRunner Class Reference

Public Member Functions

def __init__ (self, job, checkpoint_manager=None, resume_from_epoch=None, upload_task_group_builder=None)
def train (self, session)
def load_blobs_from_checkpoints (self, blob_names, epoch, session)
def save_checkpoints (self, epoch, session)

Detailed Description

Implements the runtime logic for jobs with epoch-level checkpointing.
Can be used to run either single-host or distributed jobs. The job
runner is a callable, to be called once from the master with a session
as its argument. The call blocks until job execution is complete.

If a checkpoint_manager is passed, checkpoints will be taken after
initialization and after each epoch execution. If, in addition,
`resume_from_epoch` is an epoch number, the corresponding checkpoint will
be loaded and job execution will continue from the given epoch. In
this case, the job's init_group will not be run.
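
The checkpoint and resume behaviour described above can be illustrated with a toy model of the control flow. This is a minimal sketch in plain Python, not Caffe2 code: `ToyCheckpointManager`, `run_job`, `init`, and `run_epoch` are hypothetical names introduced here, and numbering the post-initialization checkpoint as epoch 0 is an assumption.

```python
# Toy model of the control flow only; none of these names are Caffe2 APIs.


class ToyCheckpointManager:
    """Hypothetical stand-in that keeps checkpoints in a dict keyed by epoch."""

    def __init__(self):
        self.saved = {}

    def save(self, epoch, state):
        self.saved[epoch] = dict(state)

    def load(self, epoch):
        return dict(self.saved[epoch])


def run_job(init, run_epoch, num_epochs, checkpoint_manager=None,
            resume_from_epoch=None):
    """Run epochs, checkpointing after init and after every epoch."""
    if checkpoint_manager is not None and resume_from_epoch is not None:
        # Resuming: restore the saved state and skip initialization.
        state = checkpoint_manager.load(resume_from_epoch)
        epoch = resume_from_epoch
    else:
        state = init()
        epoch = 0  # assumption: the post-init checkpoint is numbered 0
        if checkpoint_manager is not None:
            checkpoint_manager.save(epoch, state)

    while epoch < num_epochs:
        epoch += 1
        run_epoch(state)
        if checkpoint_manager is not None:
            checkpoint_manager.save(epoch, state)
    return state


init_calls = []


def init():
    init_calls.append("init")
    return {"steps": 0}


def run_epoch(state):
    state["steps"] += 1


manager = ToyCheckpointManager()
first = run_job(init, run_epoch, num_epochs=2, checkpoint_manager=manager)
resumed = run_job(init, run_epoch, num_epochs=4,
                  checkpoint_manager=manager, resume_from_epoch=2)
```

In this sketch the second call continues from the epoch 2 checkpoint and never calls `init` again, mirroring the statement that the job's init_group is not run when resuming.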

Refer to for an example.

Definition at line 655 of file

Constructor & Destructor Documentation

def caffe2.python.checkpoint.JobRunner.__init__ (self, job, checkpoint_manager=None, resume_from_epoch=None, upload_task_group_builder=None)
Initializes the JobRunner.

Args:
    job: A Job object. The job to be executed.
    checkpoint_manager: Can be a CheckpointManager for single machine
        or a MultiNodeCheckpointManager for multi-machine. The manager
        that initializes/saves/loads checkpoints.
    resume_from_epoch: An integer. The epoch to resume from.
    upload_task_group_builder: A subclass of the
        UploadTaskGroupBuilder. Creates a task group to upload

Definition at line 671 of file

Member Function Documentation

def caffe2.python.checkpoint.JobRunner.load_blobs_from_checkpoints (self, blob_names, epoch, session)
Loads the necessary blobs from the checkpoints.

Checkpoints store the snapshots of the workspace in each node.
Sometimes we only need to load a subset of the blobs from the
checkpoints. One common scenario is to load only the model blobs from
the checkpoints for evaluation purpose. Given the names of the
necessary blobs, this function goes over all the checkpoints of all the
nodes, but only loads the blobs specified in the blob_names to the
current workspace.

Args:
    blob_names: A list of strings. Each string is the name of a blob.
    epoch: An integer. The checkpoint epoch to load from.
    session: A Session object to execute the load ops.

Raises:
    ValueError: When the checkpoint manager is invalid.

Definition at line 761 of file
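
The selective loading described for load_blobs_from_checkpoints can be pictured with a small plain-Python model. This is only a sketch of the idea, not the Caffe2 implementation: `load_selected_blobs`, the dict-based checkpoint layout, and the blob and node names are all hypothetical, and treating a missing checkpoint source as the "invalid manager" case is an assumption.

```python
# Toy model: one dict of blobs per node stands in for real checkpoint files.


def load_selected_blobs(node_checkpoints, blob_names, workspace):
    """Copy only the requested blobs from every node's checkpoint."""
    if node_checkpoints is None:
        # Assumption: stands in for the invalid checkpoint manager case.
        raise ValueError("No checkpoint source available.")
    wanted = set(blob_names)
    # Visit every node's checkpoint, but keep only the wanted blobs.
    for blobs in node_checkpoints.values():
        for name, value in blobs.items():
            if name in wanted:
                workspace[name] = value
    return workspace


# Hypothetical checkpoint contents for a single epoch on two nodes.
epoch_checkpoints = {
    "trainer_0": {"fc_w": [0.1, 0.2], "optimizer_state": [9.0]},
    "trainer_1": {"fc_b": [0.5], "reader_position": 128},
}

# Load only the model blobs, for example before an evaluation run.
workspace = load_selected_blobs(epoch_checkpoints, ["fc_w", "fc_b"], {})
```

Here the optimizer and reader state stay out of the workspace, which matches the evaluation scenario mentioned in the description.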

def caffe2.python.checkpoint.JobRunner.save_checkpoints (self, epoch, session)
Triggers the operation to save checkpoints.

This method will trigger the Save ops to serialize and persist the
blobs present in the global workspace.

Args:
    epoch: An integer. The checkpoint epoch-id that we are saving.
    session: A Session object to execute the save ops.

Raises:
    ValueError: When the checkpoint manager is invalid.

Definition at line 789 of file

def caffe2.python.checkpoint.JobRunner.train (self, session)
Runs the training flow.

    session: A Session object. Valid choices are: LocalSession,
        LocalHostScheduler, and DistributedSession. It is used to
        execute one TaskGroup at a time.

Definition at line 689 of file
