Public Attributes

    init_group
    epoch_group
    download_group
    exit_group
    stop_conditions
A Job defines four TaskGroups: the `init_group`, the `epoch_group`, the `download_group` and the `exit_group`, which will be run by a JobRunner.

The `init_group` will be run only once, at startup. Its role is to initialize globally persistent blobs such as model weights, accumulators and data file lists.

The `epoch_group` will be run in a loop after `init_group`. The loop exits when any of the stop conditions added with `add_stop_condition` is True at the end of an epoch.

The `download_group` will be run only once, after all executions of `epoch_group` finish. Its role is to collect the scattered, distributed parameters back after training.

The `exit_group` will be run only once at the very end of the job. Its role is to save the results of training.

Jobs are context-driven, so Tasks can be added to the active Job without having to explicitly pass the job object around.

Example of usage:

    def build_reader(partitions):
        with Job.current().init_group:
            reader = HiveReader(init_reader, ..., partitions)
            Task(step=init_reader)
        with Job.current().epoch_group:
            limited_reader = ReaderWithLimit(reader, num_iter=10000)
            data_queue = pipe(limited_reader, num_threads=8)
            Job.current().add_stop_condition(limited_reader.data_finished())
        return data_queue

    def build_hogwild_trainer(reader, model):
        with Job.current().init_group:
            Task(step=model.param_init_net)
        with Job.current().epoch_group:
            pipe(reader, processor=model, num_threads=8)
        with Job.current().exit_group:
            Task(step=model.save_model_net)

    with Job() as job:
        reader = build_reader(partitions)
        model = build_model(params)
        build_hogwild_trainer(reader, model)
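To make the lifecycle of the four groups concrete, the sketch below shows the control flow a JobRunner drives. It is a minimal illustration of the semantics described above, not the actual JobRunner implementation: `run_job`, `session.run` and `fetch_blob` are hypothetical stand-ins for whatever execution and blob-fetching API the runner actually uses.

    # Sketch of the JobRunner control flow, assuming a hypothetical
    # session.run(task_group) that executes a TaskGroup to completion and a
    # hypothetical fetch_blob(blob) that reads a stop-condition blob's value.
    def run_job(session, job):
        # init_group runs once: initialize weights, accumulators, file lists.
        session.run(job.init_group)
        while True:
            # epoch_group runs repeatedly; each pass is one epoch.
            session.run(job.epoch_group)
            # Exit the loop when any condition registered via
            # add_stop_condition evaluates to True at the end of an epoch.
            if any(fetch_blob(cond) for cond in job.stop_conditions):
                break
        # download_group runs once: gather the scattered parameters back.
        session.run(job.download_group)
        # exit_group runs once: save the results of training.
        session.run(job.exit_group)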
Definition at line 27 of file checkpoint.py.