Caffe2 - Python API
A deep learning, cross platform ML framework
distributed_cpu.py
import torch
from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors
import torch.distributed as dist
from torch.nn.modules import Module
from collections import defaultdict
from torch.autograd import Variable
import torch.utils.hooks


class DistributedDataParallelCPU(Module):
    r"""Implements distributed data parallelism for CPU at the module level.

    This module supports the ``mpi`` and ``gloo`` backends.

    This container parallelizes the application of the given module by splitting
    the input across the specified devices by chunking in the batch
    dimension. The module is replicated on each machine, and each such replica
    handles a portion of the input. During the backward pass, gradients from
    each node are averaged.

    This module can be used in conjunction with the DistributedSampler
    (see :class:`~torch.utils.data.distributed.DistributedSampler`),
    which loads a subset of the original dataset for each node with the same
    batch size. To keep the global batch size constant (strong scaling),
    configure the per-node batch size like this:

    n = 1, batch size = 128

    n = 2, batch size = 64

    n = 4, batch size = 32

    n = 8, batch size = 16

    Creation of this class requires the distributed package to be already
    initialized in the process group mode
    (see :func:`torch.distributed.init_process_group`).

    .. warning::
        The constructor, the forward method, and differentiation of the
        output (or of a function of the output of this module) are
        distributed synchronization points. Take that into account in case
        different nodes might be executing different code.

    .. warning::
        This module assumes all parameters are registered in the model by the
        time it is created. No parameters should be added or removed later.

    .. warning::
        This module assumes all gradients are dense.

    .. warning::
        This module doesn't work with :func:`torch.autograd.grad` (i.e. it will
        only work if gradients are to be accumulated in ``.grad`` attributes of
        parameters).

    .. warning::
        Forward and backward hooks defined on :attr:`module` and its submodules
        won't be invoked anymore, unless the hooks are initialized in the
        :meth:`forward` method.

    .. note::
        Parameters are broadcast between nodes in the __init__() function. The
        module performs an all-reduce step on gradients and assumes that they
        will be modified by the optimizer in all nodes in the same way.

    Args:
        module: module to be parallelized

    Example::

        >>> torch.distributed.init_process_group(world_size=4, init_method='...')
        >>> net = torch.nn.parallel.DistributedDataParallelCPU(model)
    """

    def __init__(self, module):
        super(DistributedDataParallelCPU, self).__init__()
        self.module = module
        # Start every replica from rank 0's parameter values.
        self.sync_parameters()

        def allreduce_params():
            # Run at most once per backward pass; forward() re-arms the flag.
            if self.needs_reduction:
                self.needs_reduction = False
                # Group parameters by tensor type so each bucket can be flattened.
                buckets = defaultdict(list)
                for param in self.module.parameters():
                    if param.requires_grad and param.grad is not None:
                        tp = type(param.data)
                        buckets[tp].append(param)

                for bucket in buckets.values():
                    grads = [param.grad.data for param in bucket]
                    coalesced = _flatten_dense_tensors(grads)
                    # Sum across processes, then divide to get the average.
                    dist.all_reduce(coalesced)
                    coalesced /= dist.get_world_size()
                    for buf, synced in zip(grads, _unflatten_dense_tensors(coalesced, grads)):
                        buf.copy_(synced)

        for param in list(self.module.parameters()):
            @torch.utils.hooks.unserializable_hook
            def allreduce_hook(*unused):
                # Queue the reduction as an autograd engine callback.
                Variable._execution_engine.queue_callback(allreduce_params)

            if param.requires_grad:
                param.register_hook(allreduce_hook)

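    def sync_parameters(self):
        # Broadcast each parameter from rank 0 to every other process.
        for param in self.module.parameters():
            dist.broadcast(param.data, 0)

    def forward(self, *inputs, **kwargs):
        # Mark that the next backward pass must all-reduce the gradients.
        self.needs_reduction = True
        return self.module(*inputs, **kwargs)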
111  return self.module(*inputs, **kwargs)