launch.py
r"""
`torch.distributed.launch` is a module that spawns multiple distributed
training processes on each of the training nodes.

The utility can be used for single-node distributed training, in which one or
more processes per node will be spawned. It can be used for either CPU or GPU
training. If the utility is used for GPU training, each distributed process
operates on a single GPU, which can significantly improve single-node training
performance. It can also be used for multi-node distributed training, again by
spawning multiple processes on each node, to improve multi-node training
performance. This is especially beneficial for systems with multiple InfiniBand
interfaces that have direct-GPU support, since all of them can be utilized for
aggregated communication bandwidth.

In both the single-node and multi-node distributed training cases, this utility
launches the given number of processes per node (``--nproc_per_node``). If used
for GPU training, this number needs to be less than or equal to the number of
GPUs on the current system, and each process will operate on a single GPU from
*GPU 0 to GPU (nproc_per_node - 1)*.

**How to use this module:**

1. Single-Node multi-process distributed training

::

    >>> python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE
               YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3 and all other
               arguments of your training script)

2. Multi-Node multi-process distributed training: (e.g. two nodes)

Node 1: *(IP: 192.168.1.1, and has a free port: 1234)*

::

    >>> python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE
               --nnodes=2 --node_rank=0 --master_addr="192.168.1.1"
               --master_port=1234 YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3
               and all other arguments of your training script)

Node 2:

::

    >>> python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE
               --nnodes=2 --node_rank=1 --master_addr="192.168.1.1"
               --master_port=1234 YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3
               and all other arguments of your training script)

3. To look up what optional arguments this module offers:

::

    >>> python -m torch.distributed.launch --help

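For reference, the launcher derives the total world size and each process's
global rank from these flags. With the two-node example above and a
hypothetical ``--nproc_per_node=4`` (i.e. 4 GPUs per node), the values it
exports as ``WORLD_SIZE`` and ``RANK`` on node 2 (``--node_rank=1``) would be:

::

    >>> nnodes, nproc_per_node, node_rank = 2, 4, 1  # hypothetical 4-GPU nodes
    >>> nnodes * nproc_per_node                      # exported as WORLD_SIZE
    8
    >>> [node_rank * nproc_per_node + local_rank     # exported as RANK, one per process
    ...  for local_rank in range(nproc_per_node)]
    [4, 5, 6, 7]
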
**Important Notices:**

1. This utility and multi-process distributed (single-node or
multi-node) GPU training currently only achieves the best performance using
the NCCL distributed backend. Thus, the NCCL backend is the recommended backend
to use for GPU training.

2. In your training program, you must parse the command-line argument
``--local_rank=LOCAL_PROCESS_RANK``, which will be provided by this module.
If your training program uses GPUs, you should ensure that your code only
runs on the GPU device of LOCAL_PROCESS_RANK. This can be done by:

Parsing the local_rank argument

::

    >>> import argparse
    >>> parser = argparse.ArgumentParser()
    >>> parser.add_argument("--local_rank", type=int)
    >>> args = parser.parse_args()

Set your device to the local rank using either

::

    >>> torch.cuda.set_device(args.local_rank)  # before your code runs

or

::

    >>> with torch.cuda.device(args.local_rank):
    ...     # your code to run

3. In your training program, you are supposed to call the following function
at the beginning to start the distributed backend. You need to make sure that
the init_method uses ``env://``, which is the only ``init_method`` supported
by this module.

::

    torch.distributed.init_process_group(backend='YOUR BACKEND',
                                         init_method='env://')

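The launcher's ``main()`` below exports ``MASTER_ADDR``, ``MASTER_PORT``,
``RANK`` and ``WORLD_SIZE`` for every process it spawns, which is exactly the
information the ``env://`` initialization method reads from the environment.
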
4. In your training program, you can either use regular distributed functions
or use the :func:`torch.nn.parallel.DistributedDataParallel` module. If your
training program uses GPUs for training and you would like to use the
:func:`torch.nn.parallel.DistributedDataParallel` module,
here is how to configure it.

::

    model = torch.nn.parallel.DistributedDataParallel(model,
                                                      device_ids=[args.local_rank],
                                                      output_device=args.local_rank)

Please ensure that the ``device_ids`` argument is set to the only GPU device id
that your code will be operating on. This is generally the local rank of the
process. In other words, ``device_ids`` needs to be ``[args.local_rank]``,
and ``output_device`` needs to be ``args.local_rank`` in order to use this
utility.

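Putting notices 2-4 together, a minimal sketch of a training script that works
with this launcher might look like the following (``MyModel`` and the training
loop are hypothetical placeholders, and the NCCL backend is assumed):

::

    import argparse
    import torch
    import torch.distributed

    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)
    args = parser.parse_args()

    # bind this process to its GPU, then start the distributed backend
    torch.cuda.set_device(args.local_rank)
    torch.distributed.init_process_group(backend='nccl', init_method='env://')

    # move the (hypothetical) model to the local GPU and wrap it in DDP
    model = MyModel().cuda(args.local_rank)
    model = torch.nn.parallel.DistributedDataParallel(
        model, device_ids=[args.local_rank], output_device=args.local_rank)

    # ... your training loop runs here in each spawned process ...
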
5. Another way to pass ``local_rank`` to the subprocesses is via the environment
variable ``LOCAL_RANK``. This behavior is enabled when you launch the script
with ``--use_env=True``. You must adjust the subprocess example above to replace
``args.local_rank`` with ``os.environ['LOCAL_RANK']``; the launcher
will not pass ``--local_rank`` when you specify this flag.

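For example, a minimal sketch of the ``--use_env`` variant of the device setup
above (reading the rank from the environment instead of the command line):

::

    import os
    import torch

    # set by the launcher for each subprocess when --use_env is passed
    local_rank = int(os.environ['LOCAL_RANK'])
    torch.cuda.set_device(local_rank)
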
.. warning::

    ``local_rank`` is NOT globally unique: it is only unique per process
    on a machine. Thus, don't use it to decide if you should, e.g.,
    write to a networked filesystem. See
    https://github.com/pytorch/pytorch/issues/12042 for an example of
    how things can go wrong if you don't do this correctly.

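A common pattern is to guard such once-per-job work with the *global* rank
instead (available after ``init_process_group`` has been called); a small
sketch, where ``save_checkpoint`` is a hypothetical helper:

::

    import torch.distributed as dist

    # the global rank is unique across all nodes, unlike local_rank
    if dist.get_rank() == 0:
        save_checkpoint(model)  # hypothetical: write shared output exactly once
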
"""

import sys
import subprocess
import os
import socket
from argparse import ArgumentParser, REMAINDER

import torch

def parse_args():
    """
    Helper function parsing the command line options
    @retval ArgumentParser
    """
    parser = ArgumentParser(description="PyTorch distributed training launch "
                                        "helper utility that will spawn up "
                                        "multiple distributed processes")

    # Optional arguments for the launch helper
    parser.add_argument("--nnodes", type=int, default=1,
                        help="The number of nodes to use for distributed "
                             "training")
    parser.add_argument("--node_rank", type=int, default=0,
                        help="The rank of the node for multi-node distributed "
                             "training")
    parser.add_argument("--nproc_per_node", type=int, default=1,
                        help="The number of processes to launch on each node, "
                             "for GPU training, this is recommended to be set "
                             "to the number of GPUs in your system so that "
                             "each process can be bound to a single GPU.")
    parser.add_argument("--master_addr", default="127.0.0.1", type=str,
                        help="Master node (rank 0)'s address, should be either "
                             "the IP address or the hostname of node 0, for "
                             "single node multi-proc training, the "
                             "--master_addr can simply be 127.0.0.1")
    parser.add_argument("--master_port", default=29500, type=int,
                        help="Master node (rank 0)'s free port that needs to "
                             "be used for communication during distributed "
                             "training")
    parser.add_argument("--use_env", default=False, action="store_true",
                        help="Use environment variable to pass "
                             "'local rank'. For legacy reasons, the default value is False. "
                             "If set to True, the script will not pass "
                             "--local_rank as argument, and will instead set LOCAL_RANK.")

    # positional
    parser.add_argument("training_script", type=str,
                        help="The full path to the single GPU training "
                             "program/script to be launched in parallel, "
                             "followed by all the arguments for the "
                             "training script")

    # rest from the training program
    parser.add_argument('training_script_args', nargs=REMAINDER)
    return parser.parse_args()


def main():
    args = parse_args()

    # world size in terms of number of processes
    dist_world_size = args.nproc_per_node * args.nnodes

    # set PyTorch distributed related environmental variables
    current_env = os.environ.copy()
    current_env["MASTER_ADDR"] = args.master_addr
    current_env["MASTER_PORT"] = str(args.master_port)
    current_env["WORLD_SIZE"] = str(dist_world_size)

    processes = []

    for local_rank in range(0, args.nproc_per_node):
        # each process's rank
        dist_rank = args.nproc_per_node * args.node_rank + local_rank
        current_env["RANK"] = str(dist_rank)
        current_env["LOCAL_RANK"] = str(local_rank)

        # spawn the processes
        if args.use_env:
            cmd = [sys.executable, "-u",
                   args.training_script] + args.training_script_args
        else:
            cmd = [sys.executable,
                   "-u",
                   args.training_script,
                   "--local_rank={}".format(local_rank)] + args.training_script_args

        process = subprocess.Popen(cmd, env=current_env)
        processes.append(process)

    for process in processes:
        process.wait()
        if process.returncode != 0:
            raise subprocess.CalledProcessError(returncode=process.returncode,
                                                cmd=process.args)


if __name__ == "__main__":
    main()