`torch.distributed.launch` is a module that spawns up multiple distributed
training processes on each of the training nodes.

The utility can be used for single-node distributed training, in which one or
more processes per node will be spawned. The utility can be used for either
CPU training or GPU training. If the utility is used for GPU training,
each distributed process will be operating on a single GPU. This can
significantly improve single-node training performance. It can also be used in
multi-node distributed training, by spawning up multiple processes on each node
to improve multi-node distributed training performance as well.
This is especially beneficial for systems with multiple InfiniBand
interfaces that have direct-GPU support, since all of them can be utilized for
aggregated communication bandwidth.

In both cases of single-node distributed training and multi-node distributed
training, this utility will launch the given number of processes per node
(``--nproc_per_node``). If used for GPU training, this number needs to be less
than or equal to the number of GPUs on the current system (``nproc_per_node``),
and each process will be operating on a single GPU from *GPU 0 to
GPU (nproc_per_node - 1)*.

**How to use this module:**

1. Single-Node multi-process distributed training

::

    >>> python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE
               YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3 and all other
               arguments of your training script)

2. Multi-Node multi-process distributed training: (e.g. two nodes)

Node 1: *(IP: 192.168.1.1, and has a free port: 1234)*

::

    >>> python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE
               --nnodes=2 --node_rank=0 --master_addr="192.168.1.1"
               --master_port=1234 YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3
               and all other arguments of your training script)

Node 2:

::

    >>> python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE
               --nnodes=2 --node_rank=1 --master_addr="192.168.1.1"
               --master_port=1234 YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3
               and all other arguments of your training script)

3. To look up what optional arguments this module offers:

::

    >>> python -m torch.distributed.launch --help

**Important Notices:**

1. This utility and multi-process distributed (single-node or
multi-node) GPU training currently only achieves the best performance using
the NCCL distributed backend. Thus NCCL backend is the recommended backend to
use for GPU training.

2. In your training program, you must parse the command-line argument:
``--local_rank=LOCAL_PROCESS_RANK``, which will be provided by this module.
If your training program uses GPUs, you should ensure that your code only
runs on the GPU device of LOCAL_PROCESS_RANK. This can be done by:

Parsing the local_rank argument

::

    >>> import argparse
    >>> parser = argparse.ArgumentParser()
    >>> parser.add_argument("--local_rank", type=int)
    >>> args = parser.parse_args()

Setting your device to the local rank using either

::

    >>> torch.cuda.set_device(args.local_rank)  # before your code runs

or

::

    >>> with torch.cuda.device(args.local_rank):
    >>>     # your code to run

3. In your training program, you are supposed to call the following function
at the beginning to start the distributed backend. You need to make sure that
the init_method uses ``env://``, which is the only ``init_method`` supported
by this module.

::

    torch.distributed.init_process_group(backend='YOUR BACKEND',
                                         init_method='env://')
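For reference, ``env://`` initialization picks up ``MASTER_ADDR``,
``MASTER_PORT``, ``RANK`` and ``WORLD_SIZE`` from environment variables, all of
which this launcher exports for every process it spawns (see the code at the
end of this module). The short summary below is only an orientation aid, not an
exhaustive description of ``env://``::

    MASTER_ADDR  # address of the rank 0 node (--master_addr)
    MASTER_PORT  # free port on the rank 0 node (--master_port)
    WORLD_SIZE   # total number of processes: nproc_per_node * nnodes
    RANK         # global rank of the current process
    LOCAL_RANK   # rank of the process within its node (not read by env://)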
4. In your training program, you can either use regular distributed functions
or use :func:`torch.nn.parallel.DistributedDataParallel` module. If your
training program uses GPUs for training and you would like to use
:func:`torch.nn.parallel.DistributedDataParallel` module,
here is how to configure it.

::

    model = torch.nn.parallel.DistributedDataParallel(model,
                                                      device_ids=[args.local_rank],
                                                      output_device=args.local_rank)

Please ensure that the ``device_ids`` argument is set to the only GPU device id
that your code will be operating on. This is generally the local rank of the
process. In other words, ``device_ids`` needs to be ``[args.local_rank]``,
and ``output_device`` needs to be ``args.local_rank`` in order to use this
utility.

5. Another way to pass ``local_rank`` to the subprocesses is via the
environment variable ``LOCAL_RANK``. This behavior is enabled when you launch
the script with ``--use_env``. You must adjust the subprocess example above to
replace ``args.local_rank`` with ``os.environ['LOCAL_RANK']``; the launcher
will not pass ``--local_rank`` when you specify this flag.

.. warning::

    ``local_rank`` is NOT globally unique: it is only unique per process
    on a machine. Thus, don't use it to decide if you should, e.g.,
    write to a networked filesystem. See
    https://github.com/pytorch/pytorch/issues/12042 for an example of
    how things can go wrong if you don't do this correctly.
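Putting the notes above together, a minimal training-script skeleton that is
compatible with this launcher could look like the sketch below. The model and
the training loop are placeholders chosen for illustration; only the
distributed setup follows notes 2-5::

    import argparse
    import os

    import torch
    import torch.distributed as dist


    def main():
        parser = argparse.ArgumentParser()
        parser.add_argument("--local_rank", type=int, default=0)
        args = parser.parse_args()

        # With --use_env the launcher passes the local rank only through the
        # LOCAL_RANK environment variable, so prefer it when present (note 5).
        local_rank = int(os.environ.get("LOCAL_RANK", args.local_rank))

        # Bind this process to its GPU before doing any CUDA work (note 2).
        torch.cuda.set_device(local_rank)

        # MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE are exported by the
        # launcher, so env:// initialization needs no extra arguments (note 3).
        dist.init_process_group(backend="nccl", init_method="env://")

        # Placeholder model, wrapped as described in note 4.
        model = torch.nn.Linear(10, 10).cuda(local_rank)
        model = torch.nn.parallel.DistributedDataParallel(
            model, device_ids=[local_rank], output_device=local_rank)

        # ... the regular training loop over `model` goes here ...


    if __name__ == "__main__":
        main()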
import sys
import subprocess
import os
from argparse import ArgumentParser, REMAINDER

def parse_args():
    """
    Helper function parsing the command line options
    @retval ArgumentParser
    """
    parser = ArgumentParser(description="PyTorch distributed training launch "
                                        "helper utility that will spawn up "
                                        "multiple distributed processes")

    # Optional arguments for the launch helper
    parser.add_argument("--nnodes", type=int, default=1,
                        help="The number of nodes to use for distributed "
                             "training")
    parser.add_argument("--node_rank", type=int, default=0,
                        help="The rank of the node for multi-node distributed "
                             "training")
    parser.add_argument("--nproc_per_node", type=int, default=1,
                        help="The number of processes to launch on each node, "
                             "for GPU training, this is recommended to be set "
                             "to the number of GPUs in your system so that "
                             "each process can be bound to a single GPU.")
    parser.add_argument("--master_addr", default="127.0.0.1", type=str,
                        help="Master node (rank 0)'s address, should be either "
                             "the IP address or the hostname of node 0, for "
                             "single node multi-proc training, the "
                             "--master_addr can simply be 127.0.0.1")
    parser.add_argument("--master_port", default=29500, type=int,
                        help="Master node (rank 0)'s free port that needs to "
                             "be used for communication during distributed "
                             "training")
    parser.add_argument("--use_env", default=False, action="store_true",
                        help="Use environment variable to pass "
                             "'local rank'. For legacy reasons, the default value is False. "
                             "If set to True, the script will not pass "
                             "--local_rank as argument, and will instead set LOCAL_RANK.")

    # positional argument: the training script to launch
    parser.add_argument("training_script", type=str,
                        help="The full path to the single GPU training "
                             "program/script to be launched in parallel, "
                             "followed by all the arguments for the "
                             "training script")

    # everything after the training script is passed through to it
    parser.add_argument('training_script_args', nargs=REMAINDER)
    return parser.parse_args()
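
# Illustrative example (hypothetical command line, not taken from this
# module's documentation): running
#   python -m torch.distributed.launch --nproc_per_node=2 train.py --lr 0.1
# yields args.nproc_per_node == 2, args.training_script == "train.py", and,
# because of nargs=REMAINDER, args.training_script_args == ["--lr", "0.1"].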

def main():
    args = parse_args()

    # world size in terms of the total number of processes across all nodes
    dist_world_size = args.nproc_per_node * args.nnodes

    # set PyTorch distributed related environment variables
    current_env = os.environ.copy()
    current_env["MASTER_ADDR"] = args.master_addr
    current_env["MASTER_PORT"] = str(args.master_port)
    current_env["WORLD_SIZE"] = str(dist_world_size)

    processes = []

    for local_rank in range(0, args.nproc_per_node):
        # each process's global rank
        dist_rank = args.nproc_per_node * args.node_rank + local_rank
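        # Worked example with hypothetical values (not from the source): for
        # --nnodes=2, --nproc_per_node=4 and --node_rank=1, the process with
        # local_rank=2 gets RANK = 4 * 1 + 2 = 6 out of WORLD_SIZE = 8.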
        current_env["RANK"] = str(dist_rank)
        current_env["LOCAL_RANK"] = str(local_rank)

        # spawn the process; with --use_env the local rank is communicated
        # only through the LOCAL_RANK environment variable, otherwise it is
        # also appended as the --local_rank argument
        if args.use_env:
            cmd = [sys.executable, "-u",
                   args.training_script] + args.training_script_args
        else:
            cmd = [sys.executable,
                   "-u",
                   args.training_script,
                   "--local_rank={}".format(local_rank)] + args.training_script_args

        process = subprocess.Popen(cmd, env=current_env)
        processes.append(process)

    # wait for all processes and surface the first non-zero exit code
    for process in processes:
        process.wait()
        if process.returncode != 0:
            raise subprocess.CalledProcessError(returncode=process.returncode,
                                                cmd=cmd)

if __name__ == "__main__":
    main()