
A roundup of links related to DistributedDataParallel

https://pytorch.org/docs/stable/notes/cuda.html?highlight=torch%20distributed%20init_process_group 

 

CUDA semantics — PyTorch 1.10.1 documentation

torch.cuda is used to set up and run CUDA operations. It keeps track of the currently selected GPU, and all CUDA tensors you allocate will by default be created on that device. The selected device can be changed with a torch.cuda.device context manager.
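
The note above is the basis for understanding which GPU a DDP process actually allocates tensors on. A minimal sketch of the device-selection behaviour it describes (assumes a machine with at least two GPUs; the device index is just an example):

import torch

# Tensors go to the currently selected CUDA device by default (cuda:0 at start).
x = torch.ones(2, 2, device="cuda")

# The selected device can be changed with the torch.cuda.device context manager.
with torch.cuda.device(1):  # assumes at least 2 GPUs are visible
    y = torch.ones(2, 2, device="cuda")  # allocated on cuda:1 inside this block

# In DDP code the usual pattern is to pin each process to one GPU up front:
# torch.cuda.set_device(local_rank)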

https://pytorch.org/docs/stable/notes/ddp.html?highlight=torch%20distributed%20init_process_group 

 

Distributed Data Parallel — PyTorch 1.10.1 documentation

torch.nn.parallel.DistributedDataParallel (DDP) transparently performs distributed data parallel training. This page describes how it works and reveals implementation details, starting from a simple DistributedDataParallel example.
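
A minimal sketch of the pattern this page describes, assuming the process is launched by torchrun / torch.distributed.launch so that RANK, WORLD_SIZE and LOCAL_RANK are set in the environment (the helper name is mine):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_and_wrap(model: torch.nn.Module) -> DDP:
    # Join the default process group; rank/world size come from the environment.
    dist.init_process_group(backend="nccl", init_method="env://")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = model.cuda(local_rank)
    # DDP broadcasts the initial weights from rank 0 and all-reduces gradients
    # during backward, so every replica stays in sync.
    return DDP(model, device_ids=[local_rank])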

https://pytorch.org/docs/stable/distributed.html#initialization

 

Distributed communication package - torch.distributed — PyTorch 1.10.1 documentation (Initialization section)

https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group

 

Distributed communication package - torch.distributed — PyTorch 1.10.1 documentation (torch.distributed.init_process_group)
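
The initialization section boils down to two common styles; a sketch of both (the backend choice and the TCP address/port are placeholder values):

import torch.distributed as dist

# 1) Environment-variable initialization: MASTER_ADDR, MASTER_PORT, RANK and
#    WORLD_SIZE are expected to be set by the launcher (e.g. torchrun).
dist.init_process_group(backend="nccl", init_method="env://")

# 2) TCP initialization with explicit rank/world size (placeholder address):
# dist.init_process_group(
#     backend="nccl",
#     init_method="tcp://10.0.0.1:23456",
#     rank=0,
#     world_size=4,
# )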

https://developer0hye.tistory.com/entry/PyTorch-DistributedDataParallel-%EC%98%88%EC%8B%9C-%EC%BD%94%EB%93%9C-%EB%B0%8F-%EC%B0%B8%EA%B3%A0-%EC%9E%90%EB%A3%8C-%EB%AA%A8%EC%9D%8C

 

[PyTorch] DistributedDataParallel example code and collected reference material

Previously, on a single node, multiple GPUs system (just think of one PC with several GPUs plugged in; the official PyTorch docs use this wording, so I follow it), I had been using the DataParallel module to make use of multiple GPUs...
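
Since the post starts from the older single-process nn.DataParallel pattern before switching to DDP, here is a minimal sketch of that pattern for contrast (one process, no init_process_group needed):

import torch
import torch.nn as nn

model = nn.Linear(10, 10).cuda()

# nn.DataParallel: the input batch is scattered across all visible GPUs on each
# forward pass and the outputs are gathered back onto the default device.
dp_model = nn.DataParallel(model)
out = dp_model(torch.randn(64, 10).cuda())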

https://github.com/pytorch/examples/blob/151944ecaf9ba2c8288ee550143ae7ffdaa90a80/imagenet/main.py#L81

 

GitHub - pytorch/examples: A set of examples around PyTorch in Vision, Text, Reinforcement Learning, etc. (imagenet/main.py)
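
The ImageNet example pairs DDP with DistributedSampler so each rank only sees its own shard of the dataset. A rough sketch of that pattern with a toy dataset standing in for ImageFolder (assumes init_process_group was already called):

import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Placeholder dataset; the real example uses torchvision's ImageFolder.
dataset = TensorDataset(torch.randn(256, 3, 32, 32), torch.randint(0, 10, (256,)))

# Each rank gets a disjoint shard; the sampler reads rank/world size from the
# default process group, so it must be initialized first.
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

for epoch in range(3):
    sampler.set_epoch(epoch)  # reshuffle differently each epoch
    for images, targets in loader:
        pass  # training step goes here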

https://github.com/facebookresearch/deit/blob/main/main.py

 

GitHub - facebookresearch/deit: Official DeiT repository


https://tutorials.pytorch.kr/intermediate/dist_tuto.html

 

Writing Distributed Applications with PyTorch (Korean translation) — PyTorch Tutorials 1.10.0+cu102 documentation

Author: Séb Arnold, translated by 박정환. This short tutorial walks through PyTorch's distributed package: how to set up the distributed environment...

https://github.com/seba-1511/dist_tuto.pth/blob/gh-pages/train_dist.py

 

GitHub - seba-1511/dist_tuto.pth: Official code for "Writing Distributed Applications with PyTorch", PyTorch Tutorial

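
The tutorial code averages gradients by hand with collectives instead of relying on DDP; a sketch in that spirit (assumes the process group is already initialized):

import torch
import torch.distributed as dist

def average_gradients(model: torch.nn.Module) -> None:
    # All-reduce each gradient across ranks and divide by the world size,
    # which is the manual version of what DDP does automatically in backward.
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size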

https://stackoverflow.com/questions/66498045/how-to-solve-dist-init-process-group-from-hanging-or-deadlocks

 

How to solve dist.init_process_group from hanging (or deadlocks)?

I was trying to set up DDP (distributed data parallel) on a DGX A100 but it doesn't work. Whenever I try to run it, it simply hangs. My code is super simple, just spawning 4 processes for 4 GPUs...
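
Common causes of this kind of hang, shown as a sketch (the address/port values are placeholders, not from the question): every process must agree on the rendezvous endpoint and world size, and should be pinned to its own GPU before any collective runs.

import os
import torch
import torch.distributed as dist

def init_worker(rank: int, world_size: int) -> None:
    # All ranks must see the same master address/port (example values).
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")

    # Pin each process to its own GPU before any collective call;
    # two ranks sharing cuda:0 is a common cause of NCCL deadlocks.
    torch.cuda.set_device(rank)

    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    dist.barrier()  # hangs here if any rank never joined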

https://github.com/pytorch/examples/tree/master/distributed/ddp

 

GitHub - pytorch/examples: A set of examples around PyTorch in Vision, Text, Reinforcement Learning, etc. (distributed/ddp)

https://pytorch.org/tutorials/intermediate/ddp_tutorial.html

 

Getting Started with Distributed Data Parallel — PyTorch Tutorials 1.10.1+cu102 documentation

Author: Shen Li. Edited by: Joe Zhu. DistributedDataParallel (DDP) implements data parallelism at the module level and can run across multiple machines. Applications using DDP should spawn multiple processes and create a single DDP instance per process.
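
The tutorial spawns one process per GPU with torch.multiprocessing and creates a single DDP instance in each; a condensed sketch of that pattern (the localhost address/port and the tiny model are placeholders):

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def run(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = DDP(nn.Linear(10, 10).cuda(rank), device_ids=[rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    # One dummy step: gradients are all-reduced across ranks during backward.
    loss = model(torch.randn(20, 10, device=rank)).sum()
    loss.backward()
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(run, args=(world_size,), nprocs=world_size, join=True)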

 
