PyTorch Demo-5: Multi-GPU Training Pitfalls

2024-09-05 01:38

Tags: training, gpu, pytorch, demo


When the dataset or the model gets large, a single GPU is often no longer enough, and multi-GPU training becomes necessary to handle larger workloads.

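There are already plenty of detailed articles on multi-GPU training in PyTorch, for example:
https://zhuanlan.zhihu.com/p/98535650
https://zhuanlan.zhihu.com/p/74792767
They all write it a little differently, though, and I ran into a few problems when merging the approach into my own legacy code. These notes record the pitfalls.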

DataParallel (DP)

The simplest option is DP: wrap the model in nn.DataParallel and you are done. See the links above for more detail.

gpus = [0, 1]
# .cuda() takes a single device index, so move the model to the primary GPU first
model = model.cuda(gpus[0])
model = nn.DataParallel(model, device_ids=gpus, output_device=gpus[0])

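During training, move each batch to the GPU with non_blocking=True (see the non_blocking documentation). The asynchronous copy only takes effect when the DataLoader is created with pin_memory=True: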

for idx, (images, target) in enumerate(train_loader):
    images = images.cuda(non_blocking=True)
    target = target.cuda(non_blocking=True)

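DP only works on a single machine with multiple GPUs. The primary GPU scatters the inputs and gathers the results, so the load is noticeably unbalanced: the primary GPU typically uses 1-2 GB more memory than the others. DP is also less efficient than DDP.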

DistributedDataParallel (DDP)

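DDP uses the all-reduce algorithm and works for both multi-machine and single-machine multi-GPU setups. The linked articles explain the details more clearly.
Main steps: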

Define a local_rank argument in the argument parser. It identifies the GPU used by the current process.

parser.add_argument('--local_rank', default=-1, type=int,help='node rank for distributed training')

Initialize the communication backend and port, and set the GPU for the current process

torch.distributed.init_process_group(backend='nccl')
torch.cuda.set_device(args.local_rank)
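
The launcher used later in this post passes the rank as a command-line flag, while the newer torchrun launcher exports it as the LOCAL_RANK environment variable instead. A minimal sketch of reading either one is shown below. The helper name resolve_local_rank and the fallback to 0 are assumptions for illustration, not part of the original code, and the hyphenated flag spelling is included only because some launcher versions appear to use it.

```python
import argparse
import os


def resolve_local_rank(argv=None, env=None):
    """Return the local rank from --local_rank, else LOCAL_RANK, else 0.

    Hypothetical helper: the precedence order is an assumption.
    """
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser()
    # -1 means "the launcher did not pass the flag".
    # Both spellings are accepted; argparse stores the value as args.local_rank.
    parser.add_argument('--local_rank', '--local-rank', default=-1, type=int,
                        help='index of the GPU used by this process')
    # parse_known_args ignores the script's other arguments
    args, _unknown = parser.parse_known_args(argv)
    if args.local_rank >= 0:
        return args.local_rank
    # torchrun sets LOCAL_RANK; default to 0 so single-GPU runs still work
    return int(env.get('LOCAL_RANK', 0))
```

The returned value can then be passed to torch.cuda.set_device in place of args.local_rank.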

Shard the training data

trainset = ...
train_sampler = None
# Only create the sampler when running on multiple GPUs
if use_multi_gpus:
    train_sampler = torch.utils.data.distributed.DistributedSampler(trainset)
train_loader = DataLoader(trainset, batch_size=..., shuffle=(train_sampler is None), sampler=train_sampler)
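
One related pitfall worth illustrating: DistributedSampler derives its shuffle from the seed plus the current epoch, so set_epoch should be called at the start of every epoch or each epoch sees the same order. The sketch below is only an illustration. It uses a stand-in list as the dataset and passes num_replicas and rank explicitly so it runs without init_process_group, whereas real DDP code lets the sampler infer both.

```python
from torch.utils.data.distributed import DistributedSampler

# Stand-in dataset (assumption): any sized, indexable object works here
dataset = list(range(10))

# One sampler per simulated rank; rank and world size are given explicitly
# so that no process group is needed for this sketch.
samplers = [
    DistributedSampler(dataset, num_replicas=2, rank=r, shuffle=True, seed=0)
    for r in range(2)
]


def shards_for(epoch):
    """Return the index shard each rank would see in the given epoch."""
    for sampler in samplers:
        # Changes the shuffle for this epoch; skipping it repeats the same order
        sampler.set_epoch(epoch)
    return [list(sampler) for sampler in samplers]


epoch0 = shards_for(0)
epoch1 = shards_for(1)
```

In a training loop this amounts to calling train_sampler.set_epoch(epoch) before iterating over train_loader, guarded by a check that train_sampler is not None.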

Wrap the model

model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank])

Move the data during training

for idx, (images, target) in enumerate(train_loader):
    images = images.cuda(non_blocking=True)
    target = target.cuda(non_blocking=True)

DDP has to be launched from the terminal

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 torch_ddp.py

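Integrating with legacy code

Legacy code: Git
Because DDP runs several processes, logging and saving from all of them can collide and produce interleaved output. You can print from a single rank only, or tag every message with its rank. If the default value of local_rank is set to 0, the same code also works on a single GPU: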

# Print from one rank only
if args.local_rank == 0:
    print(f'loss:{loss:.4f}, acc:{acc:.4f} ...')

# Or include the rank in every message
print(f'rank:{args.local_rank} loss:{loss:.4f}, acc:{acc:.4f} ...')
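
If the legacy code already uses the logging module, the same idea can be moved into the logger so that individual print sites do not need a rank check. This is only a sketch under that assumption. The function name get_logger and the choice to keep warnings visible on non-zero ranks are illustrative, not from the original code.

```python
import logging


def get_logger(local_rank, name='train'):
    """Return a logger that shows INFO on rank 0 and only warnings elsewhere."""
    logger = logging.getLogger(f'{name}.rank{local_rank}')
    if not logger.handlers:
        handler = logging.StreamHandler()
        # Prefix every message with the rank that produced it
        handler.setFormatter(logging.Formatter(f'[rank {local_rank}] %(message)s'))
        logger.addHandler(handler)
    # Non-zero ranks stay quiet unless something goes wrong
    logger.setLevel(logging.INFO if local_rank == 0 else logging.WARNING)
    # Avoid duplicate lines through the root logger
    logger.propagate = False
    return logger
```

A call such as get_logger(args.local_rank).info(f'loss:{loss:.4f}') would then print on rank 0 and be dropped on the other ranks.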

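Checkpoints are usually saved during evaluation. One option is to save only when local_rank == 0. In practice, every GPU holds an identical copy of the model at evaluation time, so it is enough to evaluate once, which can be specified directly in the epoch loop: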

for epoch in range(total_epoch):
    train()
    scheduler.step()
    if args.local_rank == 0:
        test()

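When saving in DDP mode, the state_dict of the wrapped model has every parameter key prefixed with module. Take the state_dict from model.module before saving to get the usual keys: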

if isinstance(model, nn.parallel.distributed.DistributedDataParallel):
    state = {'net': model.module.state_dict(), 'acc': acc, 'epoch': epoch}
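
For checkpoints that were already saved from the wrapped model, the prefix can also be removed at load time. The helper below is a small sketch of that. Its name, strip_module_prefix, is hypothetical, and it works on any mapping of key to value, so it does not depend on PyTorch itself.

```python
from collections import OrderedDict


def strip_module_prefix(state_dict, prefix='module.'):
    """Return a copy of state_dict with a leading 'module.' removed from keys.

    Keys without the prefix are kept unchanged, so the function is safe to
    call on checkpoints saved from an unwrapped model as well.
    """
    return OrderedDict(
        (key[len(prefix):] if key.startswith(prefix) else key, value)
        for key, value in state_dict.items()
    )
```

The result can be passed to load_state_dict on a plain, unwrapped model.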

DDP may fail at startup with an address conflict. In that case, add an address and port to the launch command (see the reference):

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 --master_addr 127.0.0.3 --master_port 23456 torch_ddp.py
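
Rather than guessing a port number, one option is to ask the operating system for a free one and pass it as --master_port. The following is a rough sketch using only the standard library. The function name find_free_port is an assumption, and there is a small race window because another process could take the port between this call and the actual launch.

```python
import socket


def find_free_port(host='127.0.0.1'):
    """Ask the OS for a currently unused TCP port on the given host."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # Binding to port 0 lets the OS choose any free port
        sock.bind((host, 0))
        return sock.getsockname()[1]


if __name__ == '__main__':
    # Print the port so a shell script can capture it for --master_port
    print(find_free_port())
```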

For the error that asks for find_unused_parameters=True, see:
Reference 1, Reference 2
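
This error usually means that some parameters did not receive a gradient in the backward pass. Passing find_unused_parameters=True to DistributedDataParallel is the common workaround, but it can help to first find out which parameters are affected. The sketch below does that on CPU without DDP. The TwoBranch model and the helper unused_parameter_names are made up for illustration.

```python
import torch
import torch.nn as nn


class TwoBranch(nn.Module):
    """Toy model (hypothetical) with one layer that forward never uses."""

    def __init__(self):
        super().__init__()
        self.used = nn.Linear(4, 2)
        self.unused = nn.Linear(4, 2)  # registered but never called

    def forward(self, x):
        return self.used(x)


def unused_parameter_names(model, loss):
    """Backpropagate loss and list trainable parameters that got no gradient."""
    model.zero_grad(set_to_none=True)
    loss.backward()
    return [name for name, param in model.named_parameters()
            if param.requires_grad and param.grad is None]


model = TwoBranch()
loss = model(torch.randn(3, 4)).sum()
names = unused_parameter_names(model, loss)
print(names)
```

If the listed parameters are genuinely dead code, removing them or freezing them with requires_grad = False avoids the need for the flag.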
