PyTorch: Easily Controlling Multiple CPUs/GPUs/TPUs with Accelerate

2024-03-09 10:30

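    • Foreword
    • Official example
    • Controlling multiple CPUs/GPUs/TPUs within a single program
      • A quick overview
      • Device environment
      • Imports
      • Loading the data: FashionMNIST
      • Building a simple CNN model
      • Training function: training only
      • Training function: training and validation
      • Training
    • Controlling multiple CPUs/GPUs/TPUs across multiple servers and programs
    • References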

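Foreword

  • CPU? GPU? TPU?
    • Too many kinds of compute devices to keep straight?
    • Rewriting large parts of your code every time the environment changes?
    • Not sure how to use multiple CPUs/GPUs/TPUs, or just want an easier way to do it?
  • Good news:
    • The Accelerate library from HuggingFace takes care of these problems. With only a few lines of code changed, your script adapts automatically to whatever compute devices are available.
  • Links
    • Official documentation: https://huggingface.co/docs/accelerate/index
    • GitHub: https://github.com/huggingface/accelerate
    • Installation (version >= 0.14 recommended): $ pip install accelerate
  • The rest of this article shows how to use it
    • You can also go straight to the complete Notebook example I put together on Kaggle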

Official example

  • First, a rough look at what it involves
  • Remove the old .to(device) calls, bring in Accelerator, and adapt how model, optimizer, data, and loss.backward() are handled
import torch
import torch.nn.functional as F
from datasets import load_dataset
from accelerate import Accelerator

# device = 'cpu'
accelerator = Accelerator()

# model = torch.nn.Transformer().to(device)
model = torch.nn.Transformer()
optimizer = torch.optim.Adam(model.parameters())

dataset = load_dataset('my_dataset')
data = torch.utils.data.DataLoader(dataset, shuffle=True)

model, optimizer, data = accelerator.prepare(model, optimizer, data)

model.train()
for epoch in range(10):
    for source, targets in data:
        # source = source.to(device)
        # targets = targets.to(device)

        optimizer.zero_grad()

        output = model(source)
        loss = F.cross_entropy(output, targets)

        # loss.backward()
        accelerator.backward(loss)

        optimizer.step()

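Controlling multiple CPUs/GPUs/TPUs within a single program

  • For details, see the official Examples

A quick overview

  • For a single compute device, changing your code as in the simple example above is all you need
  • With multiple compute devices (for example, several GPUs), a few things need special handling. The sections below walk through a complete PyTorch training example
    • You can compare it with the example I published earlier: CNN image classification with FashionMNIST
    • You can also go straight to the complete Notebook example I put together on Kaggle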

Device environment

  • Check the available GPUs (2 Tesla T4s) with the command $ nvidia-smi
Thu Apr 27 10:53:26 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.161.03   Driver Version: 470.161.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   43C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            Off  | 00000000:00:05.0 Off |                    0 |
| N/A   41C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
  • Install or upgrade Accelerate (run in a notebook cell): !pip install --upgrade accelerate

Imports

import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision.transforms import ToTensor, Compose
import torchvision.datasets as datasets
from accelerate import Accelerator
from accelerate import notebook_launcher

Loading the data: FashionMNIST

train_data = datasets.FashionMNIST(
    root="./data",
    train=True,
    download=True,
    transform=Compose([ToTensor()])
)

test_data = datasets.FashionMNIST(
    root="./data",
    train=False,
    download=True,
    transform=Compose([ToTensor()])
)

print(train_data.data.shape)
print(test_data.data.shape)

Building a simple CNN model

class CNNModel(nn.Module):
    def __init__(self):
        super(CNNModel, self).__init__()
        self.module1 = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        self.module2 = nn.Sequential(
            nn.Conv2d(32, 64, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        self.flatten = nn.Flatten()
        self.linear1 = nn.Linear(7 * 7 * 64, 64)
        self.linear2 = nn.Linear(64, 10)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.module1(x)
        out = self.module2(out)
        out = self.flatten(out)
        out = self.linear1(out)
        out = self.relu(out)
        out = self.linear2(out)
        return out
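
The 7 * 7 * 64 input size of linear1 follows from how each layer changes the 28x28 FashionMNIST image. The sketch below traces that arithmetic with the standard output-size formula for convolution and pooling layers, floor((n + 2p - k) / s) + 1. The helper name conv_out is my own and exists only for this illustration; it is not part of PyTorch.

```python
def conv_out(size, kernel, stride, padding=0):
    """Spatial output size of a conv/pool layer: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


size = 28                                              # FashionMNIST images are 28x28
size = conv_out(size, kernel=5, stride=1, padding=2)   # module1 conv keeps 28
size = conv_out(size, kernel=2, stride=2)              # module1 pool halves to 14
size = conv_out(size, kernel=5, stride=1, padding=2)   # module2 conv keeps 14
size = conv_out(size, kernel=2, stride=2)              # module2 pool halves to 7

flat_features = size * size * 64                       # module2 outputs 64 channels
print(size, flat_features)                             # 7 3136
```

If you change the kernel size, padding, or input resolution, rerunning this calculation tells you what the first argument of nn.Linear has to become.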

Training function: training only

  • Pay attention to the accelerator-related code
  • For multi-device training, the code at the end of the for epoch in range(epoch_num): loop is essential
def training_function():
    # Configuration
    epoch_num = 4
    batch_size = 64
    learning_rate = 0.005

    # device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

    # Data
    train_loader = DataLoader(dataset=train_data, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(test_data, batch_size=batch_size, shuffle=True)

    # Model / loss function / optimizer
    # model = CNNModel().to(device)
    model = CNNModel()
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

    accelerator = Accelerator()
    model, optimizer, train_loader, val_loader = accelerator.prepare(model, optimizer, train_loader, val_loader)

    # Start training
    for epoch in range(epoch_num):
        # Train
        model.train()
        for i, (X_train, y_train) in enumerate(train_loader):
            # X_train = X_train.to(device)
            # y_train = y_train.to(device)
            out = model(X_train)
            loss = criterion(out, y_train)

            optimizer.zero_grad()
            # loss.backward()
            accelerator.backward(loss)
            optimizer.step()

            if (i + 1) % 100 == 0:
                print(f"{accelerator.device} Train... [epoch {epoch + 1}/{epoch_num}, step {i + 1}/{len(train_loader)}]\t[loss {loss.item()}]")

        # Wait until every process has finished the current epoch
        accelerator.wait_for_everyone()
        # Unwrap into a separate variable so that `model` stays wrapped for the next epoch
        unwrapped_model = accelerator.unwrap_model(model)
        # The weights are now identical on every GPU, so the model can be saved
        accelerator.save(unwrapped_model, "model.pth")
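
To see why the end-of-epoch wait matters, here is a small single-machine analogy. It uses Python threads and threading.Barrier to stand in for the training processes and for accelerator.wait_for_everyone(). This is only a conceptual sketch, not how Accelerate is implemented, and the names run_epoch_with_barrier and events are made up for the illustration.

```python
import threading


def run_epoch_with_barrier(num_workers=2):
    """Simulate workers that finish an epoch, wait for each other, then save once."""
    events = []
    lock = threading.Lock()
    barrier = threading.Barrier(num_workers)

    def worker(rank):
        with lock:
            events.append(("epoch_done", rank))   # this worker finished its epoch
        barrier.wait()                            # analogue of wait_for_everyone()
        if rank == 0:                             # only the "main" worker saves
            with lock:
                events.append(("save", rank))

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events


events = run_epoch_with_barrier(num_workers=2)
print(events)
```

The order of the epoch_done entries can vary between runs, but the save entry always comes last, because no worker gets past the barrier until all of them have reached it.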

Training function: training and validation

  • Compared with the previous code, this version adds the validation step
  • Validation needs special care here: because several devices take part, the validation results from each device have to be combined
def training_function():
    # Configuration
    epoch_num = 4
    batch_size = 64
    learning_rate = 0.005

    # Data
    train_loader = DataLoader(dataset=train_data, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(test_data, batch_size=batch_size, shuffle=True)

    # Model / loss function / optimizer
    model = CNNModel()
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

    accelerator = Accelerator()
    model, optimizer, train_loader, val_loader = accelerator.prepare(model, optimizer, train_loader, val_loader)

    # Start training
    for epoch in range(epoch_num):
        # Train
        model.train()
        for i, (X_train, y_train) in enumerate(train_loader):
            out = model(X_train)
            loss = criterion(out, y_train)

            optimizer.zero_grad()
            accelerator.backward(loss)
            optimizer.step()

            if (i + 1) % 100 == 0:
                print(f"{accelerator.device} Train... [epoch {epoch + 1}/{epoch_num}, step {i + 1}/{len(train_loader)}]\t[loss {loss.item()}]")

        # Validate
        model.eval()
        correct, total = 0, 0
        for X_val, y_val in val_loader:
            with torch.no_grad():
                output = model(X_val)
                _, pred = torch.max(output, 1)
                # Combine the validation results from every GPU
                pred, y_val = accelerator.gather_for_metrics((pred, y_val))
                total += y_val.size(0)
                correct += (pred == y_val).sum()

        # Print the accuracy from the main process only
        accelerator.print(f'epoch {epoch + 1}/{epoch_num}, accuracy = {100 * (correct.item() / total):.2f}')

        # Wait until every process has finished the current epoch
        accelerator.wait_for_everyone()
        # Unwrap into a separate variable so that `model` stays wrapped for the next epoch
        unwrapped_model = accelerator.unwrap_model(model)
        # The weights are now identical on every GPU, so the model can be saved
        accelerator.save(unwrapped_model, "model.pth")
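
Why validation uses gather_for_metrics rather than a plain gather can be shown with a toy simulation. When the data does not divide evenly between processes, some samples are repeated so every process gets the same amount of work, and a plain gather would count those repeats in the accuracy. The pure-Python sketch below mimics that on sample indices. It is a simplification made for illustration (Accelerate shards batches rather than individual samples), and all three function names are hypothetical.

```python
import math


def shard_with_padding(num_samples, num_processes):
    """Give every process the same number of indices, repeating early ones to pad."""
    per_process = math.ceil(num_samples / num_processes)
    padded = [i % num_samples for i in range(per_process * num_processes)]
    # Process p takes every num_processes-th index, starting at p
    return [padded[p::num_processes] for p in range(num_processes)]


def plain_gather(shards):
    """Interleave the shards back together, duplicates included."""
    gathered = []
    for group in zip(*shards):
        gathered.extend(group)
    return gathered


def gather_for_metrics_sim(shards, num_samples):
    """Gather, then drop the padded duplicates at the end."""
    return plain_gather(shards)[:num_samples]


shards = shard_with_padding(num_samples=7, num_processes=2)
print(shards)                              # [[0, 2, 4, 6], [1, 3, 5, 0]]
print(plain_gather(shards))                # [0, 1, 2, 3, 4, 5, 6, 0]
print(gather_for_metrics_sim(shards, 7))   # [0, 1, 2, 3, 4, 5, 6]
```

In the plain gather, sample 0 appears twice and would be counted twice in both total and correct. Truncating to the true dataset length removes the repeat.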

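Training

  • If you are training locally, call the training_function defined above at the end of your script, then launch the script from the command line: $ accelerate launch example.py
training_function()
  • On Kaggle/Colab, use notebook_launcher to start training instead
# num_processes=2 means use 2 GPUs, because I requested 2 Nvidia T4s for this session
notebook_launcher(training_function, num_processes=2)
  • Below is sample console output from training on 2 GPUs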
Launching training on 2 GPUs.
cuda:0 Train... [epoch 1/4, step 100/469]	[loss 0.43843933939933777]
cuda:1 Train... [epoch 1/4, step 100/469]	[loss 0.5267877578735352]
cuda:0 Train... [epoch 1/4, step 200/469]	[loss 0.39918822050094604]
cuda:1 Train... [epoch 1/4, step 200/469]	[loss 0.2748252749443054]
cuda:1 Train... [epoch 1/4, step 300/469]	[loss 0.54105544090271]
cuda:0 Train... [epoch 1/4, step 300/469]	[loss 0.34716445207595825]
cuda:1 Train... [epoch 1/4, step 400/469]	[loss 0.2694844901561737]
cuda:0 Train... [epoch 1/4, step 400/469]	[loss 0.4343942701816559]
epoch 1/4, accuracy = 88.49
cuda:0 Train... [epoch 2/4, step 100/469]	[loss 0.19695354998111725]
cuda:1 Train... [epoch 2/4, step 100/469]	[loss 0.2911057770252228]
cuda:0 Train... [epoch 2/4, step 200/469]	[loss 0.2948791980743408]
cuda:1 Train... [epoch 2/4, step 200/469]	[loss 0.292676717042923]
cuda:0 Train... [epoch 2/4, step 300/469]	[loss 0.222089946269989]
cuda:1 Train... [epoch 2/4, step 300/469]	[loss 0.28814008831977844]
cuda:0 Train... [epoch 2/4, step 400/469]	[loss 0.3431250751018524]
cuda:1 Train... [epoch 2/4, step 400/469]	[loss 0.2546379864215851]
epoch 2/4, accuracy = 87.31
cuda:1 Train... [epoch 3/4, step 100/469]	[loss 0.24118559062480927]
cuda:0 Train... [epoch 3/4, step 100/469]	[loss 0.363821804523468]
cuda:0 Train... [epoch 3/4, step 200/469]	[loss 0.36783623695373535]
cuda:1 Train... [epoch 3/4, step 200/469]	[loss 0.18346744775772095]
cuda:0 Train... [epoch 3/4, step 300/469]	[loss 0.23459288477897644]
cuda:1 Train... [epoch 3/4, step 300/469]	[loss 0.2887689769268036]
cuda:0 Train... [epoch 3/4, step 400/469]	[loss 0.3079166114330292]
cuda:1 Train... [epoch 3/4, step 400/469]	[loss 0.18255220353603363]
epoch 3/4, accuracy = 88.46
cuda:1 Train... [epoch 4/4, step 100/469]	[loss 0.27428603172302246]
cuda:0 Train... [epoch 4/4, step 100/469]	[loss 0.17705145478248596]
cuda:1 Train... [epoch 4/4, step 200/469]	[loss 0.2811894416809082]
cuda:0 Train... [epoch 4/4, step 200/469]	[loss 0.22682836651802063]
cuda:0 Train... [epoch 4/4, step 300/469]	[loss 0.2291710525751114]
cuda:1 Train... [epoch 4/4, step 300/469]	[loss 0.32024848461151123]
cuda:0 Train... [epoch 4/4, step 400/469]	[loss 0.24648766219615936]
cuda:1 Train... [epoch 4/4, step 400/469]	[loss 0.0805584192276001]
epoch 4/4, accuracy = 89.38
  • Below is sample console output from training on 1 TPU
Launching training on CPU.
xla:0 Train... [epoch 1/4, step 100/938]	[loss 0.6051161289215088]
xla:0 Train... [epoch 1/4, step 200/938]	[loss 0.27442359924316406]
xla:0 Train... [epoch 1/4, step 300/938]	[loss 0.557417631149292]
xla:0 Train... [epoch 1/4, step 400/938]	[loss 0.1840067058801651]
xla:0 Train... [epoch 1/4, step 500/938]	[loss 0.5252436399459839]
xla:0 Train... [epoch 1/4, step 600/938]	[loss 0.2718536853790283]
xla:0 Train... [epoch 1/4, step 700/938]	[loss 0.2763175368309021]
xla:0 Train... [epoch 1/4, step 800/938]	[loss 0.39897507429122925]
xla:0 Train... [epoch 1/4, step 900/938]	[loss 0.28720396757125854]
epoch = 0, accuracy = 86.36
xla:0 Train... [epoch 2/4, step 100/938]	[loss 0.24496735632419586]
xla:0 Train... [epoch 2/4, step 200/938]	[loss 0.37713131308555603]
xla:0 Train... [epoch 2/4, step 300/938]	[loss 0.3106330633163452]
xla:0 Train... [epoch 2/4, step 400/938]	[loss 0.40438592433929443]
xla:0 Train... [epoch 2/4, step 500/938]	[loss 0.38303741812705994]
xla:0 Train... [epoch 2/4, step 600/938]	[loss 0.39199298620224]
xla:0 Train... [epoch 2/4, step 700/938]	[loss 0.38932573795318604]
xla:0 Train... [epoch 2/4, step 800/938]	[loss 0.26298171281814575]
xla:0 Train... [epoch 2/4, step 900/938]	[loss 0.21517205238342285]
epoch = 1, accuracy = 90.07
xla:0 Train... [epoch 3/4, step 100/938]	[loss 0.366019606590271]
xla:0 Train... [epoch 3/4, step 200/938]	[loss 0.27360212802886963]
xla:0 Train... [epoch 3/4, step 300/938]	[loss 0.2014923095703125]
xla:0 Train... [epoch 3/4, step 400/938]	[loss 0.21998485922813416]
xla:0 Train... [epoch 3/4, step 500/938]	[loss 0.28129786252975464]
xla:0 Train... [epoch 3/4, step 600/938]	[loss 0.42534705996513367]
xla:0 Train... [epoch 3/4, step 700/938]	[loss 0.22158119082450867]
xla:0 Train... [epoch 3/4, step 800/938]	[loss 0.359947144985199]
xla:0 Train... [epoch 3/4, step 900/938]	[loss 0.3221997022628784]
epoch = 2, accuracy = 90.36
xla:0 Train... [epoch 4/4, step 100/938]	[loss 0.2814193069934845]
xla:0 Train... [epoch 4/4, step 200/938]	[loss 0.16465164721012115]
xla:0 Train... [epoch 4/4, step 300/938]	[loss 0.2897304892539978]
xla:0 Train... [epoch 4/4, step 400/938]	[loss 0.13403896987438202]
xla:0 Train... [epoch 4/4, step 500/938]	[loss 0.1135573536157608]
xla:0 Train... [epoch 4/4, step 600/938]	[loss 0.14964193105697632]
xla:0 Train... [epoch 4/4, step 700/938]	[loss 0.20239461958408356]
xla:0 Train... [epoch 4/4, step 800/938]	[loss 0.23625142872333527]
xla:0 Train... [epoch 4/4, step 900/938]	[loss 0.3418393135070801]
epoch = 3, accuracy = 90.11
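
The step counts in the two logs above (469 steps per process on 2 GPUs, 938 on a single device) can be reproduced with a little arithmetic. The sketch assumes the 60,000-image FashionMNIST training set, that batch_size=64 applies per process, and that the batches are divided evenly among processes with the last share rounded up. The function name steps_per_process is hypothetical.

```python
import math


def steps_per_process(num_samples, batch_size, num_processes):
    """Number of batches each process iterates over in one epoch."""
    total_batches = math.ceil(num_samples / batch_size)   # batches in the full dataset
    return math.ceil(total_batches / num_processes)       # share handled by one process


print(steps_per_process(60000, 64, 1))   # 938
print(steps_per_process(60000, 64, 2))   # 469
```

Each process still uses 64 samples per step, so with 2 GPUs one optimizer step covers 128 samples in total. That is why the epoch finishes in half as many steps.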

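Controlling multiple CPUs/GPUs/TPUs across multiple servers and programs

  • For details, see the official Examples
  • This covers
    • Multiple programs controlling multiple compute devices within a single server
    • Multiple programs controlling multiple compute devices across several servers
  • Once the code is written, first run $ accelerate config on every server to generate its configuration file. Here is a sample session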
(huggingface) PS C:\Users\alion\temp> accelerate config
------------------------------------------------------------------------------------------------------------------------
In which compute environment are you running?
This machine
------------------------------------------------------------------------------------------------------------------------
Which type of machine are you using?
multi-GPU
How many different machines will you use (use more than 1 for multi-node training)? [1]: 2
------------------------------------------------------------------------------------------------------------------------
What is the rank of this machine?
0
What is the IP address of the machine that will host the main process? 192.168.101
What is the port you will use to communicate with the main process? 12345
Are all the machines on the same local network? Answer `no` if nodes are on the cloud and/or on different network hosts [YES/no]: yes
Do you wish to optimize your script with torch dynamo?[yes/NO]:no
Do you want to use DeepSpeed? [yes/NO]: no
Do you want to use FullyShardedDataParallel? [yes/NO]: no
Do you want to use Megatron-LM ? [yes/NO]: no
How many GPU(s) should be used for distributed training? [1]:2
What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]:0
------------------------------------------------------------------------------------------------------------------------
Do you wish to use FP16 or BF16 (mixed precision)?
fp16
accelerate configuration saved at C:\Users\alion/.cache\huggingface\accelerate\default_config.yaml
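  • Finally, launch the training script on every server: $ accelerate launch example.py (if you are running multiple programs on a single server, launching it on that one server is enough)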
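
As far as I know, accelerate launch also accepts the settings from the config session as command-line flags, which is handy when only the machine rank differs between servers. The helper below only builds the command string for each rank and runs nothing. The function name build_launch_command and the placeholder MAIN_NODE_IP are hypothetical, and the flag list is an assumption you should check against accelerate launch --help for your installed version.

```python
import shlex


def build_launch_command(machine_rank, main_process_ip, script="example.py",
                         num_machines=2, num_processes=2,
                         main_process_port=12345, mixed_precision="fp16"):
    """Build an `accelerate launch` command line for one machine (assumed flags)."""
    args = [
        "accelerate", "launch",
        "--multi_gpu",
        "--num_machines", str(num_machines),
        "--num_processes", str(num_processes),       # total across all machines
        "--machine_rank", str(machine_rank),          # 0 on the main machine
        "--main_process_ip", main_process_ip,
        "--main_process_port", str(main_process_port),
        "--mixed_precision", mixed_precision,
        script,
    ]
    return shlex.join(args)


# MAIN_NODE_IP is a placeholder; substitute the address of the rank-0 machine
for rank in range(2):
    print(build_launch_command(rank, "MAIN_NODE_IP"))
```

Run the command printed for rank 0 on the main machine and the one for rank 1 on the second machine. Only the --machine_rank value should differ between them.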

References

  • https://github.com/huggingface/accelerate
  • https://www.kaggle.com/code/muellerzr/multi-gpu-and-accelerate
  • https://github.com/huggingface/notebooks/blob/main/examples/accelerate_examples/simple_nlp_example.ipynb
  • https://github.com/huggingface/accelerate/tree/main/examples
