LLM Fine-Tuning: Methods for Fine-Tuning Large Models

2023-11-29 06:30
Tags: methods, models, llm, tuning, fine-tuning, FT

This article introduces methods for fine-tuning large language models (LLM FT), and we hope it serves as a useful reference for developers; follow along to learn more.

Here we discuss fine-tuning Llama 2 with a couple of different recipes, covering two scenarios:

1. Parameter Efficient Model Fine-Tuning

This helps make the fine-tuning process more affordable, even on a single consumer-grade GPU. These methods keep the whole model frozen and add only tiny learnable parameters/layers, so we train just a very small fraction of the parameters. The best-known methods in this category are LoRA, LLaMA-Adapter, and prefix-tuning.

These methods will address three aspects:

  1. Cost of full fine-tuning – these methods only train a small set of extra parameters instead of the full model, which makes it possible to run them on consumer GPUs.

  2. Cost of deployment – normally each fine-tuned downstream task requires deploying a separate copy of the model; with these methods, only a small set of extra parameters (a few MB instead of several GB) needs to be stored per task. The pretrained model acts as a shared backbone, and for each task we simply add its extra parameters on top, like task-specific heads.

  3. Catastrophic forgetting – because the pretrained weights stay frozen, these methods also help avoid the forgetting of the original task that can happen during fine-tuning.

The HF PEFT library provides an easy way of using these methods, and we make use of it here. Please read more here.
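To make this concrete, below is a minimal LoRA sketch with the PEFT library. The model name, rank r, lora_alpha, and target_modules values are illustrative assumptions rather than any recipe's exact settings:

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

# base model stays frozen; only the injected adapters will be trained
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.bfloat16
)

# illustrative LoRA settings: low-rank adapters on the attention projections
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                   # rank of the low-rank update
    lora_alpha=32,                         # scaling applied to the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # assumed Llama projection names
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters
```

Because only the adapter weights are saved per task, the artifact you store and deploy for each downstream task is a few MB, which is exactly the deployment saving described above.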

2. Full/Partial Parameter Fine-Tuning

Full parameter fine-tuning has its own advantages; within it, there are multiple strategies that can help:

  1. Keep the pretrained model frozen and only fine-tune the task head, for example the classifier head.

  2. Keep the pretrained model frozen and add a few fully connected layers on top.

  3. Fine-tune all the layers.

You can also keep most of the layers frozen and fine-tune only a few. There are many different techniques to choose from for freezing/unfreezing layers based on different criteria, as sketched below.
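As a rough illustration of criterion-based freezing, here is a minimal PyTorch sketch that freezes everything and then unfreezes only the last few transformer blocks. The model.model.layers attribute path assumes a Hugging Face Llama-style model and is an assumption for illustration:

```python
from transformers import AutoModelForCausalLM

# assumption: a HF Llama-style causal LM; any torch.nn.Module works the same way
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# freeze every parameter in the network
for param in model.parameters():
    param.requires_grad = False

# unfreeze only the last N transformer blocks (criterion here: layer depth)
N = 2
for block in model.model.layers[-N:]:
    for param in block.parameters():
        param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")
```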


In this scenario, depending on the model size, you might need to go beyond one GPU, especially if your model does not fit into one GPU for training; for example, the Llama 2 7B model won't fit into one GPU for full fine-tuning. The way to think about it is that you need enough GPU memory to hold the model parameters, the gradients, and the optimizer states. Each of these takes up roughly the parameter count times the bytes per value for the precision you are training in (fp32 = 4 bytes, fp16 = 2 bytes, bf16 = 2 bytes). For example, the AdamW optimizer keeps two states for each of your parameters, and in many cases these are kept in fp32. This means that, depending on how many layers you are training/unfreezing, your GPU memory needs can grow beyond one GPU.
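As a back-of-the-envelope check, here is that arithmetic for full fine-tuning of a 7B model with AdamW, assuming bf16 weights and gradients with fp32 optimizer states (assumed precisions; activations are excluded):

```python
# rough GPU-memory estimate for full fine-tuning of Llama 2 7B with AdamW
params = 7e9
FP32, BF16 = 4, 2  # bytes per value

weights   = params * BF16        # model parameters in bf16
gradients = params * BF16        # one gradient per parameter
optimizer = 2 * params * FP32    # AdamW keeps two fp32 states per parameter

total_gb = (weights + gradients + optimizer) / 1e9
print(f"~{total_gb:.0f} GB before activations")  # ~84 GB, far beyond one GPU
```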

FSDP (Fully Sharded Data Parallel)

PyTorch provides the FSDP package for training models that do not fit into one GPU. FSDP lets you train a much larger model with the same amount of resources. Before FSDP there was DDP (Distributed Data Parallel), in which each GPU holds a full replica of the model and only the data is sharded; at the end of the backward pass, the gradients are synced across GPUs.

FSDP extends this idea by sharding not only the data but also the model parameters, gradients, and optimizer states, so each GPU keeps only one shard of the model. This results in huge memory savings that let us fit a much larger model into the same number of GPUs. As an example, with DDP the largest model you could fit into a GPU with 16 GB of memory is around 700M parameters; even if you had 4 such GPUs, you still could not scale beyond the model size that fits into one GPU. With FSDP, however, you can fit a 3B model onto those 4 GPUs, a more than 4x larger model.

Please read more on FSDP here & get started with FSDP here.

To boost the performance of fine-tuning with FSDP, we can make use of a number of features, such as the following (a combined sketch appears after this list):

  1. Mixed precision, which in FSDP is much more flexible than autocast: it gives the user control over the precision of model parameters, buffers, and gradients.

  2. Activation checkpointing, a technique that saves memory by discarding intermediate activations in the forward pass instead of keeping them in memory, at the cost of recomputing them in the backward pass. FSDP activation checkpointing is shard-aware, meaning we need to apply it after wrapping the model with FSDP; our script makes use of that.

  3. auto_wrap_policy, which is how you specify the way FSDP partitions the model; there is default support for a transformer wrapping policy. This lets FSDP form each FSDP unit (partition of the model) around the transformer block class in the model. To identify this class, look for the layer that wraps both the attention layer and the MLP. Forming units this way gives FSDP more fine-grained units for communication, which helps optimize the communication cost.
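To make these three features concrete, here is a minimal sketch of wrapping a Llama model with FSDP. It roughly follows the pattern the recipes use, but the exact import paths and the LlamaDecoderLayer class are assumptions about your torch/transformers versions, and it presumes torch.distributed has already been initialized (e.g. via torchrun):

```python
import functools

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    CheckpointImpl,
    apply_activation_checkpointing,
    checkpoint_wrapper,
)
from transformers import AutoModelForCausalLM
from transformers.models.llama.modeling_llama import LlamaDecoderLayer

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# 1. Mixed precision: explicit control over parameter, gradient, and buffer dtypes
bf16_policy = MixedPrecision(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.bfloat16,   # gradient communication in bf16
    buffer_dtype=torch.bfloat16,
)

# 3. auto_wrap_policy: one FSDP unit per LlamaDecoderLayer, the block that
#    wraps both the attention layer and the MLP
wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={LlamaDecoderLayer},
)

model = FSDP(
    model,
    auto_wrap_policy=wrap_policy,
    mixed_precision=bf16_policy,
    device_id=torch.cuda.current_device(),
)

# 2. Activation checkpointing: shard-aware, so apply it *after* FSDP wrapping
apply_activation_checkpointing(
    model,
    checkpoint_wrapper_fn=functools.partial(
        checkpoint_wrapper, checkpoint_impl=CheckpointImpl.NO_REENTRANT
    ),
    check_fn=lambda module: isinstance(module, LlamaDecoderLayer),
)
```

Wrapping at the decoder-layer level is what the transformer wrapping policy means in practice: each block containing both attention and MLP becomes one FSDP unit, so parameters are gathered and freed one block at a time.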

That wraps up this introduction to LLM fine-tuning methods for large models; we hope it is helpful.


