#RAG##AIGC#检索增强生成 (RAG) 基本介绍和入门实操示例

本文主要是介绍#RAG##AIGC#检索增强生成 (RAG) 基本介绍和入门实操示例,希望对大家解决编程问题提供一定的参考价值,需要的开发者们随着小编来一起学习吧!

本文包括RAG基本介绍和入门实操示例

RAG 基本介绍

通用语言模型可以进行微调以实现一些常见任务,例如情感分析和命名实体识别。这些任务通常不需要额外的背景知识。
对于更复杂和知识密集型的任务,可以构建基于语言模型的系统来访问外部知识源来完成任务。这使得事实更加一致,提高了生成响应的可靠性,并有助于减轻“幻觉”问题。
Meta AI 研究人员推出了一种称为检索增强生成(RAG)的方法来解决此类知识密集型任务。 RAG 将信息检索组件与文本生成器模型相结合。 RAG 可以进行微调,并且可以有效地修改其内部知识,而无需重新训练整个模型。
RAG 接受输入并检索一组给定来源(例如维基百科)的相关/支持文档。这些文档作为上下文与原始输入提示连接起来,并输入到生成最终输出的文本生成器。这使得 RAG 能够适应事实可能随时间变化的情况。这非常有用,因为 LLMs 的参数知识是静态的。 RAG 允许语言模型绕过再训练,从而能够访问最新信息,从而通过基于检索的生成来生成可靠的输出。
Lewis 等人 (2021) 提出了一种 RAG 的通用微调方法。预训练的 seq2seq 模型用作参数存储器,维基百科的密集向量索引用作非参数存储器(使用神经预训练检索器访问)。下面概述了该方法的工作原理:
请添加图片描述
RAG 在自然问题等多项基准测试中表现强劲, WebQuestions , 网络问题,和 CuratedTrec。在针对 MS-MARCO 和 Jeopardy 问题进行测试时,RAG 生成的答案更加真实、具体和多样化。 RAG 还改进了 FEVER 事实验证的结果。
这表明 RAG 作为增强知识密集型任务中语言模型输出的可行选择的潜力。
最近,这些基于检索器的方法变得越来越流行,并与 ChatGPT 等流行的 LLMs 相结合,以提高功能和事实一致性。

RAG 用例:生成友好的 ML 论文标题

下面,我们准备了一个笔记本教程,展示如何使用开源 LLMs 构建 RAG 系统来生成简短的机器学习论文标题:

RAG入门

虽然大型语言模型(LLM)显示出强大的功能来支持高级用例,但它们也存在事实不一致和幻觉等问题。检索增强生成(RAG)是丰富LLM能力和提高其可靠性的一种强大方法。
RAG涉及通过用有助于完成任务的相关信息丰富提示上下文,将LLM与外部知识相结合。

本教程展示了如何利用矢量存储和开源LLM开始使用RAG。
为了展示RAG的强大功能,本用例将涵盖构建一个RAG系统,该系统从原始ML论文标题中建议简短且易于阅读的ML论文标题。

对于普通受众来说,纸质资料可能过于技术化,因此使用RAG在之前创建的短标题的基础上生成小标题可以使研究论文标题更容易访问,并用于科学传播,如以时事通讯或博客的形式。

在开始之前,让我们先安装我们将要使用的库:

%%capture
!pip install chromadb tqdm fireworks-ai python-dotenv pandas
!pip install sentence-transformers

在继续之前,您需要获得Fireworks API密钥才能使用Mistral 7B模型。

查看此快速指南以获取您的Fireworks API密钥:: https://readme.fireworks.ai/docs

import fireworks.client
import os
import dotenv
import chromadb
import json
from tqdm.auto import tqdm
import pandas as pd
import random# 可以使用 Colab secrets 设置环境
dotenv.load_dotenv()fireworks.client.api_key = os.getenv("FIREWORKS_API_KEY")

开始

让我们定义一个函数,从Fireworks推理平台获取完成。

def get_completion(prompt, model=None, max_tokens=50):fw_model_dir = "accounts/fireworks/models/"if model is None:model = fw_model_dir + "llama-v2-7b"else:model = fw_model_dir + modelcompletion = fireworks.client.Completion.create(model=model,prompt=prompt,max_tokens=max_tokens,temperature=0)return completion.choices[0].text

让我们首先尝试一个简单提示的函数:

get_completion("Hello, my name is")
' Katie and I am a 20 year old student at the University of Leeds. I am currently studying a BA in English Literature and Creative Writing. I have been working as a tutor for over 3 years now and I'

现在让我们使用Mistral-7B-指令进行测试:

mistral_llm = "mistral-7b-instruct-4k"get_completion("Hello, my name is", model=mistral_llm)
' [Your Name]. I am a [Your Profession/Occupation]. I am writing to [Purpose of Writing].\n\nI am writing to [Purpose of Writing] because [Reason for Writing]. I believe that ['

Mistral 7B指令模型需要使用特殊的指令标记 [INST] <instruction> [/INST] 进行指令,以获得正确的行为。您可以在此处找到有关如何提示Mistral 7B指令的更多说明:: https://docs.mistral.ai/llm/mistral-instruct-v0.1

mistral_llm = "mistral-7b-instruct-4k"get_completion("Tell me 2 jokes", model=mistral_llm)
".\n1. Why don't scientists trust atoms? Because they make up everything!\n2. Did you hear about the mathematician who’s afraid of negative numbers? He will stop at nothing to avoid them."
mistral_llm = "mistral-7b-instruct-4k"get_completion("[INST]Tell me 2 jokes[/INST]", model=mistral_llm)
" Sure, here are two jokes for you:\n\n1. Why don't scientists trust atoms? Because they make up everything!\n2. Why did the tomato turn red? Because it saw the salad dressing!"

现在,让我们尝试使用一个更复杂的提示,其中包含说明:

prompt = """[INST]
Given the following wedding guest data, write a very short 3-sentences thank you letter:{"name": "John Doe","relationship": "Bride's cousin","hometown": "New York, NY","fun_fact": "Climbed Mount Everest in 2020","attending_with": "Sophia Smith","bride_groom_name": "Tom and Mary"
}Use only the data provided in the JSON object above.The senders of the letter is the bride and groom, Tom and Mary.
[/INST]"""get_completion(prompt, model=mistral_llm, max_tokens=150)
" Dear John Doe,\n\nWe, Tom and Mary, would like to extend our heartfelt gratitude for your attendance at our wedding. It was a pleasure to have you there, and we truly appreciate the effort you made to be a part of our special day.\n\nWe were thrilled to learn about your fun fact - climbing Mount Everest is an incredible accomplishment! We hope you had a safe and memorable journey.\n\nThank you again for joining us on this special occasion. We hope to stay in touch and catch up on all the amazing things you've been up to.\n\nWith love,\n\nTom and Mary"

RAG用例:生成短文标题

对于RAG用例,我们将使用a dataset 其中包含每周热门ML论文的列表。

用户将提供原始论文标题。然后,我们将接受该输入,然后使用数据集生成简短而吸引人的论文标题的上下文,这将有助于为原始输入标题生成吸引人的标题。

步骤1:加载数据集

让我们首先加载我们将使用的数据集:

# load dataset from data/ folder to pandas dataframe
# dataset contains column namesml_papers = pd.read_csv("../data/ml-potw-10232023.csv", header=0)# remove rows with empty titles or descriptions
ml_papers = ml_papers.dropna(subset=["Title", "Description"])
ml_papers.head()
TitleDescriptionPaperURLTweetURLAbstract
0Llemmaan LLM for mathematics which is based on conti...https://arxiv.org/abs/2310.10631https://x.com/zhangir_azerbay/status/171409802...We present Llemma, a large language model for ...
1LLMs for Software Engineeringa comprehensive survey of LLMs for software en...https://arxiv.org/abs/2310.03533https://x.com/omarsar0/status/1713940983199506...This paper provides a survey of the emerging a...
2Self-RAGpresents a new retrieval-augmented framework t...https://arxiv.org/abs/2310.11511https://x.com/AkariAsai/status/171511027707796...Despite their remarkable capabilities, large l...
3Retrieval-Augmentation for Long-form Question ...explores retrieval-augmented language models o...https://arxiv.org/abs/2310.12150https://x.com/omarsar0/status/1714986431859282...We present a study of retrieval-augmented lang...
4GenBenchpresents a framework for characterizing and un...https://www.nature.com/articles/s42256-023-007...https://x.com/AIatMeta/status/1715041427283902...NaN
# convert dataframe to list of dicts with Title and Description columns onlyml_papers_dict = ml_papers.to_dict(orient="records")
ml_papers_dict[0]
{'Title': 'Llemma','Description': 'an LLM for mathematics which is based on continued pretraining from Code Llama on the Proof-Pile-2 dataset; the dataset involves scientific paper, web data containing mathematics, and mathematical code; Llemma outperforms open base models and the unreleased Minerva on the MATH benchmark; the model is released, including dataset and code to replicate experiments.','PaperURL': 'https://arxiv.org/abs/2310.10631','TweetURL': 'https://x.com/zhangir_azerbay/status/1714098025956864031?s=20','Abstract': 'We present Llemma, a large language model for mathematics. We continue pretraining Code Llama on the Proof-Pile-2, a mixture of scientific papers, web data containing mathematics, and mathematical code, yielding Llemma. On the MATH benchmark Llemma outperforms all known open base models, as well as the unreleased Minerva model suite on an equi-parameter basis. Moreover, Llemma is capable of tool use and formal theorem proving without any further finetuning. We openly release all artifacts, including 7 billion and 34 billion parameter models, the Proof-Pile-2, and code to replicate our experiments.'}

我们将使用PenceTransformer生成嵌入,并将其存储到Chroma文档存储中。

from chromadb import Documents, EmbeddingFunction, Embeddings
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')class MyEmbeddingFunction(EmbeddingFunction):def __call__(self, input: Documents) -> Embeddings:batch_embeddings = embedding_model.encode(input)return batch_embeddings.tolist()embed_fn = MyEmbeddingFunction()# Initialize the chromadb directory, and client.
client = chromadb.PersistentClient(path="./chromadb")# create collection
collection = client.get_or_create_collection(name=f"ml-papers-nov-2023"
)
.gitattributes: 100%|██████████| 1.18k/1.18k [00:00<00:00, 194kB/s]
1_Pooling/config.json: 100%|██████████| 190/190 [00:00<00:00, 204kB/s]
README.md: 100%|██████████| 10.6k/10.6k [00:00<00:00, 7.64MB/s]
config.json: 100%|██████████| 612/612 [00:00<00:00, 679kB/s]
config_sentence_transformers.json: 100%|██████████| 116/116 [00:00<00:00, 94.0kB/s]
data_config.json: 100%|██████████| 39.3k/39.3k [00:00<00:00, 7.80MB/s]
pytorch_model.bin: 100%|██████████| 90.9M/90.9M [00:03<00:00, 24.3MB/s]
sentence_bert_config.json: 100%|██████████| 53.0/53.0 [00:00<00:00, 55.4kB/s]
special_tokens_map.json: 100%|██████████| 112/112 [00:00<00:00, 161kB/s]
tokenizer.json: 100%|██████████| 466k/466k [00:00<00:00, 6.15MB/s]
tokenizer_config.json: 100%|██████████| 350/350 [00:00<00:00, 286kB/s]
train_script.py: 100%|██████████| 13.2k/13.2k [00:00<00:00, 12.2MB/s]
vocab.txt: 100%|██████████| 232k/232k [00:00<00:00, 9.15MB/s]
modules.json: 100%|██████████| 349/349 [00:00<00:00, 500kB/s]

我们现在将为批生成嵌入:

# Generate embeddings, and index titles in batches
batch_size = 50# loop through batches and generated + store embeddings
for i in tqdm(range(0, len(ml_papers_dict), batch_size)):i_end = min(i + batch_size, len(ml_papers_dict))batch = ml_papers_dict[i : i + batch_size]# Replace title with "No Title" if empty stringbatch_titles = [str(paper["Title"]) if str(paper["Title"]) != "" else "No Title" for paper in batch]batch_ids = [str(sum(ord(c) + random.randint(1, 10000) for c in paper["Title"])) for paper in batch]batch_metadata = [dict(url=paper["PaperURL"],abstract=paper['Abstract'])for paper in batch]# generate embeddingsbatch_embeddings = embedding_model.encode(batch_titles)# upsert to chromadbcollection.upsert(ids=batch_ids,metadatas=batch_metadata,documents=batch_titles,embeddings=batch_embeddings.tolist(),)
100%|██████████| 9/9 [00:01<00:00,  7.62it/s]

现在我们可以测试寻回器:

collection = client.get_or_create_collection(name=f"ml-papers-nov-2023",embedding_function=embed_fn
)retriever_results = collection.query(query_texts=["Software Engineering"],n_results=2,
)print(retriever_results["documents"])
[['LLMs for Software Engineering', 'Communicative Agents for Software Development']]

现在,让我们总结一下最后的提示:

# user query
user_query = "S3Eval: A Synthetic, Scalable, Systematic Evaluation Suite for Large Language Models"# query for user query
results = collection.query(query_texts=[user_query],n_results=10,
)# concatenate titles into a single string
short_titles = '\n'.join(results['documents'][0])prompt_template = f'''[INST]Your main task is to generate 5 SUGGESTED_TITLES based for the PAPER_TITLEYou should mimic a similar style and length as SHORT_TITLES but PLEASE DO NOT include titles from SHORT_TITLES in the SUGGESTED_TITLES, only generate versions of the PAPER_TILE.PAPER_TITLE: {user_query}SHORT_TITLES: {short_titles}SUGGESTED_TITLES:[/INST]
'''responses = get_completion(prompt_template, model=mistral_llm, max_tokens=2000)
suggested_titles = ''.join([str(r) for r in responses])# Print the suggestions.
print("Model Suggestions:")
print(suggested_titles)
print("\n\n\nPrompt Template:")
print(prompt_template)
Model Suggestions:1. S3Eval: A Comprehensive Evaluation Suite for Large Language Models
2. Synthetic and Scalable Evaluation for Large Language Models
3. Systematic Evaluation of Large Language Models with S3Eval
4. S3Eval: A Synthetic and Scalable Approach to Language Model Evaluation
5. S3Eval: A Synthetic and Scalable Evaluation Suite for Large Language ModelsPrompt Template:
[INST]Your main task is to generate 5 SUGGESTED_TITLES based for the PAPER_TITLEYou should mimic a similar style and length as SHORT_TITLES but PLEASE DO NOT include titles from SHORT_TITLES in the SUGGESTED_TITLES, only generate versions of the PAPER_TILE.PAPER_TITLE: S3Eval: A Synthetic, Scalable, Systematic Evaluation Suite for Large Language ModelsSHORT_TITLES: Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling
ChemCrow: Augmenting large-language models with chemistry tools
A Survey of Large Language Models
LLaMA: Open and Efficient Foundation Language Models
SparseGPT: Massive Language Models Can Be Accurately Pruned In One-Shot
REPLUG: Retrieval-Augmented Black-Box Language Models
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
Auditing large language models: a three-layered approach
Fine-Tuning Language Models with Just Forward Passes
DERA: Enhancing Large Language Model Completions with Dialog-Enabled Resolving AgentsSUGGESTED_TITLES:[/INST]

正如您所看到的,LLM生成的简短标题在某种程度上是可以的。这个用例仍然需要做更多的工作,并且可能也会从微调中受益。为了本教程的目的,我们使用Firework的开源模型提供了一个简单的RAG应用程序

在这里尝试其他开源模型: https://app.fireworks.ai/models

R点击此处了解有关Fireworks API的更多信息: https://readme.fireworks.ai/reference/createchatcompletion

参考文献

· https://arxiv.org/abs/2312.10997 Retrieval-Augmented Generation for Large Language Models: A Survey
大型语言模型的检索增强生成:一项调查
· https://ai.meta.com/blog/retrieval-augmented-generation-streamlining-the-creation-of-intelligent-natural-language-processing-models/ Retrieval Augmented Generation: Streamlining the creation of intelligent natural language processing models
检索增强生成:简化智能自然语言处理模型的创建
· https://arxiv.org/abs/2302.07842 Augmented Language Models: a Survey
增强语言模型:调查

这篇关于#RAG##AIGC#检索增强生成 (RAG) 基本介绍和入门实操示例的文章就介绍到这儿,希望我们推荐的文章对编程师们有所帮助!



http://www.chinasem.cn/article/611680

相关文章

史上最全MybatisPlus从入门到精通

《史上最全MybatisPlus从入门到精通》MyBatis-Plus是MyBatis增强工具,简化开发并提升效率,支持自动映射表名/字段与实体类,提供条件构造器、多种查询方式(等值/范围/模糊/分页... 目录1.简介2.基础篇2.1.通用mapper接口操作2.2.通用service接口操作3.进阶篇3

C++归并排序代码实现示例代码

《C++归并排序代码实现示例代码》归并排序将待排序数组分成两个子数组,分别对这两个子数组进行排序,然后将排序好的子数组合并,得到排序后的数组,:本文主要介绍C++归并排序代码实现的相关资料,需要的... 目录1 算法核心思想2 代码实现3 算法时间复杂度1 算法核心思想归并排序是一种高效的排序方式,需要用

C#中的Drawing 类案例详解

《C#中的Drawing类案例详解》文章解析WPF与WinForms的Drawing类差异,涵盖命名空间、继承链、常用类及应用场景,通过案例展示如何创建带阴影圆角矩形按钮,强调WPF的轻量、可动画特... 目录一、Drawing 是什么?二、典型用法三、案例:画一个“带阴影的圆角矩形按钮”四、WinForm

MySQL ORDER BY 语句常见用法、示例详解

《MySQLORDERBY语句常见用法、示例详解》ORDERBY是结构化查询语言(SQL)中的关键字,隶属于SELECT语句的子句结构,用于对查询结果集按指定列进行排序,本文给大家介绍MySQL... 目录mysql ORDER BY 语句详细说明1.基本语法2.排序方向详解3.多列排序4.常见用法示例5.

Python自定义异常的全面指南(入门到实践)

《Python自定义异常的全面指南(入门到实践)》想象你正在开发一个银行系统,用户转账时余额不足,如果直接抛出ValueError,调用方很难区分是金额格式错误还是余额不足,这正是Python自定义异... 目录引言:为什么需要自定义异常一、异常基础:先搞懂python的异常体系1.1 异常是什么?1.2

SQLServer中生成雪花ID(Snowflake ID)的实现方法

《SQLServer中生成雪花ID(SnowflakeID)的实现方法》:本文主要介绍在SQLServer中生成雪花ID(SnowflakeID)的实现方法,文中通过示例代码介绍的非常详细,... 目录前言认识雪花ID雪花ID的核心特点雪花ID的结构(64位)雪花ID的优势雪花ID的局限性雪花ID的应用场景

C#之枚举类型与随机数详解

《C#之枚举类型与随机数详解》文章讲解了枚举类型的定义与使用方法,包括在main外部声明枚举,用于表示游戏状态和周几状态,枚举值默认从0开始递增,也可手动设置初始值以生成随机数... 目录枚举类型1.定义枚举类型(main外)2.使用生成随机数总结枚举类型1.定义枚举类型(main外)enum 类型名字

SpringBoot集成Shiro+JWT(Hutool)完整代码示例

《SpringBoot集成Shiro+JWT(Hutool)完整代码示例》ApacheShiro是一个强大且易用的Java安全框架,提供了认证、授权、加密和会话管理功能,在现代应用开发中,Shiro因... 目录一、背景介绍1.1 为什么使用Shiro?1.2 为什么需要双Token?二、技术栈组成三、环境

Java 与 LibreOffice 集成开发指南(环境搭建及代码示例)

《Java与LibreOffice集成开发指南(环境搭建及代码示例)》本文介绍Java与LibreOffice的集成方法,涵盖环境配置、API调用、文档转换、UNO桥接及REST接口等技术,提供... 目录1. 引言2. 环境搭建2.1 安装 LibreOffice2.2 配置 Java 开发环境2.3 配

DNS查询的利器! linux的dig命令基本用法详解

《DNS查询的利器!linux的dig命令基本用法详解》dig命令可以查询各种类型DNS记录信息,下面我们将通过实际示例和dig命令常用参数来详细说明如何使用dig实用程序... dig(Domain Information Groper)是一款功能强大的 linux 命令行实用程序,通过查询名称服务器并输