Qdrant Official Quickstart and Tutorials, Simplified

2024-08-29 04:36


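Notes:

  • First published: 2024-08-28
  • Official Qdrant documentation: https://qdrant.tech/documentation/

About

These are condensed notes on a small part of the official Qdrant documentation. For anything not covered here, see the official documentation.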

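Deploying Qdrant locally with Docker

docker pull qdrant/qdrant
docker run -d -p 6333:6333 -p 6334:6334 \
    -v $(pwd)/qdrant_storage:/qdrant/storage:z \
    qdrant/qdrant

Port 6333 serves the REST API and port 6334 serves gRPC. With the default configuration, all data is stored in ./qdrant_storage.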

Quickstart

Install the qdrant-client package (Python):

pip install qdrant-client

Initialize the client:

from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")

All vector data is stored in Qdrant collections. Create a collection named test_collection that uses the dot product as the metric for comparing vectors.

from qdrant_client.models import Distance, VectorParams

client.create_collection(
    collection_name="test_collection",
    vectors_config=VectorParams(size=4, distance=Distance.DOT),
)

Add vectors with payloads. A payload is the data associated with a vector.

from qdrant_client.models import PointStruct

operation_info = client.upsert(
    collection_name="test_collection",
    wait=True,
    points=[
        PointStruct(id=1, vector=[0.05, 0.61, 0.76, 0.74], payload={"city": "Berlin"}),
        PointStruct(id=2, vector=[0.19, 0.81, 0.75, 0.11], payload={"city": "London"}),
        PointStruct(id=3, vector=[0.36, 0.55, 0.47, 0.94], payload={"city": "Moscow"}),
        PointStruct(id=4, vector=[0.18, 0.01, 0.85, 0.80], payload={"city": "New York"}),
        PointStruct(id=5, vector=[0.24, 0.18, 0.22, 0.44], payload={"city": "Beijing"}),
        PointStruct(id=6, vector=[0.35, 0.08, 0.11, 0.44], payload={"city": "Mumbai"}),
    ],
)

print(operation_info)

Run a query:

search_result = client.query_points(
    collection_name="test_collection",
    query=[0.2, 0.1, 0.9, 0.7],
    limit=3,
).points

print(search_result)

Output:

[
  {"id": 4, "version": 0, "score": 1.362, "payload": null, "vector": null},
  {"id": 1, "version": 0, "score": 1.273, "payload": null, "vector": null},
  {"id": 3, "version": 0, "score": 1.208, "payload": null, "vector": null}
]
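The scores in this output are plain dot products between the query and each stored vector. The following is a minimal pure-Python sketch that recomputes them from the six vectors upserted earlier; the helper names dot and top_k are made up for illustration, and this is not how Qdrant computes or indexes scores internally.

```python
# Recompute the quickstart's dot-product scores without Qdrant.
points = {
    1: [0.05, 0.61, 0.76, 0.74],
    2: [0.19, 0.81, 0.75, 0.11],
    3: [0.36, 0.55, 0.47, 0.94],
    4: [0.18, 0.01, 0.85, 0.80],
    5: [0.24, 0.18, 0.22, 0.44],
    6: [0.35, 0.08, 0.11, 0.44],
}
query = [0.2, 0.1, 0.9, 0.7]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k(points, query, k):
    """Return the k (id, score) pairs with the highest dot product."""
    scored = [(pid, dot(vec, query)) for pid, vec in points.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


for pid, score in top_k(points, query, k=3):
    print(pid, round(score, 3))
# prints:
# 4 1.362
# 1 1.273
# 3 1.208
```

The ids and scores match the query output, which confirms that Distance.DOT ranks points by the raw dot product, with larger values ranked first.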

Add a filter:

from qdrant_client.models import Filter, FieldCondition, MatchValue

search_result = client.query_points(
    collection_name="test_collection",
    query=[0.2, 0.1, 0.9, 0.7],
    query_filter=Filter(
        must=[FieldCondition(key="city", match=MatchValue(value="London"))]
    ),
    with_payload=True,
    limit=3,
).points

print(search_result)

Output:

[
  {"id": 2, "version": 0, "score": 0.871, "payload": {"city": "London"}, "vector": null}
]
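Conceptually, a must condition restricts the candidate set to points whose payload matches, and only those candidates are ranked. The sketch below mimics that behaviour in plain Python under that assumption; the function filtered_search and its must argument (a dict of payload key to required value) are hypothetical and do not mirror the qdrant-client API.

```python
# Filter by payload first, then rank the survivors by dot product.
records = [
    {"id": 1, "vector": [0.05, 0.61, 0.76, 0.74], "payload": {"city": "Berlin"}},
    {"id": 2, "vector": [0.19, 0.81, 0.75, 0.11], "payload": {"city": "London"}},
    {"id": 3, "vector": [0.36, 0.55, 0.47, 0.94], "payload": {"city": "Moscow"}},
    {"id": 4, "vector": [0.18, 0.01, 0.85, 0.80], "payload": {"city": "New York"}},
]


def filtered_search(records, query, must, limit):
    """Keep records whose payload equals every key/value in `must`, then rank."""
    candidates = [
        r for r in records
        if all(r["payload"].get(key) == value for key, value in must.items())
    ]
    scored = [
        {"id": r["id"],
         "score": sum(x * y for x, y in zip(r["vector"], query)),
         "payload": r["payload"]}
        for r in candidates
    ]
    return sorted(scored, key=lambda hit: hit["score"], reverse=True)[:limit]


hits = filtered_search(records, [0.2, 0.1, 0.9, 0.7], must={"city": "London"}, limit=3)
for hit in hits:
    print(hit["id"], round(hit["score"], 3), hit["payload"])
# prints: 2 0.871 {'city': 'London'}
```

Only one point is returned even though limit is 3, because limit caps the result count while the filter decides which points are eligible at all.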

Tutorials

Getting started with semantic search

Install the dependencies:

pip install sentence-transformers

Import the modules:

from qdrant_client import models, QdrantClient
from sentence_transformers import SentenceTransformer

Use the all-MiniLM-L6-v2 encoder as the embedding model. An embedding model converts raw data into embeddings.

encoder = SentenceTransformer("all-MiniLM-L6-v2")

Add the dataset:

documents = [
    {
        "name": "The Time Machine",
        "description": "A man travels through time and witnesses the evolution of humanity.",
        "author": "H.G. Wells",
        "year": 1895,
    },
    {
        "name": "Ender's Game",
        "description": "A young boy is trained to become a military leader in a war against an alien race.",
        "author": "Orson Scott Card",
        "year": 1985,
    },
    {
        "name": "Brave New World",
        "description": "A dystopian society where people are genetically engineered and conditioned to conform to a strict social hierarchy.",
        "author": "Aldous Huxley",
        "year": 1932,
    },
    {
        "name": "The Hitchhiker's Guide to the Galaxy",
        "description": "A comedic science fiction series following the misadventures of an unwitting human and his alien friend.",
        "author": "Douglas Adams",
        "year": 1979,
    },
    {
        "name": "Dune",
        "description": "A desert planet is the site of political intrigue and power struggles.",
        "author": "Frank Herbert",
        "year": 1965,
    },
    {
        "name": "Foundation",
        "description": "A mathematician develops a science to predict the future of humanity and works to save civilization from collapse.",
        "author": "Isaac Asimov",
        "year": 1951,
    },
    {
        "name": "Snow Crash",
        "description": "A futuristic world where the internet has evolved into a virtual reality metaverse.",
        "author": "Neal Stephenson",
        "year": 1992,
    },
    {
        "name": "Neuromancer",
        "description": "A hacker is hired to pull off a near-impossible hack and gets pulled into a web of intrigue.",
        "author": "William Gibson",
        "year": 1984,
    },
    {
        "name": "The War of the Worlds",
        "description": "A Martian invasion of Earth throws humanity into chaos.",
        "author": "H.G. Wells",
        "year": 1898,
    },
    {
        "name": "The Hunger Games",
        "description": "A dystopian society where teenagers are forced to fight to the death in a televised spectacle.",
        "author": "Suzanne Collins",
        "year": 2008,
    },
    {
        "name": "The Andromeda Strain",
        "description": "A deadly virus from outer space threatens to wipe out humanity.",
        "author": "Michael Crichton",
        "year": 1969,
    },
    {
        "name": "The Left Hand of Darkness",
        "description": "A human ambassador is sent to a planet where the inhabitants are genderless and can change gender at will.",
        "author": "Ursula K. Le Guin",
        "year": 1969,
    },
    {
        "name": "The Three-Body Problem",
        "description": "Humans encounter an alien civilization that lives in a dying system.",
        "author": "Liu Cixin",
        "year": 2008,
    },
]

Store the embeddings in memory:

client = QdrantClient(":memory:")

Create a collection:

client.create_collection(
    collection_name="my_books",
    vectors_config=models.VectorParams(
        # The vector size is determined by the model in use
        size=encoder.get_sentence_embedding_dimension(),
        distance=models.Distance.COSINE,
    ),
)
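This collection compares vectors by cosine similarity, which depends only on the direction of the vectors and not on their length. As a rough illustration, the sketch below computes it by hand and checks that it equals the dot product of the unit-normalized vectors; the helper names are made up for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


a = [3.0, 4.0]
b = [6.0, 8.0]    # same direction as a, twice as long
c = [-4.0, 3.0]   # perpendicular to a

print(cosine_similarity(a, b))  # 1.0
print(cosine_similarity(a, c))  # 0.0

# Cosine similarity equals the dot product of the normalized vectors.
unit_dot = sum(x * y for x, y in zip(normalize(a), normalize(b)))
print(math.isclose(unit_dot, cosine_similarity(a, b)))  # True
```

This is why the scores in a cosine collection fall between -1 and 1, unlike the unbounded scores of a dot-product collection.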

Upload the data:

client.upload_points(
    collection_name="my_books",
    points=[
        models.PointStruct(
            id=idx,
            vector=encoder.encode(doc["description"]).tolist(),
            payload=doc,
        )
        for idx, doc in enumerate(documents)
    ],
)

Ask a question:

hits = client.query_points(
    collection_name="my_books",
    query=encoder.encode("alien invasion").tolist(),
    limit=3,
).points

for hit in hits:
    print(hit.payload, "score:", hit.score)

Output:

{'name': 'The War of the Worlds', 'description': 'A Martian invasion of Earth throws humanity into chaos.', 'author': 'H.G. Wells', 'year': 1898} score: 0.570093257022374
{'name': "The Hitchhiker's Guide to the Galaxy", 'description': 'A comedic science fiction series following the misadventures of an unwitting human and his alien friend.', 'author': 'Douglas Adams', 'year': 1979} score: 0.5040468703143637
{'name': 'The Three-Body Problem', 'description': 'Humans encounter an alien civilization that lives in a dying system.', 'author': 'Liu Cixin', 'year': 2008} score: 0.45902943411768216

Narrow the query with a filter:

hits = client.query_points(
    collection_name="my_books",
    query=encoder.encode("alien invasion").tolist(),
    query_filter=models.Filter(
        must=[models.FieldCondition(key="year", range=models.Range(gte=2000))]
    ),
    limit=1,
).points

for hit in hits:
    print(hit.payload, "score:", hit.score)

Output:

{'name': 'The Three-Body Problem', 'description': 'Humans encounter an alien civilization that lives in a dying system.', 'author': 'Liu Cixin', 'year': 2008} score: 0.45902943411768216
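The range condition explains why "The War of the Worlds", the best unfiltered match, no longer appears: it was published in 1898 and is excluded before ranking. Below is a small plain-Python sketch of a range check using the same bound names (gt, gte, lt, lte) as models.Range; the in_range helper is hypothetical and only illustrates the semantics.

```python
def in_range(value, gt=None, gte=None, lt=None, lte=None):
    """Return True if `value` satisfies every bound that is not None."""
    if gt is not None and not value > gt:
        return False
    if gte is not None and not value >= gte:
        return False
    if lt is not None and not value < lt:
        return False
    if lte is not None and not value <= lte:
        return False
    return True


# (name, year) pairs taken from the dataset above.
books = [
    ("The War of the Worlds", 1898),
    ("The Hitchhiker's Guide to the Galaxy", 1979),
    ("The Hunger Games", 2008),
    ("The Three-Body Problem", 2008),
]

recent = [name for name, year in books if in_range(year, gte=2000)]
print(recent)
# ['The Hunger Games', 'The Three-Body Problem']
```

Of the two books that pass the year filter, "The Three-Body Problem" is the closer match to "alien invasion", so it is the single hit returned with limit=1.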

A simple neural search

Download the sample dataset:

wget https://storage.googleapis.com/generall-shared-data/startups_demo.json

Install SentenceTransformer and the other dependencies:

pip install sentence-transformers numpy pandas tqdm

Import the modules:

from sentence_transformers import SentenceTransformer
import numpy as np
import json
import pandas as pd
from tqdm.notebook import tqdm

Create the sentence encoder:

model = SentenceTransformer(
    "all-MiniLM-L6-v2", device="cuda"
)  # or device="cpu" if you don't have a GPU

Read the data:

df = pd.read_json("./startups_demo.json", lines=True)

Create an embedding vector for each description. Internally, encode splits the input into batches to speed up processing.

vectors = model.encode(
    [row.alt + ". " + row.description for row in df.itertuples()],
    show_progress_bar=True,
)

vectors.shape
# > (40474, 384)
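To make the batching idea concrete, here is a minimal sketch of splitting a list into fixed-size batches. The batched helper is made up for illustration and is not the library's implementation; the batch size of 32 used in the second part is an assumption about the default of encode, which also accepts an explicit batch_size argument.

```python
import math


def batched(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


print(list(batched(list(range(10)), batch_size=4)))
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]

# With 40474 texts and an assumed batch size of 32, the encoder would run
# this many forward passes, the last one on a partial batch.
n_texts = 40474
assumed_batch_size = 32
print(math.ceil(n_texts / assumed_batch_size))  # 1265
```

Larger batches usually improve throughput on a GPU at the cost of more memory per forward pass.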

Save the vectors as an .npy file:

np.save("startup_vectors.npy", vectors, allow_pickle=False)

Start the Docker service:

docker pull qdrant/qdrant
docker run -p 6333:6333 \
    -v $(pwd)/qdrant_storage:/qdrant/storage \
    qdrant/qdrant

Create the Qdrant client:

# Import client library
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance

client = QdrantClient("http://localhost:6333")

Create the collection. The size 384 is the output dimension of the embedding model (all-MiniLM-L6-v2).

if not client.collection_exists("startups"):
    client.create_collection(
        collection_name="startups",
        vectors_config=VectorParams(size=384, distance=Distance.COSINE),
    )

Load the data:

fd = open("./startups_demo.json")

# payload is now an iterator over startup data
payload = map(json.loads, fd)

# Load all vectors into memory; a NumPy array is itself iterable.
# Alternatively, use mmap if you don't want to load all the data into RAM.
vectors = np.load("./startup_vectors.npy")
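The comment in the loading code mentions memory mapping as an alternative to reading every vector into RAM. A minimal sketch of that option with NumPy follows; it writes a tiny stand-in array to a temporary directory (the file name demo_vectors.npy and the array contents are made up) rather than the real startup vectors.

```python
import os
import tempfile

import numpy as np

# Stand-in for the real embedding matrix: 5 vectors of dimension 4.
demo_vectors = np.arange(20, dtype=np.float32).reshape(5, 4)

path = os.path.join(tempfile.mkdtemp(), "demo_vectors.npy")
np.save(path, demo_vectors, allow_pickle=False)

# mmap_mode="r" maps the file read-only; rows are read from disk on access
# instead of being loaded up front.
mapped = np.load(path, mmap_mode="r")
print(type(mapped).__name__, mapped.shape)  # memmap (5, 4)

# Slicing touches only the requested rows; np.asarray copies them into RAM.
first_rows = np.asarray(mapped[:2])
print(first_rows.tolist())
# [[0.0, 1.0, 2.0, 3.0], [4.0, 5.0, 6.0, 7.0]]
```

A memory-mapped array can be iterated row by row like a regular array, so it is a plausible drop-in for the in-memory vectors when the file is larger than available RAM.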

Upload the data to Qdrant:

client.upload_collection(
    collection_name="startups",
    vectors=vectors,
    payload=payload,
    ids=None,  # Vector ids will be assigned automatically
    batch_size=256,  # How many vectors will be uploaded in a single request
)

Create the file neural_searcher.py:

from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer


class NeuralSearcher:
    def __init__(self, collection_name):
        self.collection_name = collection_name
        # Initialize the encoder model
        self.model = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")
        # Initialize the Qdrant client
        self.qdrant_client = QdrantClient("http://localhost:6333")

    def search(self, text: str):
        # Convert the text query into a vector
        vector = self.model.encode(text).tolist()

        # Use `vector` to search for the closest vectors in the collection
        search_result = self.qdrant_client.query_points(
            collection_name=self.collection_name,
            query=vector,
            query_filter=None,  # No filters for now
            limit=5,  # The 5 closest results are enough
        ).points

        # `search_result` contains the ids of the found vectors with their
        # similarity scores and stored payloads.
        # This function only needs the payloads.
        payloads = [hit.payload for hit in search_result]
        return payloads

Deploy with FastAPI:

pip install fastapi uvicorn

Create the service file. It repeats the NeuralSearcher class, adds a search_in_berlin method that filters by city, and exposes both searches as HTTP endpoints:

from fastapi import FastAPI
from qdrant_client import QdrantClient
from qdrant_client.models import Filter
from sentence_transformers import SentenceTransformer


class NeuralSearcher:
    def __init__(self, collection_name):
        self.collection_name = collection_name
        # Initialize the encoder model
        self.model = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")
        # Initialize the Qdrant client
        self.qdrant_client = QdrantClient("http://localhost:6333")

    def search(self, text: str):
        # Convert the text query into a vector
        vector = self.model.encode(text).tolist()

        # Use `vector` to search for the closest vectors in the collection
        search_result = self.qdrant_client.query_points(
            collection_name=self.collection_name,
            query=vector,
            query_filter=None,  # No filters for now
            limit=5,  # The 5 closest results are enough
        ).points

        # This function only needs the payloads of the hits
        payloads = [hit.payload for hit in search_result]
        return payloads

    def search_in_berlin(self, text: str):
        # Convert the text query into a vector
        vector = self.model.encode(text).tolist()

        city_of_interest = "Berlin"

        # Define a filter for cities
        city_filter = Filter(**{
            "must": [{
                "key": "city",  # City information is stored in a field of the same name
                "match": {  # This condition checks whether the payload field has the requested value
                    "value": city_of_interest
                }
            }]
        })

        # Use `vector` to search for the closest vectors in the collection
        search_result = self.qdrant_client.query_points(
            collection_name=self.collection_name,
            query=vector,
            query_filter=city_filter,
            limit=5,
        ).points

        # This function only needs the payloads of the hits
        payloads = [hit.payload for hit in search_result]
        return payloads


app = FastAPI()

# Create a neural searcher instance
neural_searcher = NeuralSearcher(collection_name="startups")


@app.get("/api/search")
def search_startup(q: str):
    return {"result": neural_searcher.search(text=q)}


@app.get("/api/search_in_berlin")
def search_startup_filter(q: str):
    return {"result": neural_searcher.search_in_berlin(text=q)}


if __name__ == "__main__":
    import uvicorn

    uvicorn.run(app, host="0.0.0.0", port=8001)
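Once the service is running, the two endpoints can be called from any HTTP client. The following is a small standard-library sketch; the base URL assumes the service listens on localhost port 8001 as in the uvicorn.run call, and the helper names build_search_url and search are made up for this example. Only the URL construction is executed here, since the request itself needs a live server.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the FastAPI service is reachable locally on port 8001.
BASE_URL = "http://localhost:8001"


def build_search_url(query, endpoint="/api/search"):
    """Build the request URL, URL-encoding the `q` query parameter."""
    return f"{BASE_URL}{endpoint}?{urlencode({'q': query})}"


def search(query, endpoint="/api/search"):
    """Call the running service and return the list under the "result" key."""
    with urlopen(build_search_url(query, endpoint)) as response:
        return json.load(response)["result"]


print(build_search_url("startups about AI"))
# http://localhost:8001/api/search?q=startups+about+AI
print(build_search_url("fintech", endpoint="/api/search_in_berlin"))
# http://localhost:8001/api/search_in_berlin?q=fintech
```

With the server up, search("startups about AI") would return the payloads of the five closest startups.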

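If you run this in a Jupyter notebook, where an event loop is already running, add the following before starting the server:

import nest_asyncio
nest_asyncio.apply()

Install nest_asyncio:

pip install nest_asyncio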

Using Qdrant asynchronously

The Qdrant Python client supports async natively through AsyncQdrantClient:

import asyncio

import qdrant_client
from qdrant_client import models


async def main():
    client = qdrant_client.AsyncQdrantClient("localhost")

    # Create a collection
    await client.create_collection(
        collection_name="my_collection",
        vectors_config=models.VectorParams(size=4, distance=models.Distance.COSINE),
    )

    # Insert a vector
    await client.upsert(
        collection_name="my_collection",
        points=[
            models.PointStruct(
                id="5c56c793-69f3-4fbf-87e6-c4bf54c28c26",
                payload={
                    "color": "red",
                },
                vector=[0.9, 0.1, 0.1, 0.5],
            ),
        ],
    )

    # Search for nearest neighbors.
    # The awaited call is parenthesized so that `.points` is read from the
    # response rather than from the un-awaited coroutine.
    points = (
        await client.query_points(
            collection_name="my_collection",
            query=[0.9, 0.1, 0.1, 0.5],
            limit=2,
        )
    ).points

    # Your async code using AsyncQdrantClient might be put here
    # ...


asyncio.run(main())
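Two details of async usage are easy to get wrong. In Python, attribute access binds more tightly than await, so the awaited call has to be parenthesized before reading an attribute such as .points from its result. Independent queries can also be awaited together with asyncio.gather. The sketch below shows both with a stand-in client: FakeAsyncClient and FakeResponse are hypothetical classes invented for this example so that it runs without a Qdrant server.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class FakeResponse:
    """Hypothetical response object exposing a `points` list."""
    points: list = field(default_factory=list)


class FakeAsyncClient:
    """Hypothetical stand-in for an async client; no network involved."""

    async def query_points(self, query, limit):
        await asyncio.sleep(0)  # yield control, as a real network call would
        return FakeResponse(points=[f"hit-{i}-for-{query}" for i in range(limit)])


async def main():
    client = FakeAsyncClient()

    # Parenthesize the awaited call: `await f().points` would look up
    # `.points` on the coroutine object before it has been awaited.
    points = (await client.query_points(query="a", limit=2)).points

    # Independent queries can be awaited concurrently.
    responses = await asyncio.gather(
        client.query_points(query="b", limit=1),
        client.query_points(query="c", limit=1),
    )
    return points, [response.points for response in responses]


points, batches = asyncio.run(main())
print(points)   # ['hit-0-for-a', 'hit-1-for-a']
print(batches)  # [['hit-0-for-b'], ['hit-0-for-c']]
```

asyncio.gather returns results in the order the awaitables were passed, regardless of which finishes first.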


http://www.chinasem.cn/article/1116853
