Deploying Mistral 7B with Function Calling and JSON Mode Using llama.cpp in Docker


Notes:

First published: 2024-08-27
References:
- https://www.markhneedham.com/blog/2024/06/23/mistral-7b-function-calling-llama-cpp/
- https://github.com/abetlen/llama-cpp-python?tab=readme-ov-file#function-calling
- https://github.com/abetlen/llama-cpp-python/tree/main/docker#cuda_simple
- https://docs.mistral.ai/capabilities/json_mode/
- https://huggingface.co/MaziyarPanahi/Mistral-7B-Instruct-v0.3-GGUF
- https://stackoverflow.com/questions/30905674/newer-versions-of-docker-have-cap-add-what-caps-can-be-added
- https://man7.org/linux/man-pages/man7/capabilities.7.html
- https://docs.docker.com/engine/containers/run/#runtime-privilege-and-linux-capabilities
- https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
- https://www.cnblogs.com/davis12/p/14453690.html

Download the GGUF model

Use the Hugging Face mirror at https://hf-mirror.com/

Option 1:

pip install -U huggingface_hub
export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download --resume-download MaziyarPanahi/Mistral-7B-Instruct-v0.3-GGUF --include "*Q4_K_M.gguf"

Option 2 (recommended):

sudo apt update
sudo apt install aria2 git-lfs
wget https://hf-mirror.com/hfd/hfd.sh
chmod a+x hfd.sh
./hfd.sh MaziyarPanahi/Mistral-7B-Instruct-v0.3-GGUF --include "*Q4_K_M.gguf" --tool aria2c -x 16 --local-dir MaziyarPanahi--Mistral-7B-Instruct-v0.3-GGUF
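Both download commands pass a glob pattern to --include so that only the Q4_K_M quantization is fetched. The sketch below approximates that filtering with Python's fnmatch module. The file list is hypothetical (only the Q4_K_M file name appears in this post), and the real tools may differ in edge cases.

```python
from fnmatch import fnmatchcase

# Hypothetical repository listing, for illustration only.
# Only the Q4_K_M file name is taken from this post.
repo_files = [
    "README.md",
    "Mistral-7B-Instruct-v0.3.Q2_K.gguf",
    "Mistral-7B-Instruct-v0.3.Q4_K_M.gguf",
    "Mistral-7B-Instruct-v0.3.Q8_0.gguf",
]


def select_files(filenames, pattern):
    """Return the file names that match a shell-style glob pattern."""
    return [name for name in filenames if fnmatchcase(name, pattern)]


selected = select_files(repo_files, "*Q4_K_M.gguf")
print(selected)
```

Quoting the pattern on the command line keeps the shell from expanding it against files in the current directory before the download tool sees it.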

Deploy the service with Docker

The NVIDIA Container Toolkit must be installed before building the image.

Install the NVIDIA Container Toolkit

Set up the package repository:

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

Install:

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

Configure Docker (restart the Docker daemon afterwards so the change takes effect):

sudo nvidia-ctk runtime configure --runtime=docker

For more on installing the NVIDIA Container Toolkit, see the official documentation: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html

Build the image

Use the official Dockerfile: https://github.com/abetlen/llama-cpp-python/blob/main/docker/cuda_simple/Dockerfile

ARG CUDA_IMAGE="12.2.0-devel-ubuntu22.04"
FROM nvidia/cuda:${CUDA_IMAGE}

# We need to set the host to 0.0.0.0 to allow outside access
ENV HOST 0.0.0.0

RUN apt-get update && apt-get upgrade -y \
    && apt-get install -y git build-essential \
    python3 python3-pip gcc wget \
    ocl-icd-opencl-dev opencl-headers clinfo \
    libclblast-dev libopenblas-dev \
    && mkdir -p /etc/OpenCL/vendors && echo "libnvidia-opencl.so.1" > /etc/OpenCL/vendors/nvidia.icd

COPY . .

# setting build related env vars
ENV CUDA_DOCKER_ARCH=all
ENV GGML_CUDA=1

# Install dependencies
RUN python3 -m pip install --upgrade pip pytest cmake scikit-build setuptools fastapi uvicorn sse-starlette pydantic-settings starlette-context

# Install llama-cpp-python (build with cuda)
RUN CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python

# Run the server
CMD python3 -m llama_cpp.server

Because CUDA 12.2 is installed on my machine, I changed the base image to nvidia/cuda:12.2.0-devel-ubuntu22.04.

docker build -t llama_cpp_cuda_simple .

Start the service

docker run --gpus=all --cap-add SYS_RESOURCE -e USE_MLOCK=0 -e model=/models/downloaded/MaziyarPanahi--Mistral-7B-Instruct-v0.3-GGUF/Mistral-7B-Instruct-v0.3.Q4_K_M.gguf -e n_gpu_layers=-1 -e chat_format=chatml-function-calling -v /mnt/d/16-LLM-Cache/llama_cpp_gnuf:/models -p 8000:8000 -t llama_cpp_cuda_simple

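Where:

-v maps a local folder to the /models folder inside the container
--gpus=all uses all available GPUs
--cap-add SYS_RESOURCE grants the container the SYS_RESOURCE capability
Options starting with -e set environment variables, which in practice set the parameters of llama_cpp.server. For the relevant code, see https://github.com/abetlen/llama-cpp-python/blob/259ee151da9a569f58f6d4979e97cfd5d5bc3ecd/llama_cpp/server/main.py#L79 and https://github.com/abetlen/llama-cpp-python/blob/259ee151da9a569f58f6d4979e97cfd5d5bc3ecd/llama_cpp/server/settings.py#L17 . The environment variable names set here are case-insensitive, see https://docs.pydantic.dev/latest/concepts/pydantic_settings/#case-sensitivity
- -e model points to the model file
- -e n_gpu_layers=-1 offloads all of the model's layers to the GPU
  - If the model has N layers in total and n_gpu_layers of them are placed on the GPU, the remaining N - n_gpu_layers layers stay on the CPU
- -e chat_format=chatml-function-calling enables Function Calling support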
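To make the two points above concrete, here is a minimal Python sketch of a case-insensitive settings lookup and of the GPU/CPU layer split. The helper names lookup_setting and split_layers are hypothetical and this is not the llama-cpp-python implementation. The sketch assumes that a negative value, or a value larger than the layer count, means "all layers on the GPU", and it uses 32 as the layer count for a 7B model.

```python
def lookup_setting(environ, name):
    """Find a setting by name while ignoring the case of the variable name.

    Simplified stand-in for the case-insensitive matching described above.
    """
    lowered = {key.lower(): value for key, value in environ.items()}
    return lowered.get(name.lower())


def split_layers(total_layers, n_gpu_layers):
    """Return (layers_on_gpu, layers_on_cpu) for a requested n_gpu_layers.

    Assumption: a negative or too-large request places every layer on the GPU.
    """
    if n_gpu_layers < 0 or n_gpu_layers > total_layers:
        on_gpu = total_layers
    else:
        on_gpu = n_gpu_layers
    return on_gpu, total_layers - on_gpu


# Environment as passed with "-e"; the upper-case spelling is deliberate.
env = {"N_GPU_LAYERS": "-1", "chat_format": "chatml-function-calling"}

requested = int(lookup_setting(env, "n_gpu_layers"))
print(split_layers(32, requested))  # all layers on the GPU
print(split_layers(32, 20))         # 20 on the GPU, 12 on the CPU
```

Lowering n_gpu_layers is the usual lever when the whole model does not fit in GPU memory, at the cost of slower inference for the layers left on the CPU.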

Once the server has started, open http://localhost:8000/docs in a browser to view the API documentation.

Testing the API

Function Calling

curl --location 'http://localhost:8000/v1/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer sk-xxxxxxxxxxxxxxxxxxxxxx' \
--data '{"model": "gpt-3.5-turbo","messages": [{"role": "system","content": "You are a helpful assistant.\nYou can call functions with appropriate input when necessary"},{"role": "user","content": "What'\''s the weather like in Mauritius?"}],"tools": [{"type": "function","function": {"name": "get_current_weather","description": "Get the current weather in a given latitude and longitude","parameters": {"type": "object","properties": {"latitude": {"type": "number","description": "The latitude of a place"},"longitude": {"type": "number","description": "The longitude of a place"}},"required": ["latitude", "longitude"]}}}],"tool_choice": "auto"
}'

Output:

{"id": "chatcmpl-50c8e261-2b1a-4285-a6ee-e18a07ce92d9","object": "chat.completion","created": 1724757544,"model": "gpt-3.5-turbo","choices": [{"index": 0,"message": {"content": null,"tool_calls": [{"id": "call__0_get_current_weather_cmpl-97515c72-d214-4ed9-b183-7736199e5be1","type": "function","function": {"name": "get_current_weather","arguments": "{\"latitude\": -20.375, \"longitude\": 57.568} "}}],"role": "assistant","function_call": {"name": "","arguments": "{\"latitude\": -20.375, \"longitude\": 57.568} "}},"logprobs": null,"finish_reason": "tool_calls"}],"usage": {"prompt_tokens": 299,"completion_tokens": 25,"total_tokens": 324}
}
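The model does not run the function itself; it only returns the function name and a JSON string of arguments. A rough Python sketch of how a client might act on the response above follows. The local get_current_weather implementation and the TOOLS registry are hypothetical placeholders, and the response dictionary is trimmed to the fields that are used.

```python
import json

# Response body copied from the output above, trimmed to the fields used here.
response = {
    "choices": [
        {
            "message": {
                "content": None,
                "role": "assistant",
                "tool_calls": [
                    {
                        "id": "call__0_get_current_weather_cmpl-97515c72-d214-4ed9-b183-7736199e5be1",
                        "type": "function",
                        "function": {
                            "name": "get_current_weather",
                            "arguments": "{\"latitude\": -20.375, \"longitude\": 57.568} ",
                        },
                    }
                ],
            },
            "finish_reason": "tool_calls",
        }
    ]
}


def get_current_weather(latitude, longitude):
    """Hypothetical placeholder; a real version would query a weather service."""
    return {"latitude": latitude, "longitude": longitude, "forecast": "unknown"}


# Maps tool names declared in the request's "tools" list to local callables.
TOOLS = {"get_current_weather": get_current_weather}


def run_tool_calls(response):
    """Run every tool call in the first choice and return tool-role messages."""
    message = response["choices"][0]["message"]
    results = []
    for call in message.get("tool_calls") or []:
        function = call["function"]
        # "arguments" is a JSON string, not an object, so decode it first.
        arguments = json.loads(function["arguments"])
        output = TOOLS[function["name"]](**arguments)
        results.append(
            {"role": "tool", "tool_call_id": call["id"], "content": json.dumps(output)}
        )
    return results


tool_messages = run_tool_calls(response)
print(tool_messages[0]["content"])
```

In the OpenAI-style flow these tool messages are appended to the conversation and sent in a follow-up request. That second round is not covered in this post.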

JSON Mode

curl --location "http://localhost:8000/v1/chat/completions" \
--header 'Content-Type: application/json' \
--header 'Accept: application/json' \
--header "Authorization: Bearer sk-xxxxxxxxxxxxxxxxxxxxxx" \
--data '{"model": "gpt-3.5-turbo","messages": [{"role": "user","content": "What is the best French cheese? Return the product and produce location in JSON format"}],"response_format": {"type": "json_object"}}'

Output:

{"id": "chatcmpl-bbfecfc5-2ea9-4052-93b2-08f1733e8219","object": "chat.completion","created": 1724757752,"model": "gpt-3.5-turbo","choices": [{"index": 0,"message": {"content": "{\n  \"product\": \"Roquefort\",\n  \"produce_location\": \"France, South of France\"\n}\n  \t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t","role": "assistant"},"logprobs": null,"finish_reason": "stop"}],"usage": {"prompt_tokens": 44,"completion_tokens": 50,"total_tokens": 94}
}

Use the following code to write the content field to a text file:

text = "{\n  \"product\": \"Roquefort\",\n  \"location\": \"France, South of France\"\n}\n \t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t"
with open('resp.txt', 'w') as f:
    f.write(text)

The file then contains:

{
  "product": "Roquefort",
  "location": "France, South of France"
}
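Since the content field is itself a JSON document, a client would normally decode it rather than write it to a file. A minimal sketch, using the content string from the JSON Mode response above; the helper name parse_json_content is hypothetical.

```python
import json

# The "content" string from the JSON Mode response, trailing whitespace included.
content = "{\n  \"product\": \"Roquefort\",\n  \"produce_location\": \"France, South of France\"\n}\n  \t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t"


def parse_json_content(content):
    """Decode a JSON Mode reply and make sure it is a JSON object."""
    # json.loads accepts leading and trailing whitespace, so no strip() is needed.
    data = json.loads(content)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    return data


cheese = parse_json_content(content)
print(cheese["product"])            # Roquefort
print(cheese["produce_location"])   # France, South of France
```

The trailing tabs in the reply are harmless for decoding, but they still count toward completion_tokens.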
