Notes on Setting Up Qwen 0.5B on Ubuntu 20.04

Environment

Ubuntu 20.04
NVIDIA-SMI 545.29.06
CUDA 11.4
Python 3.10
PyTorch 1.11.0

Setup

Python environment

Create a virtual environment

conda create --name qwen python=3.10
conda activate qwen

Install modelscope and transformers first

pip install modelscope
pip install transformers

Install PyTorch

conda install pytorch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 cudatoolkit=11.3 -c pytorch
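Before moving on, it can help to confirm that the three packages import inside the new environment and that PyTorch can see the GPU. The following is only a minimal sketch: the helper names module_version and report are made up for this example, and it assumes each package exposes a __version__ attribute (torch, transformers and modelscope all do at the time of writing).

```python
import importlib


def module_version(name):
    """Return the module's __version__ as a string, "unknown" if it has
    none, or None if the module cannot be imported at all."""
    try:
        module = importlib.import_module(name)
    except (ImportError, OSError):
        return None
    return str(getattr(module, "__version__", "unknown"))


def report():
    """Print the installed version of each package used in this setup."""
    for name in ("torch", "transformers", "modelscope"):
        version = module_version(name)
        print(f"{name}: {version if version is not None else 'not installed'}")

    # Only query CUDA when torch itself imported successfully.
    if module_version("torch") is not None:
        import torch

        print("CUDA version torch was built with:", torch.version.cuda)
        print("CUDA available:", torch.cuda.is_available())


if __name__ == "__main__":
    report()
```

If "CUDA available" prints False, the usual suspects are a CPU-only PyTorch build or a driver problem; llama.cpp built with plain make runs on the CPU, so this does not block the remaining steps.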

Download the model

Create a Python file

gedit download.py

Paste the following into it

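from modelscope.hub.file_download import model_file_download

# Download a single GGUF file from the ModelScope repository.
# Replace cache_dir with the directory where the model should be stored.
model_dir = model_file_download(
    model_id='qwen/Qwen1.5-0.5B-Chat-GGUF',
    file_path='qwen1_5-0_5b-chat-q5_k_m.gguf',
    revision='master',
    cache_dir='path/to/local/dir',
)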

Run the Python file to start the download

python download.py
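A quick way to tell whether the download finished cleanly is to look at the file header: a GGUF file begins with the four ASCII bytes GGUF followed by a little-endian 32-bit format version. The sketch below checks only that header, not the rest of the file. The function name gguf_version is made up for this example, and the path you pass in is whatever location the download step produced on your machine.

```python
import struct


def gguf_version(path):
    """Return the GGUF format version stored in the file header.

    Raises ValueError if the file does not start with the GGUF magic bytes,
    which usually means a truncated or wrong download.
    """
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    # Bytes 4..7 hold the format version as an unsigned little-endian int.
    (version,) = struct.unpack("<I", header[4:8])
    return version
```

For example, calling gguf_version on the downloaded qwen1_5-0_5b-chat-q5_k_m.gguf should return a small integer; a ValueError suggests the file is an error page or an incomplete download.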

Download llama.cpp

Clone the llama.cpp project with git

git clone https://github.com/ggerganov/llama.cpp

Once the clone finishes, change into the llama.cpp directory and build the project

cd llama.cpp
make -j

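Model download and run

The model can also be downloaded from the ModelScope community page:
https://www.modelscope.cn/models/qwen/Qwen1.5-0.5B-Chat-GGUF/files
I downloaded qwen1_5-0_5b-chat-q5_k_m.gguf.
Run the model from a terminal, replacing the model path with your own. (The -cml flag in the official command does not appear in the help output and prevented the program from running, so I removed it.)
Official command: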

./main -m /path/to/local/dir/qwen/Qwen1.5-0.5B-Chat-GGUF/qwen1_5-0_5b-chat-q5_k_m.gguf -n 512 --color -i -cml -f prompts/chat-with-qwen.txt

What I ran:

./main -m /path/to/local/dir/qwen/Qwen1.5-0.5B-Chat-GGUF/qwen1_5-0_5b-chat-q5_k_m.gguf -n 512 --color -i -f prompts/chat-with-qwen.txt
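If you prefer to launch the model from a script, the same command line can be assembled as an argument list. This is a rough sketch, not part of llama.cpp: build_command is a name invented here, and it only reproduces the flags from the command I ran (-m, -n, --color, -i, -f), with the same placeholder paths.

```python
import shlex


def build_command(model_path, prompt_file, n_predict=512, binary="./main"):
    """Assemble the llama.cpp command used in this post as an argument list."""
    return [
        binary,
        "-m", str(model_path),    # path to the GGUF model file
        "-n", str(n_predict),     # number of tokens to predict
        "--color",                # colorise prompt vs. generated text
        "-i",                     # interactive mode
        "-f", str(prompt_file),   # file containing the initial prompt
    ]


command = build_command(
    "/path/to/local/dir/qwen/Qwen1.5-0.5B-Chat-GGUF/qwen1_5-0_5b-chat-q5_k_m.gguf",
    "prompts/chat-with-qwen.txt",
)
# Print the command in copy-pasteable shell form.
print(shlex.join(command))

# To actually start the chat, run this from inside the llama.cpp directory:
#   import subprocess
#   subprocess.run(command, check=True)
```

Passing a list to subprocess.run avoids shell quoting problems when the model path contains spaces.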

Help output:

usage: ./main [options]

general:

  -h,    --help, --usage          print usage and exit
         --version                show version and build info
  -v,    --verbose                print verbose information
         --verbosity N            set specific verbosity level (default: 0)
         --verbose-prompt         print a verbose prompt before generation (default: false)
         --no-display-prompt      don't print prompt at generation (default: false)
  -co,   --color                  colorise output to distinguish prompt and user input from generations (default: false)
  -s,    --seed SEED              RNG seed (default: -1, use random seed for < 0)
  -t,    --threads N              number of threads to use during generation (default: 8)
  -tb,   --threads-batch N        number of threads to use during batch and prompt processing (default: same as --threads)
  -td,   --threads-draft N        number of threads to use during generation (default: same as --threads)
  -tbd,  --threads-batch-draft N  number of threads to use during batch and prompt processing (default: same as --threads-draft)
         --draft N                number of tokens to draft for speculative decoding (default: 5)
  -ps,   --p-split N              speculative decoding split probability (default: 0.1)
  -lcs,  --lookup-cache-static FNAME
                                  path to static lookup cache to use for lookup decoding (not updated by generation)
  -lcd,  --lookup-cache-dynamic FNAME
                                  path to dynamic lookup cache to use for lookup decoding (updated by generation)
  -c,    --ctx-size N             size of the prompt context (default: 0, 0 = loaded from model)
  -n,    --predict N              number of tokens to predict (default: -1, -1 = infinity, -2 = until context filled)
  -b,    --batch-size N           logical maximum batch size (default: 2048)
  -ub,   --ubatch-size N          physical maximum batch size (default: 512)
         --keep N                 number of tokens to keep from the initial prompt (default: 0, -1 = all)
         --chunks N               max number of chunks to process (default: -1, -1 = all)
  -fa,   --flash-attn             enable Flash Attention (default: disabled)
  -p,    --prompt PROMPT          prompt to start generation with (default: '')
  -f,    --file FNAME             a file containing the prompt (default: none)
         --in-file FNAME          an input file (repeat to specify multiple files)
  -bf,   --binary-file FNAME      binary file containing the prompt (default: none)
  -e,    --escape                 process escapes sequences (\n, \r, \t, \', \", \\) (default: true)
         --no-escape              do not process escape sequences
  -ptc,  --print-token-count N    print token count every N tokens (default: -1)
         --prompt-cache FNAME     file to cache prompt state for faster startup (default: none)
         --prompt-cache-all       if specified, saves user input and generations to cache as well
                                  not supported with --interactive or other interactive options
         --prompt-cache-ro        if specified, uses the prompt cache but does not update it
  -r,    --reverse-prompt PROMPT  halt generation at PROMPT, return control in interactive mode
                                  can be specified more than once for multiple prompts
  -sp,   --special                special tokens output enabled (default: false)
  -cnv,  --conversation           run in conversation mode (does not print special tokens and suffix/prefix) (default: false)
  -i,    --interactive            run in interactive mode (default: false)
  -if,   --interactive-first      run in interactive mode and wait for input right away (default: false)
  -mli,  --multiline-input        allows you to write or paste multiple lines without ending each in '\'
         --in-prefix-bos          prefix BOS to user inputs, preceding the `--in-prefix` string
         --in-prefix STRING       string to prefix user inputs with (default: empty)
         --in-suffix STRING       string to suffix after user inputs with (default: empty)

sampling:

         --samplers SAMPLERS      samplers that will be used for generation in the order, separated by ';'
                                  (default: top_k;tfs_z;typical_p;top_p;min_p;temperature)
         --sampling-seq SEQUENCE  simplified sequence for samplers that will be used (default: kfypmt)
         --ignore-eos             ignore end of stream token and continue generating (implies --logit-bias EOS-inf)
         --penalize-nl            penalize newline tokens (default: false)
         --temp N                 temperature (default: 0.8)
         --top-k N                top-k sampling (default: 40, 0 = disabled)
         --top-p N                top-p sampling (default: 0.9, 1.0 = disabled)
         --min-p N                min-p sampling (default: 0.1, 0.0 = disabled)
         --tfs N                  tail free sampling, parameter z (default: 1.0, 1.0 = disabled)
         --typical N              locally typical sampling, parameter p (default: 1.0, 1.0 = disabled)
         --repeat-last-n N        last n tokens to consider for penalize (default: 64, 0 = disabled, -1 = ctx_size)
         --repeat-penalty N       penalize repeat sequence of tokens (default: 1.0, 1.0 = disabled)
         --presence-penalty N     repeat alpha presence penalty (default: 0.0, 0.0 = disabled)
         --frequency-penalty N    repeat alpha frequency penalty (default: 0.0, 0.0 = disabled)
         --dynatemp-range N       dynamic temperature range (default: 0.0, 0.0 = disabled)
         --dynatemp-exp N         dynamic temperature exponent (default: 1.0)
         --mirostat N             use Mirostat sampling.
                                  Top K, Nucleus, Tail Free and Locally Typical samplers are ignored if used.
                                  (default: 0, 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0)
         --mirostat-lr N          Mirostat learning rate, parameter eta (default: 0.1)
         --mirostat-ent N         Mirostat target entropy, parameter tau (default: 5.0)
         -l TOKEN_ID(+/-)BIAS     modifies the likelihood of token appearing in the completion,
                                  i.e. `--logit-bias 15043+1` to increase likelihood of token ' Hello',
                                  or `--logit-bias 15043-1` to decrease likelihood of token ' Hello'
         --cfg-negative-prompt PROMPT
                                  negative prompt to use for guidance (default: '')
         --cfg-negative-prompt-file FNAME
                                  negative prompt file to use for guidance
         --cfg-scale N            strength of guidance (default: 1.0, 1.0 = disable)

grammar:

         --grammar GRAMMAR        BNF-like grammar to constrain generations (see samples in grammars/ dir) (default: '')
         --grammar-file FNAME     file to read grammar from
  -j,    --json-schema SCHEMA     JSON schema to constrain generations (https://json-schema.org/), e.g. `{}` for any JSON object
                                  For schemas w/ external $refs, use --grammar + example/json_schema_to_grammar.py instead

embedding:

         --pooling {none,mean,cls}
                                  pooling type for embeddings, use model default if unspecified

context hacking:

         --rope-scaling {none,linear,yarn}
                                  RoPE frequency scaling method, defaults to linear unless specified by the model
         --rope-scale N           RoPE context scaling factor, expands context by a factor of N
         --rope-freq-base N       RoPE base frequency, used by NTK-aware scaling (default: loaded from model)
         --rope-freq-scale N      RoPE frequency scaling factor, expands context by a factor of 1/N
         --yarn-orig-ctx N        YaRN: original context size of model (default: 0 = model training context size)
         --yarn-ext-factor N      YaRN: extrapolation mix factor (default: -1.0, 0.0 = full interpolation)
         --yarn-attn-factor N     YaRN: scale sqrt(t) or attention magnitude (default: 1.0)
         --yarn-beta-slow N       YaRN: high correction dim or alpha (default: 1.0)
         --yarn-beta-fast N       YaRN: low correction dim or beta (default: 32.0)
  -gan,  --grp-attn-n N           group-attention factor (default: 1)
  -gaw,  --grp-attn-w N           group-attention width (default: 512.0)
  -dkvc, --dump-kv-cache          verbose print of the KV cache
  -nkvo, --no-kv-offload          disable KV offload
  -ctk,  --cache-type-k TYPE      KV cache data type for K (default: f16)
  -ctv,  --cache-type-v TYPE      KV cache data type for V (default: f16)

perplexity:

         --all-logits             return logits for all tokens in the batch (default: false)
         --hellaswag              compute HellaSwag score over random tasks from datafile supplied with -f
         --hellaswag-tasks N      number of tasks to use when computing the HellaSwag score (default: 400)
         --winogrande             compute Winogrande score over random tasks from datafile supplied with -f
         --winogrande-tasks N     number of tasks to use when computing the Winogrande score (default: 0)
         --multiple-choice        compute multiple choice score over random tasks from datafile supplied with -f
         --multiple-choice-tasks N
                                  number of tasks to use when computing the multiple choice score (default: 0)
         --kl-divergence          computes KL-divergence to logits provided via --kl-divergence-base
         --ppl-stride N           stride for perplexity calculation (default: 0)
         --ppl-output-type {0,1}  output type for perplexity calculation (default: 0)

parallel:

  -dt,   --defrag-thold N         KV cache defragmentation threshold (default: -1.0, < 0 - disabled)
  -np,   --parallel N             number of parallel sequences to decode (default: 1)
  -ns,   --sequences N            number of sequences to decode (default: 1)
  -cb,   --cont-batching          enable continuous batching (a.k.a dynamic batching) (default: enabled)

multi-modality:

         --mmproj FILE            path to a multimodal projector file for LLaVA. see examples/llava/README.md
         --image FILE             path to an image file. use with multimodal models. Specify multiple times for batching

backend:

         --rpc SERVERS            comma separated list of RPC servers
         --mlock                  force system to keep model in RAM rather than swapping or compressing
         --no-mmap                do not memory-map model (slower load but may reduce pageouts if not using mlock)
         --numa TYPE              attempt optimizations that help on some NUMA systems
                                    - distribute: spread execution evenly over all nodes
                                    - isolate: only spawn threads on CPUs on the node that execution started on
                                    - numactl: use the CPU map provided by numactl
                                  if run without this previously, it is recommended to drop the system page cache before using this
                                  see https://github.com/ggerganov/llama.cpp/issues/1437

model:

         --check-tensors          check model tensor data for invalid values (default: false)
         --override-kv KEY=TYPE:VALUE
                                  advanced option to override model metadata by key. may be specified multiple times.
                                  types: int, float, bool, str. example: --override-kv tokenizer.ggml.add_bos_token=bool:false
         --lora FNAME             apply LoRA adapter (implies --no-mmap)
         --lora-scaled FNAME S    apply LoRA adapter with user defined scaling S (implies --no-mmap)
         --lora-base FNAME        optional model to use as a base for the layers modified by the LoRA adapter
         --control-vector FNAME   add a control vector
         --control-vector-scaled FNAME SCALE
                                  add a control vector with user defined scaling SCALE
         --control-vector-layer-range START END
                                  layer range to apply the control vector(s) to, start and end inclusive
  -m,    --model FNAME            model path (default: models/$filename with filename from --hf-file
                                  or --model-url if set, otherwise models/7B/ggml-model-f16.gguf)
  -md,   --model-draft FNAME      draft model for speculative decoding (default: unused)
  -mu,   --model-url MODEL_URL    model download url (default: unused)
  -hfr,  --hf-repo REPO           Hugging Face model repository (default: unused)
  -hff,  --hf-file FILE           Hugging Face model file (default: unused)

retrieval:

         --context-file FNAME     file to load context from (repeat to specify multiple files)
         --chunk-size N           minimum length of embedded text chunks (default: 64)
         --chunk-separator STRING separator between chunks (default: '')

passkey:

         --junk N                 number of times to repeat the junk text (default: 250)
         --pos N                  position of the passkey in the junk text (default: -1)

imatrix:

  -o,    --output FNAME           output file (default: 'imatrix.dat')
         --output-frequency N     output the imatrix every N iterations (default: 10)
         --save-frequency N       save an imatrix copy every N iterations (default: 0)
         --process-output         collect data for the output tensor (default: false)
         --no-ppl                 do not compute perplexity (default: true)
         --chunk N                start processing the input from chunk N (default: 0)

bench:

  -pps                            is the prompt shared across parallel sequences (default: false)
  -npp n0,n1,...                  number of prompt tokens
  -ntg n0,n1,...                  number of text generation tokens
  -npl n0,n1,...                  number of parallel prompts

server:

         --host HOST              ip address to listen (default: 127.0.0.1)
         --port PORT              port to listen (default: 8080)
         --path PATH              path to serve static files from (default: )
         --embedding(s)           enable embedding endpoint (default: disabled)
         --api-key KEY            API key to use for authentication (default: none)
         --api-key-file FNAME     path to file containing API keys (default: none)
         --ssl-key-file FNAME     path to file a PEM-encoded SSL private key
         --ssl-cert-file FNAME    path to file a PEM-encoded SSL certificate
         --timeout N              server read/write timeout in seconds (default: 600)
         --threads-http N         number of threads used to process HTTP requests (default: -1)
         --system-prompt-file FNAME
                                  set a file to load a system prompt (initial prompt of all slots), this is useful for chat applications
         --log-format {text,json} log output format: json or text (default: json)
         --metrics                enable prometheus compatible metrics endpoint (default: disabled)
         --no-slots               disables slots monitoring endpoint (default: enabled)
         --slot-save-path PATH    path to save slot kv cache (default: disabled)
         --chat-template JINJA_TEMPLATE
                                  set custom jinja chat template (default: template taken from model's metadata)
                                  only commonly used templates are accepted:
                                  https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template
  -sps,  --slot-prompt-similarity SIMILARITY
                                  how much the prompt of a request must match the prompt of a slot in order to use that slot (default: 0.50, 0.0 = disabled)

logging:

         --simple-io              use basic IO for better compatibility in subprocesses and limited consoles
  -ld,   --logdir LOGDIR          path under which to save YAML logs (no logging if unset)
         --log-test               Run simple logging test
         --log-disable            Disable trace logs
         --log-enable             Enable trace logs
         --log-file FNAME         Specify a log filename (without extension)
         --log-new                Create a separate new log file on start. Each log file will have unique name: "<name>.<ID>.log"
         --log-append             Don't truncate the old log file.