Make sure you compiled llama with the correct env variables according to this guide, so that llama accepts the `-ngl N` (or `--n-gpu-layers N`) flag. When running llama, you may configure `N` to be very large, and llama will offload the maximum possible number of layers to the GPU, even if it's less than the number you configured. For example:
./main -m "path/to/model.bin" -ngl 200000 -p "Please sir, may I have some "
When running llama, before it starts the inference work, it will output diagnostic information that shows whether cuBLAS is offloading work to the GPU. Look for these lines:
llama_model_load_internal: [cublas] offloading 60 layers to GPU
llama_model_load_internal: [cublas] offloading output layer to GPU
llama_model_load_internal: [cublas] total VRAM used: 17223 MB
... rest of inference
If you see these lines, then the GPU is being used.
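The check above can also be scripted. The following is a minimal sketch, assuming the diagnostic lines look exactly like the ones quoted above; the function names `gpu_offload_active` and `offloaded_layers` are made up for this example and are not part of llama.

```shell
#!/usr/bin/env bash
# Succeed (exit status 0) if the llama startup log read from stdin
# contains the cuBLAS layer-offload line quoted above.
gpu_offload_active() {
    grep -Eq '\[cublas\] offloading [0-9]+ layers to GPU'
}

# Print the number of layers reported as offloaded in a log read from stdin.
# Prints nothing if no matching line is found.
offloaded_layers() {
    sed -nE 's/.*\[cublas\] offloading ([0-9]+) layers to GPU.*/\1/p' | head -n 1
}
```

One possible way to use it is to capture the output of a run with `./main ... 2>&1 | tee llama.log` (the `2>&1` assumes the diagnostics are written to stderr) and then run `gpu_offload_active < llama.log && echo "GPU is being used"`.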
llama accepts a `-t N` (or `--threads N`) parameter. It's extremely important that this parameter is not too large. If your token generation is extremely slow, try setting this number to 1. If this significantly improves your token generation speed, then your CPU is being oversaturated and you need to explicitly set this parameter to the number of physical CPU cores on your machine (even if you utilize a GPU). If in doubt, start with 1 and double the number until you hit a performance bottleneck, then scale it back down.
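The "start with 1 and double" procedure can be sketched as a small search loop. This is an illustration only: `find_thread_count` is a hypothetical helper, and it expects you to supply your own benchmark command that takes a thread count as its first argument and prints the measured tokens/second as a plain number.

```shell
#!/usr/bin/env bash
# Double the thread count while throughput keeps improving, then print
# the best thread count seen.
#   $1: name of a command or function that takes a thread count and
#       prints tokens/second (you provide this; it is not part of llama)
#   $2: largest thread count to try
find_thread_count() {
    local bench=$1 max=$2
    local t=1 best_t=1 best_speed=0 speed
    while [ "$t" -le "$max" ]; do
        speed=$("$bench" "$t")
        # awk does the floating-point comparison that [ ] cannot.
        if awk -v a="$speed" -v b="$best_speed" 'BEGIN { exit !(a > b) }'; then
            best_t=$t
            best_speed=$speed
        else
            break   # throughput stopped improving: bottleneck reached
        fi
        t=$((t * 2))
    done
    echo "$best_t"
}
```

The result is only the best power of two that was tried. The real optimum may lie between it and the next doubling, so it is worth trying a few values in that range by hand, up to the number of physical cores.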
These runs were tested on the following machine:
- GPU: A6000 (48GB VRAM)
- CPU: 7 physical cores
- RAM: 32GB
Model: `TheBloke_Wizard-Vicuna-30B-Uncensored-GGML/Wizard-Vicuna-30B-Uncensored.ggmlv3.q4_0.bin` (30B parameters, 4-bit quantization, GGML)
Run command: `./main -m "path/to/model.bin" -p "An extremely detailed description of the 10 best ethnic dishes will follow, with recipes: " -n 1000 [additional benchmark flags]`
Result:
| Additional benchmark flags | Tokens/second (higher is better) |
|---|---|
| `-ngl 2000000` | N/A (less than 0.1) |
| `-t 7` | 1.7 |
| `-t 1 -ngl 2000000` | 5.5 |
| `-t 7 -ngl 2000000` | 8.7 |
| `-t 4 -ngl 2000000` | 9.1 |
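
To put the table in perspective, the GPU runs can be expressed as a speedup over the CPU-only run (`-t 7`, 1.7 tokens/second). The sketch below only does that arithmetic on the figures from the table; `speedup_vs_baseline` is a helper name invented for this example.

```shell
#!/usr/bin/env bash
# Print how many times faster each run is than a baseline.
#   stdin: lines of "<tokens/second> <flags...>"
#   $1:    baseline tokens/second
speedup_vs_baseline() {
    awk -v base="$1" '{
        speed = $1
        $1 = ""; sub(/^ /, "")          # keep only the flags in $0
        printf "%.1fx  %s\n", speed / base, $0
    }'
}

# Figures taken from the results table above; baseline is the -t 7 CPU-only run.
speedup_vs_baseline 1.7 <<'EOF'
5.5 -t 1 -ngl 2000000
8.7 -t 7 -ngl 2000000
9.1 -t 4 -ngl 2000000
EOF
```

On this machine the best configuration (`-t 4 -ngl 2000000`) comes out at roughly 5.4 times the CPU-only throughput.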