Compiling and running llama.cpp
The instructions are available on the llama.cpp website, but they generally omit libcurl, which leads to this error message:
llama_load_model_from_hf: llama.cpp built without libcurl, downloading from Hugging Face not supported.
Therefore my procedure is:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DLLAMA_CURL=ON
cmake --build build --config Release
If I remember correctly, you might also need the libcurl4-openssl-dev package:
sudo apt install libcurl4-openssl-dev
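With libcurl in place and the build finished, a quick sanity check (the --version flag prints the commit and build number):
./build/bin/llama-cli --version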
Run the first model
# Load and run a small model:
./build/bin/llama-cli -hf bartowski/DeepSeek-R1-Distill-Qwen-1.5B-GGUF
# Or a larger 7B variant:
./build/bin/llama-cli -hf bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF
It will download the GGUF file to your ~/.cache/llama.cpp/ folder.
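You can check what has been downloaded with:
ls -lh ~/.cache/llama.cpp/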
After a while you get an input prompt, and you can say simple things like Hi or ask questions like How many R's are in the word STRAWBERRY. To exit, type Ctrl-C.
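If you prefer a one-shot run over the interactive prompt, llama-cli also takes the prompt on the command line; a minimal example using the -p and -n flags (prompt and number of tokens to generate):
./build/bin/llama-cli -hf bartowski/DeepSeek-R1-Distill-Qwen-1.5B-GGUF -p "How many R's are in the word STRAWBERRY?" -n 256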
Compile with CUDA support
Toolkit
Have the CUDA Toolkit installed. In my case (Ubuntu 24.04) it's a few commands:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-ubuntu2404.pin
sudo mv cuda-ubuntu2404.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.8.0/local_installers/cuda-repo-ubuntu2404-12-8-local_12.8.0-570.86.10-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2404-12-8-local_12.8.0-570.86.10-1_amd64.deb
sudo cp /var/cuda-repo-ubuntu2404-12-8-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-8
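The toolkit lands in /usr/local/cuda-12.8 (assuming the default prefix), which is not on the PATH; to verify the install:
export PATH=/usr/local/cuda-12.8/bin:$PATH
nvcc --version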
Driver
You also need to install the open NVIDIA kernel driver:
sudo apt-get install -y nvidia-open
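After a reboot, nvidia-smi should list your GPU and the driver version:
nvidia-smi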
Compile
cmake -B build -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build build --config Release
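If cmake cannot find the CUDA compiler, you can point it at nvcc explicitly; adding -j also speeds up the build considerably (a sketch, assuming the toolkit path from the previous step):
cmake -B build -DGGML_CUDA=ON -DLLAMA_CURL=ON -DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.8/bin/nvcc
cmake --build build --config Release -j $(nproc)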
Run with CUDA support
You need to set the -ngl (or --n-gpu-layers) option to 999, otherwise only the CPU will be utilized:
./build/bin/llama-cli -ngl 999 -hf bartowski/DeepSeek-R1-Distill-Qwen-1.5B-GGUF
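The model load log should report how many layers were offloaded; you can also watch the GPU memory fill up from a second terminal:
watch -n 1 nvidia-smi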
Benchmarking with llama-bench
CPU
$ ./llama-bench -m ~/.cache/llama.cpp/DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf
| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | CPU | 4 | pp512 | 58.53 ± 6.88 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | CPU | 4 | tg128 | 24.18 ± 3.41 |
build: 19d3c829 (4677)
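llama-bench can sweep a parameter in a single run; for example, comparing thread counts with a comma-separated list:
./llama-bench -m ~/.cache/llama.cpp/DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf -t 4,8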
GPU
$ ./llama-bench -m /mnt/data/models/DeepSeek-R1-1.5B-Q4_K_M.gguf -ngl 999
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3070 Ti, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | CUDA | 999 | pp512 | 10292.97 ± 584.16 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | CUDA | 999 | tg128 | 233.28 ± 2.37 |
build: 19d3c829 (4677)
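The same comma-separated sweep works for -ngl, which turns the CPU-vs-GPU comparison into a single invocation:
./llama-bench -m /mnt/data/models/DeepSeek-R1-1.5B-Q4_K_M.gguf -ngl 0,999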
The old instructions were:
./llama.cpp/build/bin/llama-bench -m .cache/llama.cpp/bartowski_DeepSeek-R1-Distill-Qwen-1.5B-GGUF_DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf -ngl 999