DeepSeek R1
DeepSeek R1 was released on January 20th, 2025 with 671B parameters. With reported development costs of 5.5 million USD (see paper, page 5), it attracted a lot of interest and affected the stock prices of many tech companies in the following days.
Performance of distilled models
Several smaller distilled models were released to run on consumer hardware, since the full 671-billion-parameter model still requires 404 GB of VRAM. I tried to run a few of them on my hardware to test the performance.
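One way to measure the tokens/s figures used in the tables below is the Ollama REST API: a non-streaming request returns `eval_count` and `eval_duration`, from which the generation speed follows directly. A minimal sketch, assuming a local Ollama server on the default port and the 1.5b distill already pulled under the tag used below:

```python
# Measure generation speed via the Ollama REST API (default port 11434).
# The model tag is an assumption; use whatever `ollama list` shows locally.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "deepseek-r1:1.5b",
        "prompt": "Why is the sky blue?",
        "stream": False,
    },
).json()

# eval_count = generated tokens, eval_duration = generation time in nanoseconds
tokens_per_s = resp["eval_count"] / resp["eval_duration"] * 1e9
print(f"{tokens_per_s:.2f} tokens/s")
```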
1.5b model based on qwen2 1.5B
The Ollama website states that the Q4_K_M quantization requires 1.1 GB. The model has 29 layers, and `ollama ps` reports 1.1 to 2.0 GB of used memory, more on GPUs. It was not possible to offload all layers on a GTX 960 with 2 GB VRAM.
| CPU/GPU | ollama ps | memory_required | tokens/s |
|---|---|---|---|
| Jetson Nano | 1.1 GB | 1.0 GiB | 3.13 |
| Raspberry Pi 4 | 1.6 GB | 1.5 GiB | 3.14 |
| i3-6100 | 1.1 GB | 1.0 GiB | 15.53 |
| GTX 1060 | 2.0 GB | 1.9 GiB | 61.21 |
| RTX 3060 Ti | 2.0 GB | 1.9 GiB | 119.33 |
| RTX 3070 Ti | 2.0 GB | 1.9 GiB | 145.45 |
7b model based on qwen 7B
The website states that it requires 4.7 GB; running on a GPU it uses 6.0 GB. The 29 layers require 5.6 GiB. With 7.62 B parameters we get 83 tokens/s.
8b model based on llama 8B
The website states that it requires 4.9 GB; running on a GPU it uses 5.8 GB. The 33 layers require 5.4 GiB. With 8.03 B parameters we get 79 tokens/s.
14b model based on qwen 14B
The website states 9.0 GB for the Q4_K_M model, while `ollama ps` reports 10 to 16 GB. Splitting the model across fewer GPUs increases performance. The model has 49 layers.
| ollama ps | memory_required | tokens/s | offloaded layers | CPU/GPU | GPU VRAM (GB) |
|---|---|---|---|---|---|
| 10 GB | 9.5 GiB | 2.1 | | 100/0 | 0 |
| 10 GB | 10.0 GiB | 4.1 | 30 | 37/63 | 8 |
| 16 GB | 15.6 GiB | 10.8 | 13/12/12/12 | 0/100 | 8/6/6/6 |
| 15 GB | 14.1 GiB | 14.2 | 17/16/16 | 0/100 | 8/6/6 |
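The number of offloaded layers can be requested per call through Ollama's `num_gpu` option (the number of layers to put on the GPU, not the number of GPUs); which GPUs Ollama sees can be restricted with `CUDA_VISIBLE_DEVICES` when starting the server. A minimal sketch, assuming the same local setup as above and an assumed tag for the 14b distill:

```python
# Request a specific layer offload count for the 14b model via the API.
# num_gpu is the number of layers to place on the GPU(s), not the GPU count.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "deepseek-r1:14b",        # assumed local tag for the 14b distill
        "prompt": "Why is the sky blue?",
        "stream": False,
        "options": {"num_gpu": 30},        # e.g. 30 of the 49 layers on the GPU
    },
).json()

print(f'{resp["eval_count"] / resp["eval_duration"] * 1e9:.2f} tokens/s')
```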
32b model based on qwen 32B
The website states 20 GB for the Q4_K_M model, while `ollama ps` reports 22 to 26 GB. It currently does not fit into 4 GPUs with a combined 26 GB of VRAM. The model has 65 layers.
| ollama ps | memory_required | tokens/s | offloaded layers | CPU/GPU | GPU VRAM (GB) |
|---|---|---|---|---|---|
| 22 GB | 20.9 GiB | 0.92 | | 100 | |
| 25 GB | 23.6 GiB | 2.34 | 21/14/15 | 20/80 | 8/6/6 |
| 26 GB | 25.1 GiB | 5.11 | 19/15/15/15 | 2/98 | 8/6/6/6 |
70b model based on llama3 70b
To be tested on the E5-2696v3.
671b model
I have no hardware to run a model that needs 404 GB when quantized to 4 bit.
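That 404 GB figure is roughly what one gets from parameter count times bits per weight. A minimal sketch of the arithmetic, assuming an average of about 4.8 bits per weight for a 4-bit K-quant (the exact average depends on the quantization mix):

```python
# Rough quantized model size: parameters * average bits per weight / 8.
# 4.8 bits/weight is an assumed average for Q4_K_M-style quantization.
def quantized_size_gb(params: float, bits_per_weight: float = 4.8) -> float:
    return params * bits_per_weight / 8 / 1e9

for name, params in [("7b", 7.62e9), ("8b", 8.03e9), ("671b", 671e9)]:
    print(f"{name}: ~{quantized_size_gb(params):.1f} GB")
# 7b: ~4.6 GB, 8b: ~4.8 GB, 671b: ~402.6 GB -- close to the sizes quoted above
```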
Documentation
DeepSeek has some documentation in their GitHub profile. This includes:
- A paper describing the model, 20 pages
- A technical report, 48 pages
MLX faster than llama.cpp in Ollama
This needs further testing, but with LM Studio on my M1 Mac (16 GB) I get 45 t/s with MLX compared to 41 t/s with llama.cpp for the DeepSeek qwen2 1.5b Q4_K_M model. See the logfiles in this folder.
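For comparison outside LM Studio, the same distilled model can be run with the mlx-lm Python package on Apple Silicon. A minimal sketch, assuming the mlx-community 4-bit conversion is available under the name used below (both the model name and the exact API should be checked against the installed mlx-lm release):

```python
# Run the 1.5b distill with MLX on Apple Silicon (pip install mlx-lm).
# The model name is an assumption: any 4-bit MLX conversion of
# DeepSeek-R1-Distill-Qwen-1.5B from the mlx-community hub should work.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/DeepSeek-R1-Distill-Qwen-1.5B-4bit")
text = generate(model, tokenizer,
                prompt="Why is the sky blue?",
                max_tokens=128,
                verbose=True)  # verbose prints the measured tokens/s
print(text)
```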