GPU performance in GFLOPS
On the way to a GPGPU setup and to compare processing speeds, I tried to collect some theoretical GFLOPS data as well as measured computation speeds in half, single and double precision (FP16, FP32 and FP64) with OpenCL. In the near future, TOPS numbers with INT8 for NPUs and bf16 numbers for GPUs will become interesting with the new wave of AI and LLMs.
In many cases the limiting factor for the performance of an LLM is not the sheer processing power of the CPU or GPU, but the bandwidth to the memory. And the whole model has to fit into RAM! This limits the model size on many consumer GPUs, or makes it expensive to run these models: fast memory is expensive. I compared the speed of a few memory links, from DDR3, Ethernet and USB4 up to the memory bus of some of my graphics cards:
The difference spans several orders of magnitude, so I included a logarithmic graph to the right. Even DDR5 with dual channel is no match for GDDR6X RAM, let alone HBM (high bandwidth memory).
And one can see how much computing power is available to the average consumer today if you compare it to the processing power of supercomputers of the last millennium:
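Since token generation has to stream essentially all weights from memory for every token, the memory bandwidth gives a rough upper bound on the achievable tokens per second. A minimal sketch of that estimate (the model size and the DDR5/HBM bandwidth figures are illustrative assumptions; the RTX 3070 Ti value is taken from the table below):

```python
# Rough ceiling for memory-bandwidth-bound token generation:
# every generated token streams (roughly) all model weights once.
def tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on tokens/s when generation is bandwidth-bound."""
    return bandwidth_gb_s / model_size_gb

model_gb = 4.0  # hypothetical 7B-parameter model quantized to ~4 bit
for name, bw in [("DDR5 dual channel", 80.0),
                 ("GDDR6X (RTX 3070 Ti)", 574.81),
                 ("HBM2", 900.0)]:
    print(f"{name:22s} ~{tokens_per_second(bw, model_gb):6.1f} tokens/s (theoretical ceiling)")
```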
Benchmark with OpenCL
Using the OpenCL-Benchmark tool by Dr. Moritz Lehmann I measured several CPUs and GPUs for their performance in floating point from half to double, and integer values.
Device | FP64 double | FP32 single | FP16 half | INT64 long | INT32 int | INT16 short | INT8 char |
---|---|---|---|---|---|---|---|
units | TFLOPs/s | TFLOPs/s | TFLOPs/s | TIOPs/s | TIOPs/s | TIOPs/s | TIOPs/s |
i5 3320M | 0.000 | 0.000 | — | 0.003 | 0.016 | 0.032 | 0.018 |
E3-1226 v3 | 0.047 | 0.046 | — | 0.013 | 0.021 | 0.005 | 0.011 |
i7-6820HQ | 0.100 | 0.098 | — | 0.029 | 0.038 | 0.142 | 0.159 |
i3-10100 | 0.119 | 0.138 | — | 0.041 | 0.053 | 0.197 | 0.217 |
i7-8700 | 0.201 | 0.197 | — | 0.059 | 0.076 | 0.279 | 0.300 |
i7-13700T | 0.272 | 0.220 | 0.054 | 0.071 | 0.127 | 0.397 | 0.406 |
E5-2696 v3 | 0.280 | 0.281 | 0.076 | 0.058 | 0.125 | 0.478 | 0.514 |
🔵 HD Gen11 | — | 0.182 | 0.333 | 0.008 | 0.030 | 0.361 | 0.063 |
🔵 UHD 620 | 0.097 | 0.365 | 0.659 | 0.013 | 0.115 | 0.642 | 0.129 |
🔵 UHD 630 | 0.102 | 0.395 | 0.722 | 0.015 | 0.135 | 0.782 | 0.136 |
🔵 UHD 770 | — | 0.688 | 1.287 | 0.060 | 0.251 | 2.821 | 0.511 |
🟢 Quadro M1000M | 0.035 | 0.734 | — | 0.192 | 0.308 | 1.071 | 1.087 |
⚪ M1 GPU 8CU | — | 0.620 | — | 0.439 | 0.603 | 0.645 | 0.638 |
🟢 GTX 960 | 0.086 | 2.597 | — | 0.551 | 0.918 | 2.649 | 2.652 |
🔴 RX 470 | 0.306 | 1.218 | 4.749 | 0.686 | 0.985 | 1.920 | 1.914 |
🔴 RX 6600 | 0.570 | 8.324 | 16.641 | 0.466 | 1.845 | 7.498 | 5.564 |
🟢 T4 | 0.250 | 8.092 | — | 1.939 | 6.326 | 5.257 | 5.279 |
🟢 RTX 3060 Ti | 0.287 | 17.748 | 18.291 | 2.799 | 9.228 | 8.062 | 6.844 |
🟢 RTX 3070 Ti | 0.368 | 22.572 | 23.276 | 3.049 | 11.721 | 10.198 | 8.681 |
We also get some further details on the hardware:
Device | OpenCL | CU | Freq. | Cores | TFLOPs/s | Memory | PCIe |
---|---|---|---|---|---|---|---|
units | version | # | MHz | # | theoretical | GB/s | GB/s |
i5 3320M | 1.2 | 4 | 2600 | 2 | 0.166 | 27.65 | 6.93 |
E3-1226 v3 | 1.2 | 4 | 3300 | 2 | 0.211 | 22.11 | 8.73 |
i7-6820HQ | 3.0 | 8 | 2700 | 4 | 0.346 | 32.57 | 11.92 |
i3-10100 | 3.0 | 8 | 3600 | 4 | 0.461 | 35.49 | 13.66 |
i7-8700 | 3.0 | 12 | 3200 | 6 | 0.614 | 34.66 | 13.03 |
i7-13700T | 3.0 | 24 | 2400 | 16 | 0.000 | 42.55 | 18.39 |
E5-2696 v3 | 3.0 | 36 | 2300 | 18 | 1.325 | 8.27 | 1.56 |
🔵 HD Gen 11 | 1.2 | 16 | 750 | 128 | 0.192 | 16.13 | 6.26 |
🔵 UHD 620 | 3.0 | 24 | 1100 | 192 | 0.422 | 14.47 | 6.28 |
🔵 UHD 630 | 3.0 | 24 | 1100 | 192 | 0.422 | 29.89 | 15.30 |
🔵 UHD 770 | 1.2 | 32 | 1600 | 256 | 0.819 | 45.25 | 18.18 |
🟢 Quadro M1000M | 1.2 | 2 | 1071 | 512 | 1.097 | 71.74 | 6.35 |
⚪ M1 GPU 8CU | 1.2 | 8 | 1000 | 1024 | 2.048 | 65.54 | 18.28 |
🟢 GTX 960 | 1.2 | 8 | 1266 | 1024 | 2.593 | 97.41 | 6.91 |
🔴 RX 470 | 2.0 | 32 | 1226 | 2048 | 5.022 | 193.25 | 6.40 |
🔴 RX 6600 | 2.0 | 16 | 2044 | 1792 | 7.326 | 204.61 | 4.57 |
🟢 T4 | 1.2 | 40 | 1590 | 2560 | 8.141 | 245.42 | 4.74 |
🟢 RTX 3060 Ti | 1.2 | 38 | 1665 | 4864 | 16.197 | 423.68 | 9.83 |
🟢 RTX 3070 Ti | 1.2 | 48 | 1770 | 6144 | 21.750 | 574.81 | 8.76 |
Example output:
.-----------------------------------------------------------------------------.
|----------------.------------------------------------------------------------|
| Device ID 0 | Tesla T4 |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID | 0 |
| Device Name | Tesla T4 |
| Device Vendor | NVIDIA Corporation |
| Device Driver | 535.104.05 (Linux) |
| OpenCL Version | OpenCL C 1.2 |
| Compute Units | 40 at 1590 MHz (2560 cores, 8.141 TFLOPs/s) |
| Memory, Cache | 15102 MB, 1280 KB global / 48 KB local |
| Buffer Limits | 3775 MB global, 64 KB constant |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled. |
| FP64 compute 0.250 TFLOPs/s (1/32) |
| FP32 compute 8.092 TFLOPs/s ( 1x ) |
| FP16 compute not supported |
| INT64 compute 1.939 TIOPs/s (1/4 ) |
| INT32 compute 6.326 TIOPs/s (2/3 ) |
| INT16 compute 5.257 TIOPs/s (2/3 ) |
| INT8 compute 5.279 TIOPs/s (2/3 ) |
| Memory Bandwidth ( coalesced read ) 245.42 GB/s |
| Memory Bandwidth ( coalesced write) 215.51 GB/s |
| Memory Bandwidth (misaligned read ) 260.63 GB/s |
| Memory Bandwidth (misaligned write) 84.02 GB/s |
| PCIe Bandwidth (send ) 4.74 GB/s |
| PCIe Bandwidth ( receive ) 4.53 GB/s |
| PCIe Bandwidth ( bidirectional) (Gen3 x16) 4.13 GB/s |
|-----------------------------------------------------------------------------|
|-----------------------------------------------------------------------------|
| Done. Press Enter to exit. |
'-----------------------------------------------------------------------------'
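If you only need the device details from the second table (compute units, clock, memory) without running the full benchmark, they can also be queried directly. A minimal sketch using the pyopencl bindings (an assumption on my part; any OpenCL binding works the same way):

```python
import pyopencl as cl  # pip install pyopencl

# Print the basic figures for every OpenCL device in the system.
for platform in cl.get_platforms():
    for dev in platform.get_devices():
        print(dev.name.strip())
        print(f"  OpenCL version : {dev.version}")
        print(f"  Compute units  : {dev.max_compute_units}")
        print(f"  Max clock      : {dev.max_clock_frequency} MHz")
        print(f"  Global memory  : {dev.global_mem_size / 1e9:.1f} GB")
```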
Below is some text from 2020, when I started to collect this information:
FP32 single precision
Let’s assume these are the maximum possible raw performance figures in 32-bit single precision (FP32):
- PS2: 16 pixel shaders
- PS3
- PS4: 1840 GFLOPS, 18 CU, 8 GB GDDR5 memory at 5500 MT/s
- PS4 Pro: 4198 GFLOPS, 36 CU, 32 ROPs, 144 TMUs, 2304 cores, 256-bit bus, 217.6 GB/s
- PS5
- RX 470: 3793 GFLOPS
- Apple M1: 2600 GFLOPS (8 GPU cores, 128 execution units, handles nearly 25,000 threads)
- Xbox 360
- Xbox One S
- Xbox Series S
- Xbox Series X
In many cases it can simply be calculated from the CPU architecture and the frequency. For example, my dual Xeon X5550 at 2.67 GHz does 8 FLOPs per cycle per core (Nehalem EP), which results in 2.67 x 8 = 21.36 GFLOPS per core, or about 171 GFLOPS for the whole dual-socket system with its 8 cores.
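A minimal sketch of that back-of-the-envelope calculation (the FLOPs-per-cycle figure is an architecture-dependent assumption, here 8 for Nehalem's 128-bit SSE with one add and one multiply per cycle):

```python
def theoretical_gflops(cores: int, clock_ghz: float, flops_per_cycle: int) -> float:
    """Peak GFLOPS = cores x clock (GHz) x FLOPs per core per cycle."""
    return cores * clock_ghz * flops_per_cycle

print(theoretical_gflops(1, 2.67, 8))  # ~21.4 GFLOPS per core
print(theoretical_gflops(8, 2.67, 8))  # ~171 GFLOPS for a dual X5550 (2 x 4 cores)
```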
FP64 double precision
This is a tricky one. In the 20th century this was the precision for scientific calculations and models, used for weather forecasts or earth simulations. I guess it originates with the FE (finite element) approach to modelling nature: the more parameters and details in the model, the more granularity you get and the better it represents reality. The flipside: it needs exponentially more computing power. That’s why the number of FLoating Point OPerations per Second (FLOPS) became the unit of measurement for the speed of a supercomputer, and double precision (fp64) was simply the implied standard.
Having fewer bits per element saves storage, lets more data be transported per cycle, and is also faster to compute; the compute units themselves get smaller too. In the early 2000s we started to use neural networks. Initially the nodes and parameters were stored in fp64, but that precision was not needed. The move to fp32 was swift: twice the data could be processed and the hardware simplified. Google built its TPU (Tensor Processing Unit), an early neural processing unit, and realized that fp32 is still too much. But fp16 does not offer the needed orders of magnitude. Realizing that magnitude (the exponent of a floating point number) matters more than the significant digits (the mantissa), a new floating point format was introduced: bf16 (brain float), which keeps the same number of exponent bits as fp32 but reduces the precision to fit into just 16 bits. Again halved.
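A small sketch to illustrate the point: converting fp32 to bf16 essentially amounts to dropping the lower 16 bits, so the sign and the full 8-bit exponent survive while only 7 mantissa bits remain (pure-Python bit fiddling, truncation instead of proper rounding, just for illustration):

```python
import struct

def fp32_to_bf16_bits(x: float) -> int:
    """Truncate an fp32 value to bf16: keep sign + 8-bit exponent + 7 mantissa bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return bits >> 16

def bf16_bits_to_fp32(b: int) -> float:
    """Expand a bf16 bit pattern back to fp32 by zero-filling the lost mantissa bits."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]

x = 3.14159265
print(bf16_bits_to_fp32(fp32_to_bf16_bits(x)))  # ~3.140625: same magnitude, less precision
```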
Then came the transformer models, and the Generative Pretrained Transformers (GPT) from 1.0 over 2.0 and 3.0 to a special edition of 3.5 in the form of ChatGPT. Again it became clear: the more parameters, the better the model. Yet another observation was made: the precision could be reduced even further! The original bf16 or fp16 weights could be quantized to int8 or less, maybe int4? A model quantized from fp32 or bf16 down to int8 is two to four times smaller and fits on some consumer graphics cards. Token generation (evaluation) after prompt processing (PP) is usually only memory-bandwidth constrained, so the answer is also generated correspondingly faster. In general it’s better to have a model with more parameters, quantized so it fits into RAM, than a model with fewer parameters at full precision of the weights. What a time!
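A quick back-of-the-envelope comparison of the weight storage needed per precision (the 7-billion-parameter count is an illustrative assumption):

```python
# Bytes per weight for the common formats.
BYTES_PER_WEIGHT = {"fp64": 8, "fp32": 4, "bf16": 2, "fp16": 2, "int8": 1, "int4": 0.5}

params = 7e9  # hypothetical 7-billion-parameter model
for fmt, nbytes in BYTES_PER_WEIGHT.items():
    print(f"{fmt:5s}: {params * nbytes / 1e9:5.1f} GB")
```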
And consumer graphics cards are notoriously slow in fp64, sometimes a quarter, 1/8 or even 1/64 of the fp32 performance. I guess that’s intentional, but it’s also not surprising since fp64 is not needed for 3D games.
Examples:
- RX 470: 237 GFLOPS in fp64 versus 3793 in fp32
- RTX 3060 Ti: 287 GFLOPS in fp64 versus 17748 in fp32, that’s about 62x slower
CPUs generally show the same performance in both, since they operate on 64 bits anyway, and feeding only 32-bit values through OpenCL does not seem to execute two operations in parallel.