GPU performance in TFLOPS and TIOPS

This page uses two benchmarks:

The first test is taken from https://github.com/ProjectPhysX/OpenCL-Benchmark. If you get an error message about a missing msvcp140.dll, you may have to install the latest Microsoft Visual C++ Redistributable first (with permalink).

Device FP64 (double) FP32 (single) FP16 (half) INT64 (long) INT32 (int) INT16 (short) INT8 (char)
units TFLOPs/s TFLOPs/s TFLOPs/s TIOPs/s TIOPs/s TIOPs/s TIOPs/s
i5 3320M 0.000 0.000 0.003 0.016 0.032 0.018
E3-1226 v3 0.047 0.046 0.013 0.021 0.005 0.011
i7-6820HQ 0.100 0.098 0.029 0.038 0.142 0.159
i3-10100 0.119 0.138 0.041 0.053 0.197 0.217
i7-8700 0.201 0.197 0.059 0.076 0.279 0.300
i7-13700T 0.272 0.220 0.054 0.071 0.127 0.397 0.406
E5-2696 v3 0.280 0.281 0.076 0.058 0.125 0.478 0.514
🔵 HD Gen11 0.182 0.333 0.008 0.030 0.361 0.063
🔵 UHD 620 0.097 0.365 0.659 0.013 0.115 0.642 0.129
🔵 UHD 630 0.102 0.395 0.722 0.015 0.135 0.782 0.136
🔵 UHD 770 0.700 1.292 0.060 0.252 2.838 2.881
🟢 Quadro M1000M 0.035 0.734 0.192 0.308 1.071 1.087
⚪ M1 GPU 8CU 0.620 0.439 0.603 0.645 0.638
🟢 GTX 960 0.086 2.597 0.551 0.918 2.649 2.652
🔴 RX 470 0.306 1.218 4.749 0.686 0.985 1.920 1.914
🟢 P106-100 0.151 4.526 0.076 0.859 1.512 4.542 16.395
🟢 GTX 1060 0.149 4.466 0.075 0.821 1.435 4.465 4.496
🟢 GTX 1070 0.225 6.710 0.113 1.254 2.182 6.549 23.718
🟢 P104-100 0.223 6.657 0.111 1.439 2.239 6.673 24.380
🔴 RX 6600 0.570 8.324 16.641 0.466 1.845 7.498 5.564
🟢 T4 0.250 8.092 1.939 6.326 5.257 5.279
🟢 RTX 3060 Ti 0.287 17.748 18.291 2.799 9.228 8.062 6.844
🟢 RTX 3070 Ti 0.369 22.572 23.276 3.049 11.721 10.198 43.704

Specification:

Device OpenCL CU Freq. Cores TFLOPs/s Memory PCIe
units version # MHz # theoretical GB/s GB/s
i5 3320M 1.2 4 2600 2 0.166 27.65 6.93
E3-1226 v3 1.2 4 3300 2 0.211 22.11 8.73
i7-6820HQ 3.0 8 2700 4 0.346 32.57 11.92
i3-10100 3.0 8 3600 4 0.461 35.49 13.66
i7-8700 3.0 12 3200 6 0.614 34.66 13.03
i7-13700T 3.0 24 2400 16 0.000 42.55 18.39
E5-2696 v3 3.0 36 2300 18 1.325 8.27 1.56
🔵 HD Gen 11 1.2 16 750 128 0.192 16.13 6.26
🔵 UHD 620 3.0 24 1100 192 0.422 14.47 6.28
🔵 UHD 630 3.0 23 1100 184 0.405 29.89 15.30
🔵 UHD 770 3.0 32 1600 256 0.819 45.25 20.12
🟢 Quadro M1000M 1.2 2 1071 512 1.097 71.74 6.35
⚪ M1 GPU 8CU 1.2 8 1000 1024 2.048 65.54 18.28
🟢 GTX 960 1.2 8 1266 1024 2.593 97.41 6.91
🔴 RX 470 2.0 32 1226 2048 5.022 193.25 6.40
🟢 P106-100 3.0 10 1708 1280 4.372 175.52 3.33
🟢 GTX 1060 1.2 10 1708 1280 4.372 162.14 6.95
🟢 GTX 1070 3.0 15 1683 1920 6.463 220.41 3.24
🟢 P104-100 3.0 15 1733 1920 6.655 314.02 0.84
🔴 RX 6600 2.0 16 2044 1792 7.326 204.61 4.57
🟢 T4 1.2 40 1590 2560 8.141 245.42 4.74
🟢 RTX 3060 Ti 1.2 38 1665 4864 16.197 423.68 9.83
🟢 RTX 3070 Ti 3.0 48 1770 6144 21.750 574.81 8.76
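
To see how close a measured value comes to the theoretical one, the FP32 column of the first table can be divided by the theoretical column of the specification table. The sketch below does that in the shell with awk. `efficiency_pct` is a hypothetical helper name, and the examples only use rows that list all seven measured values, so the FP32 column is unambiguous.

```shell
# efficiency_pct is a hypothetical helper, not part of either benchmark.
# $1 = measured TFLOPs/s, $2 = theoretical TFLOPs/s; prints the ratio in percent.
efficiency_pct() {
    awk -v m="$1" -v t="$2" 'BEGIN {
        if (t <= 0) { print "n/a"; exit }   # e.g. the i7-13700T row lists 0.000
        printf "%.1f\n", 100 * m / t
    }'
}

efficiency_pct 22.572 21.750   # RTX 3070 Ti: measured FP32 vs theoretical
efficiency_pct 1.218 5.022     # RX 470: measured FP32 vs theoretical
efficiency_pct 0.220 0.000     # i7-13700T: no theoretical value listed
```

A ratio far below 100 may point to a driver or runtime limitation rather than to the hardware itself, but that is an interpretation and not something the tables state.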

I need more time to find software and do the measurements, but I was inspired by the comparison of the graphics performance of my PS4 Pro to other consoles in 2020. The results above were taken in early 2024.

Install extra drivers

The benchmark prints these instructions if it finds no CPU or GPU for OpenCL. They are taken from the opencl.hpp file.

AMD GPU

AMD GPU Drivers, which contain the OpenCL Runtime

sudo apt update && sudo apt upgrade -y
sudo apt install -y g++ git make ocl-icd-libopencl1 ocl-icd-opencl-dev
mkdir -p ~/amdgpu
wget -P ~/amdgpu https://repo.radeon.com/amdgpu-install/6.1.3/ubuntu/jammy/amdgpu-install_6.1.60103-1_all.deb
sudo apt install -y ~/amdgpu/amdgpu-install*.deb
sudo amdgpu-install -y --usecase=graphics,rocm,opencl --opencl=rocr
sudo usermod -a -G render,video $(whoami)
rm -r ~/amdgpu
sudo shutdown -r now

Intel GPU

Intel GPU Drivers are already installed; only the OpenCL Runtime is needed

sudo apt update && sudo apt upgrade -y
sudo apt install -y g++ git make ocl-icd-libopencl1 ocl-icd-opencl-dev intel-opencl-icd
sudo usermod -a -G render $(whoami)
sudo shutdown -r now
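
The `usermod` step only takes effect for new login sessions, which is why the block ends with a reboot. A minimal sketch for checking the group membership afterwards follows. `in_group` is a hypothetical helper name, and the optional second argument exists only so the check can be tried without changing any account.

```shell
# in_group is a hypothetical helper: succeeds if group $1 appears in the
# space-separated list $2 (default: the current user's groups from `id -nG`).
in_group() {
    groups_list="${2:-$(id -nG)}"
    for g in $groups_list; do
        [ "$g" = "$1" ] && return 0
    done
    return 1
}

if in_group render; then
    echo "render: ok"
else
    echo "render: missing, log in again or reboot"
fi
```

For the AMD instructions the same check can be repeated with `video`.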

Nvidia GPU

Nvidia GPU Drivers, which contain the OpenCL Runtime

sudo apt update && sudo apt upgrade -y
sudo apt install -y g++ git make ocl-icd-libopencl1 ocl-icd-opencl-dev nvidia-driver-550
sudo shutdown -r now

Intel and AMD CPU Runtime for OpenCL

CPU Option 1: Intel CPU Runtime for OpenCL (works for both AMD/Intel CPUs)

export OCLV="2024.18.6.0.02_rel"
export TBBV="2021.13.0"
sudo apt update && sudo apt upgrade -y
sudo apt install -y g++ git make ocl-icd-libopencl1 ocl-icd-opencl-dev
sudo mkdir -p ~/cpurt /opt/intel/oclcpuexp_${OCLV} /etc/OpenCL/vendors /etc/ld.so.conf.d
sudo wget -P ~/cpurt https://github.com/intel/llvm/releases/download/2024-WW25/oclcpuexp-${OCLV}.tar.gz
sudo wget -P ~/cpurt https://github.com/oneapi-src/oneTBB/releases/download/v${TBBV}/oneapi-tbb-${TBBV}-lin.tgz
sudo tar -zxvf ~/cpurt/oclcpuexp-${OCLV}.tar.gz -C /opt/intel/oclcpuexp_${OCLV}
sudo tar -zxvf ~/cpurt/oneapi-tbb-${TBBV}-lin.tgz -C /opt/intel
echo /opt/intel/oclcpuexp_${OCLV}/x64/libintelocl.so | sudo tee /etc/OpenCL/vendors/intel_expcpu.icd
echo /opt/intel/oclcpuexp_${OCLV}/x64 | sudo tee /etc/ld.so.conf.d/libintelopenclexp.conf
sudo ln -sf /opt/intel/oneapi-tbb-${TBBV}/lib/intel64/gcc4.8/libtbb.so /opt/intel/oclcpuexp_${OCLV}/x64
sudo ln -sf /opt/intel/oneapi-tbb-${TBBV}/lib/intel64/gcc4.8/libtbbmalloc.so /opt/intel/oclcpuexp_${OCLV}/x64
sudo ln -sf /opt/intel/oneapi-tbb-${TBBV}/lib/intel64/gcc4.8/libtbb.so.12 /opt/intel/oclcpuexp_${OCLV}/x64
sudo ln -sf /opt/intel/oneapi-tbb-${TBBV}/lib/intel64/gcc4.8/libtbbmalloc.so.2 /opt/intel/oclcpuexp_${OCLV}/x64
sudo ldconfig -f /etc/ld.so.conf.d/libintelopenclexp.conf
sudo rm -r ~/cpurt

CPU Option 2: PoCL

sudo apt update && sudo apt upgrade -y
sudo apt install -y g++ git make ocl-icd-libopencl1 ocl-icd-opencl-dev pocl-opencl-icd
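
After any of the installs above, one way to see which OpenCL runtimes are registered is to look at the ICD files in /etc/OpenCL/vendors, the directory that CPU Option 1 writes to. The following is a sketch, not part of the original instructions. `list_icds` is a hypothetical helper name, and it assumes each ICD file names its library on the first line, as the `echo ... | sudo tee` step above does.

```shell
# list_icds is a hypothetical helper. It prints every registered ICD file
# together with the library named on its first line.
# $1 = directory to scan (default: /etc/OpenCL/vendors)
list_icds() {
    dir="${1:-/etc/OpenCL/vendors}"
    found=0
    for f in "$dir"/*.icd; do
        [ -e "$f" ] || continue          # the glob matched nothing
        found=1
        printf '%s -> %s\n' "${f##*/}" "$(head -n 1 "$f")"
    done
    [ "$found" -eq 1 ] || echo "no ICD files in $dir"
}

list_icds
```

If the list is empty, the benchmark will most likely report that no OpenCL device was found.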

Run clpeak Benchmark

The OpenCL benchmark clpeak was created by Krishnaraj Bhat and others. To build and run it:

mkdir -p ~/workspace && cd ~/workspace
git clone https://github.com/krrishnarraj/clpeak.git
cd clpeak
git submodule update --init --recursive --remote
mkdir build && cd build
cmake ..
make -j4
./clpeak

The results labeled with 2, 4, 8 and 16 show the vector width used for the multiplications. The first result of each group is the scalar case with a width of 1.

FP32 single

GPUs of the 2010s and 2020s are optimized for FP32 (single precision), so their computing power in GFLOPS is usually quoted for this size. FP64 (double precision) is significantly slower on them. FP64 is the size in which supercomputers from the 1970s to the 1990s were measured for scientific calculations.

With the boom of AI and transformer models in machine learning, edge computing is done in INT8, which is significantly easier to implement and more power efficient. NPUs in smartphones have used this metric for their speed since 2015. Since 2024 Microsoft has called PCs with an NPU of at least 40 TOPS Copilot+ PCs, intended to better support Copilot.

This is an old list of mine for which I only collected data and did not measure anything myself:

Let’s assume this is the maximum possible raw performance in FP32 (32 bit, single precision).

In many cases it can simply be calculated from the CPU architecture and the frequency. For example, my dual Xeon X5550 with 2.67 GHz has a multiplier of 8 (Nehalem EP), which results in 2.67 x 8 = 21.36 GFLOPS per core.
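
The calculation above can be sketched as a small shell function. `peak_tflops` is a hypothetical helper name. The per-cycle factors for the table rows are an assumption read back from the Specification table, not something stated on this page: 32 reproduces the CPU rows (6 x 3200 x 32 gives 0.614 for the i7-8700) and 2 reproduces the GPU rows (1280 x 1708 x 2 gives 4.372 for the GTX 1060).

```shell
# Theoretical peak in TFLOPs/s = cores x clock (MHz) x FLOPs per cycle per core / 1e6.
# peak_tflops is a hypothetical helper, not part of any benchmark.
# $1 = cores, $2 = clock in MHz, $3 = FLOPs per cycle per core
peak_tflops() {
    awk -v cores="$1" -v mhz="$2" -v fpc="$3" \
        'BEGIN { printf "%.3f\n", cores * mhz * fpc / 1e6 }'
}

peak_tflops 1 2670 8      # one Nehalem core at 2.67 GHz: 21.36 GFLOPS, printed as 0.021
peak_tflops 6 3200 32     # i7-8700 row (factor 32 is an assumption)
peak_tflops 1280 1708 2   # GTX 1060 row (factor 2 is an assumption)
```

Multiplying the per-core value by the number of cores gives the figure for a whole CPU, and once more by the number of sockets for a dual-socket system.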

FP64 double

FP64 is the traditional metric for supercomputers. Refinement follows. OpenCL might also not be the best way to measure: for example, the T4 scores 5.2 TOPS in INT8 with OpenCL, but is actually capable of 130 TOPS with its 320 Turing Tensor Cores.