GPU performance in TFLOPS and TIOPs

This page uses two benchmarks:

OpenCL-Benchmark from Project PhysX by Dr. Lehmann
clpeak benchmark

This test is taken from https://github.com/ProjectPhysX/OpenCL-Benchmark. If you get an error message about the missing msvcp140.dll you might have to install the latest Microsoft Visual C++ Redistributable first (with permalink).

Device	FP64 double	FP32 single	FP16 half	INT64 long	INT32 int	INT16 short	INT8 char
units	TFLOPs/s	TFLOPs/s	TFLOPs/s	TIOPs/s	TIOPs/s	TIOPs/s	TIOPs/s
i5 3320M	0.000	0.000	—	0.003	0.016	0.032	0.018
E3-1226 v3	0.047	0.046	—	0.013	0.021	0.005	0.011
i7-6820HQ	0.100	0.098	—	0.029	0.038	0.142	0.159
i3-10100	0.119	0.138	—	0.041	0.053	0.197	0.217
i7-8700	0.201	0.197	—	0.059	0.076	0.279	0.300
i7-13700T	0.272	0.220	0.054	0.071	0.127	0.397	0.406
E5-2696 v3	0.280	0.281	0.076	0.058	0.125	0.478	0.514
🔵 HD Gen11	—	0.182	0.333	0.008	0.030	0.361	0.063
🔵 UHD 620	0.097	0.365	0.659	0.013	0.115	0.642	0.129
🔵 UHD 630	0.102	0.395	0.722	0.015	0.135	0.782	0.136
🔵 UHD 770	—	0.700	1.292	0.060	0.252	2.838	2.881
🟢 Quadro M1000M	0.035	0.734	—	0.192	0.308	1.071	1.087
⚪ M1 GPU 8CU	—	0.620	—	0.439	0.603	0.645	0.638
🟢 GTX 960	0.086	2.597	—	0.551	0.918	2.649	2.652
🔴 RX 470	0.306	1.218	4.749	0.686	0.985	1.920	1.914
🟢 P106-100	0.151	4.526	0.076	0.859	1.512	4.542	16.395
🟢 GTX 1060	0.149	4.466	0.075	0.821	1.435	4.465	4.496
🟢 GTX 1070	0.225	6.710	0.113	1.254	2.182	6.549	23.718
🟢 P104-100	0.223	6.657	0.111	1.439	2.239	6.673	24.380
🔴 RX 6600	0.570	8.324	16.641	0.466	1.845	7.498	5.564
🟢 T4	0.250	8.092	—	1.939	6.326	5.257	5.279
🟢 RTX 3060 Ti	0.287	17.748	18.291	2.799	9.228	8.062	6.844
🟢 RTX 3070 Ti	0.369	22.572	23.276	3.049	11.721	10.198	43.704

Specification:

Device	OpenCL	CU	Freq.	Cores	TFLOPs/s	Memory	PCIe
units	version	#	MHz	#	theorerical	GB/s	GB/s
i5 3320M	1.2	4	2600	2	0.166	27.65	6.93
E3-1226 v3	1.2	4	3300	2	0.211	22.11	8.73
i7-6820HQ	3.0	8	2700	4	0.346	32.57	11.92
i3-10100	3.0	8	3600	4	0.461	35.49	13.66
i7-8700	3.0	12	3200	6	0.614	34.66	13.03
i7-13700T	3.0	24	2400	16	0.000	42.55	18.39
E5-2696 v3	3.0	36	2300	18	1.325	8.27	1.56
🔵 HD Gen 11	1.2	16	750	128	0.192	16.13	6.26
🔵 UHD 620	3.0	24	1100	192	0.422	14.47	6.28
🔵 UHD 630	3.0	23	1100	184	0.405	29.89	15.30
🔵 UHD 770	3.0	32	1600	256	0.819	45.25	20.12
🟢 Quadro M1000M	1.2	2	1071	512	1.097	71.74	6.35
⚪ M1 GPU 8CU	1.2	8	1000	1024	2.048	65.54	18.28
🟢 GTX 960	1.2	8	1266	1024	2.593	97.41	6.91
🔴 RX 470	2.0	32	1226	2048	5.022	193.25	6.40
🟢 P106-100	3.0	10	1708	1280	4.372	175.52	3.33
🟢 GTX 1060	1.2	10	1708	1280	4.372	162.14	6.95
🟢 GTX 1070	3.0	15	1683	1920	6.463	220.41	3.24
🟢 P104-100	3.0	15	1733	1920	6.655	314.02	0.84
🔴 RX 6600	2.0	16	2044	1792	7.326	204.61	4.57
🟢 T4	1.2	40	1590	2560	8.141	245.42	4.74
🟢 RTX 3060 Ti	1.2	38	1665	4864	16.197	423.68	9.83
🟢 RTX 3070 Ti	3.0	48	1770	6144	21.750	574.81	8.76

I need more time to find software and do the measurements, but I was inpired by the comparison of the graphics performance of my PS4 Pro to other consoles in 2020. Above results are taken early 2024.

Install extra drivers

The instructions are printed out if no CPU or GPU for OpenCL is found. They are in the opencl.hpp file.

AMD GPU

AMD GPU Drivers, which contain the OpenCL Runtime

sudo apt update && sudo apt upgrade -y
sudo apt install -y g++ git make ocl-icd-libopencl1 ocl-icd-opencl-dev
mkdir -p ~/amdgpu
wget -P ~/amdgpu https://repo.radeon.com/amdgpu-install/6.1.3/ubuntu/jammy/amdgpu-install_6.1.60103-1_all.deb
sudo apt install -y ~/amdgpu/amdgpu-install*.deb
sudo amdgpu-install -y --usecase=graphics,rocm,opencl --opencl=rocr
sudo usermod -a -G render,video $(whoami)
rm -r ~/amdgpu
sudo shutdown -r now

Intel GPU

Intel GPU Drivers are already installed, only the OpenCL Runtime is needed

sudo apt update && sudo apt upgrade -y
sudo apt install -y g++ git make ocl-icd-libopencl1 ocl-icd-opencl-dev intel-opencl-icd
sudo usermod -a -G render $(whoami)
sudo shutdown -r now

Nvidia GPU

Nvidia GPU Drivers, which contain the OpenCL Runtime

sudo apt update && sudo apt upgrade -y
sudo apt install -y g++ git make ocl-icd-libopencl1 ocl-icd-opencl-dev nvidia-driver-550
sudo shutdown -r now

Intel and AMD CPU Runtime for OpenCL

CPU Option 1: Intel CPU Runtime for OpenCL (works for both AMD/Intel CPUs)

export OCLV="2024.18.6.0.02_rel"
export TBBV="2021.13.0"
sudo apt update && sudo apt upgrade -y
sudo apt install -y g++ git make ocl-icd-libopencl1 ocl-icd-opencl-dev
sudo mkdir -p ~/cpurt /opt/intel/oclcpuexp_${OCLV} /etc/OpenCL/vendors /etc/ld.so.conf.d
sudo wget -P ~/cpurt https://github.com/intel/llvm/releases/download/2024-WW25/oclcpuexp-${OCLV}.tar.gz
sudo wget -P ~/cpurt https://github.com/oneapi-src/oneTBB/releases/download/v${TBBV}/oneapi-tbb-${TBBV}-lin.tgz
sudo tar -zxvf ~/cpurt/oclcpuexp-${OCLV}.tar.gz -C /opt/intel/oclcpuexp_${OCLV}
sudo tar -zxvf ~/cpurt/oneapi-tbb-${TBBV}-lin.tgz -C /opt/intel
echo /opt/intel/oclcpuexp_${OCLV}/x64/libintelocl.so | sudo tee /etc/OpenCL/vendors/intel_expcpu.icd
echo /opt/intel/oclcpuexp_${OCLV}/x64 | sudo tee /etc/ld.so.conf.d/libintelopenclexp.conf
sudo ln -sf /opt/intel/oneapi-tbb-${TBBV}/lib/intel64/gcc4.8/libtbb.so /opt/intel/oclcpuexp_${OCLV}/x64
sudo ln -sf /opt/intel/oneapi-tbb-${TBBV}/lib/intel64/gcc4.8/libtbbmalloc.so /opt/intel/oclcpuexp_${OCLV}/x64
sudo ln -sf /opt/intel/oneapi-tbb-${TBBV}/lib/intel64/gcc4.8/libtbb.so.12 /opt/intel/oclcpuexp_${OCLV}/x64
sudo ln -sf /opt/intel/oneapi-tbb-${TBBV}/lib/intel64/gcc4.8/libtbbmalloc.so.2 /opt/intel/oclcpuexp_${OCLV}/x64
sudo ldconfig -f /etc/ld.so.conf.d/libintelopenclexp.conf
sudo rm -r ~/cpurt

CPU Option 2: PoCL

sudo apt update && sudo apt upgrade -y
sudo apt install -y g++ git make ocl-icd-libopencl1 ocl-icd-opencl-dev pocl-opencl-icd

Run `clpeak` Benchmark

This OpenCL benchmark clpeak is created by Krishnaraj Bhat and others. To install you need to:

cd ~/ && mkdir workspace && cd workspace
git clone https://github.com/krrishnarraj/clpeak.git
cd clpeak
git submodule update --init --recursive --remote
mkdir build && cd build
cmake ..
make -j4
./clpeak

The results with 2, 4, 8 and 16 show the vector width for the multiplications, starting with 1.

FP32 single

The performance of GPUs in the 2010 and 2020 years is optimized for FP32 single precision. Therefore the computing power in GFLOPS is often indicated for this size. FP64 or double precision is significantly slower. That is the size supercomputer in the 1970 to 1990 were measured in for scientifc calculations.

With the boom of AI and transformer models in machine learning edge computing is done in INT8 which is significantly easier to implement and more power efficient. NPUs since 2015 in smartphones use this metric for their speed. Since 2024 Microsoft calls PCs with at least 40 TOPS an AI PC to better support Copilot.

This is an old list of mine where I just collected data, but not measured:

Let’s assume this is possible max raw performance in long (32 bit or single) FP32

PS2 GFLOPS 16 Pixel shaders
PS3
PS4 1840 GFLOPS 18 CU, 8 GB GDDR5 memory 5500 MT/s
PS4 Pro 4198 GFLOPS 36 CU, 32 ROPs, 144 TMUs, 2304 Cores, 256 bit bus, 217.6 GB/s
PS5
RX 470 3793 GFLOPS
Apple M1 2600 GFLOPS (8-core, 128 CU or execution units, handle nearly 25,000 threads
XBOX 360
Xbos one S
XBox Series S
XBox Series X

In many cases it can be simple calculated by the CPU architecture and the frequency. For example my dual Xeon X5550 with 2.67 GHz has a multiplier of 8 (Nehalem EP) which results in 2.67 x 8 = 21.36 gflops.

FP64 double

RX 470 237 GFLOPS

The old value for supercomputers. Refinement follows. And OpenCL might not be the best option. For example the T4 scores 5.2 TOPS in INT8 with OpenCL, but is actually capable of 130 TOPS with the 320 Turing Tensor cores.