
ML - machine learning

This is just a documentation of my learning progress.

2018 - Start with Object Detection

Inspired by object detection for cars with DarkNet (see this TED talk from 2017 by Joseph Redmon) and by David’s bachelor thesis at HCMUTE involving a car at the end of 2018, I started to learn more about machine learning.

PoseNet runs on TensorFlow Lite in the browser on WebGL, even on a smartphone. We tested it in December 2018 in Seoul, Korea. In March 2019 I got TensorFlow.js running on my RX 470 at 43 fps.

PoseNet in the park

In 2019 NVIDIA announced the Jetson Nano developer kit, and with students from AISVN we tried to win one in a competition. Eventually we ordered a package.

Jetson Nano car

In early 2020 supply chain issues delayed the order, but we finally got the hardware. Now it needed to be put together - and development stalled until 2024.

Facemesh example

Schedule for 2020

In this article Harsheev Desai describes his journey to becoming a certified TensorFlow Developer in five months.

1. Learn Python

2. Learn Machine Learning Theory

3. Learn Data Science Libraries

Some of these libraries are Pandas (data manipulation and analysis), NumPy (multi-dimensional arrays and matrices), Matplotlib (plotting) and Scikit-learn (creating ML models) - a short sketch of them working together follows after this list.

4. Deep Learning Theory

5. TensorFlow Certificate
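
As a taste of step 3, here is a minimal sketch of how these libraries typically work together, using scikit-learn's bundled Iris dataset as a stand-in for real data:

```python
# Minimal sketch: Pandas + NumPy + Matplotlib + Scikit-learn on the Iris dataset.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = load_iris(as_frame=True)          # Pandas DataFrame with features and target
df = iris.frame

X_train, X_test, y_train, y_test = train_test_split(
    df[iris.feature_names], df["target"], test_size=0.2, random_state=42
)

model = RandomForestClassifier().fit(X_train, y_train)         # Scikit-learn model
print("accuracy:", np.round(model.score(X_test, y_test), 3))   # NumPy for the numbers

# Matplotlib (via Pandas) for a quick look at the data
df.plot.scatter(x=iris.feature_names[0], y=iris.feature_names[2], c="target", cmap="viridis")
plt.show()
```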

One reason to choose TensorFlow can be seen in this graph of its popularity on Stack Overflow:

TensorFlow popularity on Stack Overflow

More about the certificate here on Medium. It was introduced in March 2020, but by 2024 it no longer exists.

2022 - Teach ML in Advanced Automation at SSIS in Unit 5

As covered in SSIS Stories in March 2022, we made great progress in creating our own neural network, training it and then running inference on it. See also our website.

Thinking about possible learning experiences, we tried a few with our students:

2024 - Start with LLMs

Andrej Karpathy offers a step-by-step guide to build your own Generative Pre-trained Transformer (GPT), starting with 1,000,000 characters from Shakespeare that you can train on your own GPU. Well, at least if it supports CUDA Compute Capability >= 7.0; otherwise the Triton compiler throws an error (like on my slightly older GTX 960):

torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
RuntimeError: Found NVIDIA GeForce GTX 960 which is too old to be supported by the triton GPU compiler, which is used
as the backend. Triton only supports devices of CUDA Capability >= 7.0, but your device is of CUDA capability 5.2
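
A quick way to avoid this crash is to ask PyTorch for the device's compute capability before calling torch.compile(). This is a minimal sketch of my workaround (assuming a PyTorch 2.x install with CUDA); it is not part of Karpathy's repository:

```python
import torch

def maybe_compile(model: torch.nn.Module) -> torch.nn.Module:
    """Use torch.compile() (inductor/Triton backend) only if the GPU has CC >= 7.0."""
    if torch.cuda.is_available():
        major, minor = torch.cuda.get_device_capability(0)
        print(f"{torch.cuda.get_device_name(0)}: compute capability {major}.{minor}")
        if (major, minor) >= (7, 0):
            return torch.compile(model)
        print("Too old for the Triton compiler - staying in eager mode")
    return model

# Tiny stand-in module; in nanoGPT this would be the GPT model instance
model = maybe_compile(torch.nn.Linear(8, 8))
```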

Let’s see which GPUs I have and what CUDA Compute Capability (CC) they support:

| GPU | Cores | CC | Host system | Architecture | RAM (GB) |
|---|---:|---:|---|---|---:|
| Quadro FX 580 | 32 | 1.1 | HP Z600 | Tesla (2006) | 0.5 |
| GTX 650 | 384 | 3.0 | E3-1226 v3 | Kepler (2012) | 1 |
| GT 750M | 384 | 3.0 | MBPr15 2014 | Kepler (2012) | 0.5 |
| M1000M | 512 | 5.0 | ZBook 15 G3 | Maxwell (2014) | 1 |
| GTX 960 | 1024 | 5.2 | E5-2696 v3 | Maxwell (2014) | 2 |
| Jetson Nano | 128 | 5.3 | | Maxwell (2014) | 4 |
| T4 | 2560 | 7.5 | Google Colab | Turing (2018) | 16 |
| RTX 3060 Ti | 4864 | 8.6 | i7-8700 | Ampere (2020) | 8 |
| RTX 3070 Ti | 6144 | 8.6 | i3-10100 | Ampere (2020) | 8 |

Only two of my own eight GPUs are supported by the Triton GPU compiler. How about a newer GPU? At least I can use the T4 in Google Colab for free: the training takes one hour, and you get two hours for free. Ollama only needs CUDA Compute Capability 5.0 and can therefore run on five of my graphics cards - plus the RX 6600 with ROCm and a hack.
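
The counting can be reproduced directly from the table above (thresholds: CC 7.0 for Triton, CC 5.0 for ollama; the T4 lives in Google Colab, so it does not count as mine):

```python
# CC values copied from the table above
my_gpus = {"Quadro FX 580": 1.1, "GTX 650": 3.0, "GT 750M": 3.0, "M1000M": 5.0,
           "GTX 960": 5.2, "Jetson Nano": 5.3, "RTX 3060 Ti": 8.6, "RTX 3070 Ti": 8.6}

triton_ok = [gpu for gpu, cc in my_gpus.items() if cc >= 7.0]  # Triton: CC >= 7.0
ollama_ok = [gpu for gpu, cc in my_gpus.items() if cc >= 5.0]  # ollama: CC >= 5.0

print(f"Triton: {len(triton_ok)} of {len(my_gpus)} -> {triton_ok}")
print(f"ollama: {len(ollama_ok)} of {len(my_gpus)} -> {ollama_ok}")
```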

Triton Compatibility (supported hardware):

My AMD RX 470, RX 580 and RX 6600 are not supported by ROCm, even though the 6600 already uses RDNA 2. The RX 6600 can be used if the LLVM target is overridden to be gfx1030 instead of gfx1032. The ROCm installation needs 30 GB! In this regard Nvidia has been ahead of the game for quite some time with its proprietary CUDA, introduced in 2007. Support for the first Tesla GPUs with Compute Capability 1.1 was only dropped with CUDA SDK 7.0 in 2015. The current CUDA SDK 12.0 (since 2022) requires a CC of 5.0 (Maxwell and newer, i.e. 2014 onwards) - and that's true for ollama, too. In 2024 that is ten-year-old hardware.
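
The hack for the RX 6600 boils down to overriding the reported LLVM target via an environment variable before the ROCm runtime starts. A sketch of how I would set it from Python (assuming a ROCm build of PyTorch; the same variable can just as well be exported in the shell):

```python
import os

# The RX 6600 reports gfx1032, for which the ROCm builds ship no kernels.
# Pretend it is a gfx1030 (the officially supported RDNA 2 target) before
# the ROCm runtime initializes - i.e. before importing torch.
os.environ["HSA_OVERRIDE_GFX_VERSION"] = "10.3.0"

import torch  # ROCm build of PyTorch (assumption: installed from the rocm wheel index)

print(torch.cuda.is_available())      # ROCm devices are exposed through the cuda API
print(torch.cuda.get_device_name(0))  # should report the RX 6600
```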

Inference on local hardware

In early 2023 I ran an 8B model with 4-bit quantization on my MacBook Pro at SSIS. It was impressive to see what’s possible with 8 GB of RAM on a laptop! It became obvious that you need more RAM for larger models, so I built a new workstation with 128 GB RAM and an 18-core E5-2696 v3 CPU in early 2024. Well, it became another learning experience:

performance

Turns out that the token generation rate is inversely proportional to the size of the model, and proportional to the RAM bandwidth. A large model might still fit into your RAM or VRAM, but the larger the model, the slower the answer will be. The graph above covers quantizations from 4-bit to fp16, yet the token generation (TG) speed is not determined by the number of parameters or the speed of the GPU, but by the model size in RAM - at least for TG. Not a new insight: on llama.cpp there are conversations and graphs related to this topic and Apple hardware. No wonder I only get 0.2 tokens/s for the larger 70B model when using just DDR3 ECC RAM.

PP (prompt processing) and TG (token generation) for Apple hardware

I found 20 tokens/s and faster to be a usable speed for working with an LLM, and looking at the graph you can see what hardware you will need. CPUs are out of the question. Both the RX 6600 and the RTX 3060 Ti have 8 GB of VRAM; I got the RX 6600 for $130 and the RTX 3060 Ti for $200. To get the same tokens/s with a 70B model that I get with 8B models, I would need an RTX 6000 Ada with 48 GB of VRAM for $6,000. And even that is by far not enough for a 405B model. Yet the possible accuracy would be nice:

accuracy

Measurements done by Meta.

Correlation between model size and TG (token generation) speed

After some test runs with ollama, reading documentation and looking at the results of other people’s tests, there seems to be a simple relationship between the token generation speed $T$ in tokens/s, the RAM bandwidth $B$ in GB/s and the model size $M$ in RAM in GB:

\( T = \frac{B}{M} \)
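
A small sketch of this rule of thumb; the bandwidth and bits-per-weight numbers below are rough spec-sheet assumptions, not measurements, and the formula is an upper bound that ignores compute limits and whether the model even fits into VRAM:

```python
def model_size_gb(params_billion: float, bits_per_weight: float = 4.5) -> float:
    """Approximate RAM footprint of a quantized model (Q4 quants land around 4.5 bits/weight)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

def tokens_per_second(bandwidth_gb_s: float, size_gb: float) -> float:
    """T = B / M: every generated token streams the whole model through RAM once."""
    return bandwidth_gb_s / size_gb

# Rough peak-bandwidth assumptions: quad-channel DDR3 ~50 GB/s,
# RTX 3060 Ti ~448 GB/s (GDDR6), M2 Ultra ~800 GB/s.
for name, bw in [("DDR3 workstation", 50), ("RTX 3060 Ti", 448), ("M2 Ultra", 800)]:
    for params in (8, 70):
        size = model_size_gb(params)
        print(f"{name:16s} {params:3d}B (~{size:5.1f} GB): "
              f"up to {tokens_per_second(bw, size):6.1f} tokens/s")
```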

The graph for Apple Silicon above seems to become non-linear above 400 GB/s. AnandTech tested the memory bandwidth and found that the CPU can’t use all of it (M1 128 bit wide, M2 Pro 256 bit wide, M4 Max 512 bit wide, M2 Ultra 1024 bit wide), since the eight 128-bit LPDDR5 controllers have to move the data across the chip to the GPU. See here the four controllers on the M1 Max chip:

M1 Max chip

The two M1 Max chips are connected with some 10,000 traces on the 2.5D packaging interposer, giving 2.5 TB/s of bandwidth between them. This should be plenty for the “just” 0.8 TB/s memory bandwidth, but maybe the data is not always placed as well as one would want, or a better driver could improve the speed, so that each GPU core works mostly on its own dedicated RAM segment and little data has to be moved over the UltraFusion interface. AnandTech wrote about this technology in 2022. Another test in 2023 saw only 240 GB/s for the M2 Ultra - a limit for the CPU?
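
The headline numbers follow directly from bus width times memory speed. A short sketch, assuming LPDDR5-6400 (6.4 GT/s) across the whole lineup, which is roughly right for the M2 generation:

```python
def peak_bandwidth_gb_s(bus_width_bits: int, transfer_rate_gt_s: float = 6.4) -> float:
    """Peak bandwidth = bus width in bytes * transfers per second (LPDDR5-6400 assumed)."""
    return bus_width_bits / 8 * transfer_rate_gt_s

for chip, bits in [("M2", 128), ("M2 Pro", 256), ("M2 Max", 512), ("M2 Ultra", 1024)]:
    print(f"{chip:9s} {bits:5d}-bit -> {peak_bandwidth_gb_s(bits):6.1f} GB/s")
```

The 1024-bit bus indeed works out to roughly the “just” 0.8 TB/s mentioned above.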

History