
ML - machine learning

This is just a documentation of my learning progress.

2018 - Start with Object Detection

Inspired by object detection for cars with DarkNet (see this TED talk from 2017 by Joseph Redmon) and by David's bachelor thesis at HCMUTE in connection with a car, I started to learn more about machine learning at the end of 2018.

PoseNet runs with TensorFlow.js on WebGL in the browser, even on a smartphone. We tested it in December 2018 in Seoul, Korea. In March 2019 I got TensorFlow.js running on my RX 470 at 43 fps.

PoseNet in the park

In 2019 NVIDIA announced the Jetson Nano developer kit, and with students from AISVN we tried to win one in a competition. Eventually we ordered a package.

Jetson Nano car

In early 2020 supply chain issues delayed orders, but we finally got the hardware. Now it needed to be put together - and development stalled until 2024.

Facemesh example

Schedule for 2020

In this article Harsheev Desai describes his journey to becoming a certified TensorFlow Developer in 5 months.

1. Learn Python

2. Learn Machine Learning Theory

3. Learn Data Science Libraries

Some of these libraries are Pandas (data manipulation and analysis), NumPy (support for multi-dimensional arrays and matrices), Matplotlib (plotting) and Scikit-learn (creating ML models) - see the short sketch after this list.

4. Deep Learning Theory

5. TensorFlow Certificate
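
For step 3, here is a minimal sketch of how these four libraries play together - the tiny house-price dataset and the linear model are just made up for illustration:

```python
# Load data with Pandas, compute with NumPy, fit a model with Scikit-learn, plot with Matplotlib.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# toy dataset: house size in m^2 vs. price (made-up numbers)
df = pd.DataFrame({"size_m2": [30, 50, 70, 90, 110],
                   "price_k": [100, 160, 220, 280, 340]})

X = df[["size_m2"]].to_numpy()    # NumPy array as model input
y = df["price_k"].to_numpy()

model = LinearRegression().fit(X, y)   # fit a simple ML model
print("predicted price for 80 m^2:", model.predict(np.array([[80]]))[0])

plt.scatter(df["size_m2"], df["price_k"])   # data points
plt.plot(df["size_m2"], model.predict(X))   # fitted line
plt.xlabel("size in m^2")
plt.ylabel("price in k$")
plt.show()
```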

One reason for choosing TensorFlow can be seen in this graph of popularity on Stack Overflow:

More about the TensorFlow certificate here on Medium. It was launched in March 2020 but was discontinued by 2024. The data science hype, once reflected on platforms like towardsdatascience and medium.com, has subsided. The focus has shifted to innovations like transformers and ChatGPT, which have been the “hot new thing” since 2022. Nevertheless, I took the opportunity to learn Python, Pandas, NumPy, Matplotlib, Deep Learning, Machine Learning, and Neural Networks along the way. NumPy and Pandas have now surpassed TensorFlow and PyTorch on Stack Overflow.

2022 - Teach ML in Advanced Automation at SSIS in Unit 5

As covered in an SSIS Stories article in March 2022, we made great progress in creating our own neural network, training it and then doing inference with it. See also our website.

If you think about possible learning experiences, we tried a few with our students:

2024 - Start with LLMs

Andrej Karpathy offers a step-by-step guide to build your own Generative Pre-trained Transformer (GPT), starting with 1,000,000 characters from Shakespeare that you can train on your own GPU. Well, at least if it supports CUDA Compute Capability >= 7.0; otherwise the Triton compiler throws an error (like on my slightly older GTX 960):

torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
RuntimeError: Found NVIDIA GeForce GTX 960 which is too old to be supported by the triton GPU compiler, which is used
as the backend. Triton only supports devices of CUDA Capability >= 7.0, but your device is of CUDA capability 5.2
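
A quick check before starting a training run can save some time; here is a small sketch using PyTorch's CUDA API (in nanoGPT the training can then be started without torch.compile, which should skip Triton entirely):

```python
import torch

# Triton (the default backend for torch.compile) needs CUDA Compute Capability >= 7.0
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print(f"{torch.cuda.get_device_name(0)}: compute capability {major}.{minor}")
    if (major, minor) < (7, 0):
        print("Too old for Triton - train without torch.compile (in nanoGPT: --compile=False)")
else:
    print("No CUDA device found")
```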

Let’s see what I have and which CUDA Compute Capability (CC) each card supports:

| GPU          | Cores | CC  | Host         | Architecture   | RAM (GB) |
|--------------|------:|----:|--------------|----------------|---------:|
| Quadro FX580 | 32    | 1.1 | HP Z600      | Tesla (2006)   | 0.5      |
| GTX 650      | 384   | 3.0 | E3-1226 v3   | Kepler (2012)  | 1        |
| GT 750M      | 384   | 3.0 | MBPr15 2014  | Kepler (2012)  | 0.5      |
| M1000M       | 512   | 5.0 | ZBook 15 G3  | Maxwell (2014) | 2        |
| GTX 960      | 1024  | 5.2 | E5-2696 v3   | Maxwell (2014) | 2        |
| Jetson Nano  | 128   | 5.3 |              | Maxwell (2014) | 4        |
| T4           | 2560  | 7.5 | Google Colab | Turing (2018)  | 16       |
| RTX 3060 Ti  | 4864  | 8.6 | i7-8700      | Ampere (2020)  | 8        |
| RTX 3070 Ti  | 6144  | 8.6 | i3-10100     | Ampere (2020)  | 8        |

Only two of my eight cards are supported by the Triton GPU compiler. How about a newer GPU? At least I can use the T4 in Google Colaboratory for free - the training takes one hour, and you get two hours for free. Ollama only needs CUDA Compute Capability 5.0 and can therefore run on five of my graphics cards, plus the RX 6600 with ROCm and a hack.

Triton Compatibility (supported hardware):

My AMD RX 470, RX 580 and RX 6600 are not supported by ROCm, even though the 6600 already uses RDNA2. The RX 6600 can be used if the LLVM target is overridden to gfx1030 instead of gfx1032. The ROCm installation needs 30 GB! In this regard Nvidia has been ahead of the game for some time now with its proprietary CUDA, available since 2007. Support for the first Tesla GPUs with Compute Capability 1.1 was only dropped with CUDA SDK 7.0 in 2015. The current CUDA SDK 12.0 (since 2022) requires a CC of 5.0 (Maxwell and newer, since 2014). That's true for ollama, too. In 2024 that's 10-year-old hardware.
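
The gfx1030 override mentioned above is usually done through an environment variable; a minimal sketch, assuming a ROCm build of PyTorch is installed:

```python
import os

# The RX 6600 reports gfx1032, for which ROCm ships no kernels.
# Pretending to be gfx1030 (RX 6800/6900 family) makes the ROCm libraries load.
# The variable must be set before the GPU runtime is initialized.
os.environ["HSA_OVERRIDE_GFX_VERSION"] = "10.3.0"

import torch  # the ROCm build of PyTorch exposes the GPU through the torch.cuda API

if torch.cuda.is_available():
    print("GPU found:", torch.cuda.get_device_name(0))
else:
    print("No ROCm GPU detected")
```

The same override is reported to work for ollama as well, since it also goes through the ROCm runtime.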

Inference on local hardware

In early 2023 I ran an 8B parameter model with 4-bit quantization on my MacBook Pro at SSIS. It was impressive to see what's possible with just 8 GB of RAM on a laptop! It became obvious that you need more RAM for larger models, so I built a new workstation with 128 GB RAM and an 18-core E5-2696 v3 CPU in early 2024. Well, it became another learning experience:

performance

It turns out that the token generation rate is inversely proportional to the size of the model - or, put differently, the time to generate one token of the answer (TG) grows with the model size and is limited by the RAM speed. A large model might fit into your RAM or VRAM, but the larger the model, the slower the answer will be. The graph above covers quantizations from int4 to fp16, yet the TG speed is not determined by the number of parameters or the speed of the GPU, but by the model size in RAM. This is not a new insight; on llama.cpp there are conversations and graphs on this topic for Apple hardware. No wonder I get only 0.2 tokens/s for the larger 70B parameter model when using only DDR3 ECC RAM. That 4-bit quantized models are almost as accurate as the full fp16 ones was shown in a paper from 2023-02-28 (The case for 4-bit precision: k-bit Inference Scaling Laws). At a quarter of the size you can fit a model with 4x the parameters into RAM, or run the same model 4x faster - an 8B model, for example, shrinks from roughly 16 GB at fp16 to about 4-5 GB at int4. That matters, since RAM size and RAM speed are both expensive.

PP and TG for Apple hardware

I found 20 tokens/s and faster to be a usable speed for working with an LLM, and looking at the graph you can see what hardware you will need. CPUs are out of the question. Both the RX 6600 and the RTX 3060 Ti have 8 GB of RAM; I got the RX 6600 for $130 and the RTX 3060 Ti for $200. To get the same tokens/s that I have with 8B models, but for a 70B model, I would need an RTX 6000 Ada with 48 GB of RAM for $6000. And even that is by far not enough for a 405B model. Yet the possible accuracy would be nice:

accuracy

The measurements above were done by Meta.

Correlation between model size and token generation (TG) speed

After some test runs with ollama in October 2024, reading documentation, and looking at the results of other people's tests, it seems there is a simple relationship between the token generation speed $T$, the RAM bandwidth $B$ in GB/s, and the model size $M$ in RAM in GB. I found an almost linear fit with a factor of 1.1, here simplified to 1:
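
$$ T \approx \frac{B}{M} $$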

This approximately linear relation can be seen in the Apple Silicon graph in the previous paragraph, although it no longer seems to hold above 400 GB/s. My experiments show an almost linear relationship if you convert tokens/s back into time per token:

time per token
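
As a small sketch of this rule of thumb: the bandwidth values are taken from the memory table further down, while the 4-bit model sizes (8B ≈ 5 GB, 70B ≈ 40 GB) are rough assumptions for illustration, not measurements.

```python
# Rule of thumb from above: tokens/s ≈ RAM bandwidth / model size in RAM.
def tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    return bandwidth_gb_s / model_size_gb

examples = [
    ("RTX 3070 Ti, 8B Q4",   608, 5),    # bandwidth in GB/s, model size in GB
    ("M3 Max, 70B Q4",       409, 40),
    ("RTX 6000 Ada, 70B Q4", 960, 40),
]

for name, bandwidth, size in examples:
    tps = tokens_per_second(bandwidth, size)
    print(f"{name}: ~{tps:.0f} tokens/s, ~{1000 / tps:.0f} ms per token")
```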

What about the M-series chips from Apple? AnandTech tested the memory bandwidth of the Ultra chip and found that the CPU can't use all of the memory bandwidth (M1 128 bit wide, M2 Pro 256 bit wide, M4 Max 512 bit wide, M2 Ultra 1024 bit wide). Maybe the reason is that the eight 128-bit LPDDR5 controllers have to move the data across the chip to the GPU in some instances. Here is a die picture of just the M1 Max chip - see how much area is used just for the memory controllers:

The two M1 Max chips are connected with some 10,000 traces on the 2.5D packaging interposer for 2.5 TB/s of bandwidth. This should be enough for the “just” 0.8 TB/s memory bandwidth, but maybe the data is not always laid out as well as intended, or a better driver could improve the speed there, so that the GPU cores work on their dedicated RAM segment and little data has to be moved over the UltraFusion interface. AnandTech wrote about this technology in 2022. Another test in 2023 only saw 240 GB/s for the M2 Ultra - a limit on the CPU side? Anyway, here are my findings for memory speed in a table:

| CPU        | Memory              | GB/s | GPU                | Memory        | GB/s |
|------------|---------------------|-----:|--------------------|---------------|-----:|
| E3-1226 v3 | DDR3 1333           | 22   | Jetson Nano 4GB    | LPDDR4 64bit  | 25   |
| i7-8700    | DDR4 2666           | 35   | Quadro M1000M 2GB  | GDDR5 128bit  | 80   |
| i7-13700T  | DDR4 3200 128bit    | 43   | P106-100 6GB       | GDDR5 192bit  | 192  |
| Apple M1   | LPDDR4X 4266        | 66   | RTX 3070 Ti 8GB    | GDDR6X 256bit | 608  |
| M3 Max     | LPDDR5 6400 512bit  | 409  | P100 16GB          | HBM2 4096bit  | 732  |
| M4 Max     | LPDDR5X 8533        | 546  | RTX 6000 Ada 48GB  | GDDR6 384bit  | 960  |

And while it was news to me, this limit on LLM response time has long been known in the industry, and there are some novel ideas on how to circumvent the “latency bottleneck”.

Faster inference with speculative execution

Just from reading about the process and analyzing my findings, this approach seems obvious. For each token the entire model has to be loaded from VRAM into the cache of the GPU and processed, but most of the time the GPU is just waiting for new data to arrive. If we had a good guess for the next token, we could process the extended prompt in the same pass with no measurable increase in the time to generate a token - and we would get 2 tokens instead of 1! Here are some papers about this:

The above papers indicate that 2x or even 3x would be possible. I think you need very good conditions to achieve that result - so let's look at it from a probability standpoint. One factor is that the speculation of further tokens itself takes a considerable amount of time, so you want the speculative model to be fast (and small). On the other hand you want a high success rate. I played around with some parameters in a Google Sheet, set the success rate to 90% and made the speculative model 20x faster/smaller. In this case the large model produces a token every 100 ms and the speculative model one every 5 ms. Here is the result:

I can’t get to 2x with these values. I would need a much smaller model that is 50x smaller/faster than the large model to hit 2.37x.
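
A minimal sketch of this calculation for the one-token-lookahead case described above (the spreadsheet model may differ in details, so the numbers will not match exactly):

```python
# Expected speedup when a small model guesses one token ahead:
# each round costs t_draft + t_large, because the large model verifies the
# guess and produces its own next token in the same pass. With probability
# p the guess is accepted (2 tokens per round), otherwise only 1 token.

def expected_speedup(t_large_ms: float, t_draft_ms: float, p_accept: float) -> float:
    tokens_per_round = 1 + p_accept               # expected tokens per round
    time_per_round = t_large_ms + t_draft_ms      # ms spent per round
    return tokens_per_round * t_large_ms / time_per_round

# numbers from the experiment above: 100 ms per large-model token,
# a 20x faster speculative model (5 ms) and a 90% success rate
print(f"{expected_speedup(100, 5, 0.9):.2f}x")    # ~1.81x, indeed below 2x
```

Drafting more than one token per round changes the math, which is presumably how the papers get to their 2-3x figures.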

Lessons learned so far

History