ML - machine learning
This is just documentation of my learning progress.
2018 - Start with Object Detection
Inspired by object detection for cars with DarkNet (see this TED talk from 2017 by Joseph Redmon) and David’s bachelor project at HCMUTE involving a car, I started to learn more about machine learning at the end of 2018.
PoseNet runs on TensorFlow Lite in the browser via WebGL, even on a smartphone. We tested it in December 2018 in Seoul, Korea. In March 2019 I got TensorFlow.js running on my RX 470 at 43 fps.
In 2019 NVIDIA announced the Jetson Nano developer kit, and with students from AISVN we tried to win one in a competition. Eventually we ordered a package.
In early 2020 supply chain issues delayed the order, but we finally got the hardware. Now it needed to be put together - and development stalled until 2024.
Facemesh example
Schedule for 2020
In this article Harsheev Desai describes his journey to becoming a certified TensorFlow Developer in five months.
1. Learn Python
2. Learn Machine Learning Theory
- Coursera’s Machine Learning course, covering statistics, calculus and linear algebra
3. Learn Data Science Libraries
Some of these libraries are Pandas (data manipulation and analysis), NumPy (multi-dimensional arrays and matrices), Matplotlib (plotting) and Scikit-learn (creating ML models); a short sketch touching all four follows after this list.
4. Deep Learning Theory
5. TensorFlow Certificate
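As a taste of step 3, here is a minimal sketch (my own illustration, not from the article) that touches all four libraries using scikit-learn’s built-in iris dataset:

```python
# Minimal tour of the step-3 libraries: Pandas, NumPy, Matplotlib, Scikit-learn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Pandas: load the iris data into a DataFrame and inspect it
iris = load_iris(as_frame=True)
df = iris.frame
print(df.describe())

# Scikit-learn: train a simple classifier on a train/test split
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))

# Matplotlib: plot two of the features, colored by class
plt.scatter(df["sepal length (cm)"], df["petal length (cm)"], c=iris.target)
plt.xlabel("sepal length (cm)")
plt.ylabel("petal length (cm)")
plt.show()
```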
One reason to pick TensorFlow can be seen in this graph of its popularity on Stack Overflow:
More about the certificate here on Medium. It was introduced in March 2020 but was discontinued by 2024.
2022 - Teach ML in Unit 5 of Advanced Automation at SSIS
As covered in an SSIS Stories article in March 2022, we made great progress in creating our own neural networks, training them and then running inference on them. See also our website.
Thinking about possible learning experiences, we tried a few with our students:
- Create your own neural network, generate training data, train your model (tracking the loss on a test split) and then use the trained model for inference; a minimal sketch follows after this list. This was part of the SSIS course Advanced Automation https://github.com/ssis-aa/machine-learning
- Image classification: select training data (for example photos of seagulls) and train an ML model in Xcode on your Mac to properly identify your test set of images. This was part of the SSIS course App Development
- Build your own GPT. A fantastic course by Andrej Karpathy with his nanoGPT model guides you in a 2-hour video to create endless Shakespeare. With Google’s Colab Jupyter notebooks you can train your model in the cloud for free, even without your own GPU.
- Run your own local LLM. With Meta providing the weights for their Llama models with 8b, 70b and 405b parameters, it is possible in 2024 to run an LLM on your local CPU or GPU. Of course there are limitations in speed and VRAM size, but that’s part of the learning. Ollama is a good starting point.
- Update your local LLM with RAG (Retrieval Augmented Generation) by linking your documents in `open-webui/backend/data/docs`
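For the first bullet, here is a minimal sketch of that loop - generate data, train against a loss with a held-out test split, then run inference. It assumes TensorFlow/Keras and is an illustration, not the actual course notebook:

```python
# Sketch of the generate-train-infer loop from the first bullet
import numpy as np
import tensorflow as tf

# Generate training data: learn y = 2x + 1 with a bit of noise
x = np.random.uniform(-1, 1, size=(1000, 1)).astype("float32")
y = (2 * x + 1 + np.random.normal(scale=0.05, size=(1000, 1))).astype("float32")

# A tiny network, trained with a loss and evaluated on a held-out test split
model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(x[:800], y[:800], epochs=50, verbose=0)
print("test loss:", model.evaluate(x[800:], y[800:], verbose=0))

# Inference with the trained model
print("prediction for x=0.5:", model.predict(np.array([[0.5]], dtype="float32"), verbose=0)[0, 0])
```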
2024 - Start with LLMs
Andrej Karpathy offers a step-by-step guide to building your own Generative Pre-trained Transformer (GPT), starting with 1,000,000 characters of Shakespeare that you can train on your own GPU. Well, at least if it supports CUDA Compute Capability >= 7.0, otherwise the Triton compiler throws an error (like on my slightly older GTX 960):
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
RuntimeError: Found NVIDIA GeForce GTX 960 which is too old to be supported by the triton GPU compiler, which is used
as the backend. Triton only supports devices of CUDA Capability >= 7.0, but your device is of CUDA capability 5.2
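One way around this, if you simply want training to run on an older card, is to check the compute capability and skip torch.compile() (and with it Triton) below CC 7.0. A minimal sketch:

```python
# Only hand the model to torch.compile (and its Triton backend) on GPUs that support it
import torch

model = torch.nn.Linear(10, 10).cuda()  # stand-in for the actual GPT model

major, minor = torch.cuda.get_device_capability(0)
if major >= 7:
    model = torch.compile(model)  # Triton needs CUDA Compute Capability >= 7.0
else:
    print(f"CC {major}.{minor}: skipping torch.compile, running in eager mode")
```

In nanoGPT the same effect should be achievable by setting `compile = False` in the training config (or passing `--compile=False` to train.py).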
Let’s see what I have and what CUDA Compute Capabilities (CC) these support:
| GPU name      | Cores | CC  | Host         | Architecture   | RAM (GB) |
|---------------|-------|-----|--------------|----------------|----------|
| Quadro FX580  | 32    | 1.1 | hp Z600      | Tesla (2006)   | 0.5      |
| GTX 650       | 384   | 3.0 | E3-1226 v3   | Kepler (2012)  | 1        |
| GT750M        | 384   | 3.0 | MBPr15 2014  | Kepler (2012)  | 0.5      |
| M1000M        | 512   | 5.0 | Zbook 15 G3  | Maxwell (2014) | 1        |
| GTX960        | 1024  | 5.2 | E5-2696 v3   | Maxwell (2014) | 2        |
| Jetson Nano   | 128   | 5.3 |              | Maxwell (2014) | 4        |
| T4            | 2560  | 7.5 | Google Colab | Turing (2018)  | 16       |
| RTX3060 Ti    | 4864  | 8.6 | i7-8700      | Ampere (2020)  | 8        |
| RTX3070 Ti    | 6144  | 8.6 | i3-10100     | Ampere (2020)  | 8        |
Only two of my eight cards are supported by the Triton GPU compiler. How about a newer GPU? At least I can use the T4 in Google’s Colaboratory for free: the training takes one hour, and you get two hours for free. Ollama only needs CUDA Compute Capability 5.0 and can therefore run on five of my graphics cards - plus the RX 6600 with ROCm and a hack.
Triton Compatibility (supported hardware):
- NVIDIA GPUs (Compute Capability 7.0+)
- AMD GPUs (ROCm 5.2+)
- Under development: CPUs
My AMD RX 470, RX 580 and RX 6600 are too old to be supported by ROCm, even though the 6600 already uses RDNA2. The RX 6600 can be used if the LLVM target is overridden to gfx1030 instead of gfx1032 (see the sketch below). The ROCm installation needs 30 GB! In this regard Nvidia has been ahead of the game for some time now with their proprietary CUDA, available since 2007. Support for the first Tesla GPUs with Compute Capability 1.1 was only dropped with CUDA SDK 7.0 in 2015. The current CUDA SDK 12.0 (since 2022) requires a CC of 5.0 (Maxwell and newer, since 2014). That’s true for ollama, too. In 2024 that’s 10-year-old hardware.
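For the gfx1032 workaround mentioned above, the usual approach is to set the ROCm override variable HSA_OVERRIDE_GFX_VERSION before the runtime initializes - a sketch, assuming a ROCm build of PyTorch:

```python
# Report the RX 6600 (gfx1032) as gfx1030 so the prebuilt ROCm kernels load
import os
os.environ["HSA_OVERRIDE_GFX_VERSION"] = "10.3.0"  # must be set before the ROCm runtime starts

import torch  # ROCm builds of PyTorch expose HIP devices through the CUDA API
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))
```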
Inference on local hardware
In early 2023 I ran an 8b model with 4-bit quantization on my MacBook Pro at SSIS. It was impressive to see what’s possible with 8 GB of RAM on a laptop! It became obvious that you need more RAM for larger models, so I built a new workstation with 128 GB RAM and an 18-core E5-2696 v3 CPU in early 2024. Well, it became another learning experience:
It turns out that the token generation rate is inversely proportional to the size of the model, and proportional to the RAM bandwidth. A large model might fit into your RAM or VRAM, but the larger the model, the slower the answer will be. The graph above covers quantizations from 4-bit to fp16, yet the token generation (TG) speed is not determined by the number of parameters or the speed of the GPU, but by the model size in RAM. Not a new insight: on llama.cpp there are conversations and graphs related to this topic and Apple hardware. No wonder I get only 0.2 tokens/s for the larger 70b model when using only DDR3 ECC RAM.
I found 20 tokens/s and faster to be a usable speed for an LLM, and looking at the graph you can see what hardware you will need. CPUs are out of the question. Both the RX 6600 and the RTX 3060 Ti have 8 GB of VRAM; I got the RX 6600 for $130 and the RTX 3060 Ti for $200. To get the same tokens/s with a 70b model that I get with 8b models, I would need an RTX A6000 Ada with 48 GB of VRAM for $6000. And even that is by far not enough for a 405b model. Yet the possible accuracy would be nice:
Measurements done by Meta.
Correlation of model size and token generation (TG)
After some test runs with ollama, reading documentation and looking at the results of other people’s tests, there seems to be a simple relationship between the token generation speed $T$ (tokens/s), the RAM bandwidth $B$ in GB/s and the model size $M$ in RAM in GB:
$$ T = \frac{B}{M} $$
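A quick sanity check of this rule of thumb in Python (the bandwidth figures are nominal spec values, and the result is a ceiling - real-world rates come out lower, as the 0.2 tokens/s above shows):

```python
# Upper-bound estimate of token generation speed from T = B / M
def tokens_per_second(bandwidth_gb_s, params_billion, bits_per_weight=4):
    model_size_gb = params_billion * bits_per_weight / 8  # weights only, ignores KV cache
    return bandwidth_gb_s / model_size_gb

print(tokens_per_second(448, 8))   # RTX 3060 Ti (448 GB/s), 8b at q4:  ~112 tokens/s ceiling
print(tokens_per_second(60, 70))   # quad-channel DDR3 (~60 GB/s), 70b at q4: ~1.7 tokens/s ceiling
```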
The Apple Silicon graph above seems to stop being linear above 400 GB/s. AnandTech tested the memory bandwidth and found that the CPU can’t use all of it (M1 128 bit wide, M2 Pro 256 bit wide, M4 Max 512 bit wide, M2 Ultra 1024 bit wide), since the 8 LPDDR5 128-bit controllers have to move the data across the chip to the GPU. See the 4 controllers on the M1 Max chip here:
In the M1 Ultra, two M1 Max chips are connected with some 10,000 traces on the 2.5D packaging interposer for 2.5 TB/s of bandwidth. This should be plenty for the “just” 0.8 TB/s memory bandwidth, but maybe the data is not always laid out as well as one would want, or a better driver would improve the speed, so that the GPU cores work on their own dedicated RAM segments and little data has to be moved over the UltraFusion interface. AnandTech wrote about this technology in 2022. Another test in 2023 only saw 240 GB/s for the M2 Ultra - a limit for the CPU?
History
- October 2018 Successfully installed darknet on Ubuntu, object detection works for stills. I don’t have a webcam, so video does not work yet.
- December 2018 TensorFlow Lite in the browser on my iPhone 7 runs at 6 fps, demonstrated in Seoul
- March 2019 PoseNet runs in the browser on the new RX 470 at 43 fps
- December 2019 A new competition, the AI at the Edge Challenge, starts on hackster.io where you can win a Jetson Nano. I apply and eventually just buy one from Arrow
- February 2020 The Jetson car is purchased, along with a WiFi module and a 7” display. It still needs to be completed - without students due to COVID-19
- July 2024 Reactivated the https://kreier.github.io/jetson-car/ project. The hardware is from 2019 (NVIDIA) but the software is still Ubuntu 18.04 LTS. Updates break simple things like `make` and `gcc`.
- August 2024 Started to work on https://kreier.github.io/nano-gpt/ to learn more about LLMs, following Andrej Karpathy’s project https://github.com/karpathy/nanogpt
- December 2024 Local Proxmox server with i7-8700 and RTX 3060 Ti running llama3.1:8b in ollama over open-webui and tailscale
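To talk to that llama3.1:8b instance from a script, here is a minimal sketch against ollama’s REST API (assuming the default port 11434 and that the model is already pulled):

```python
# Query a local ollama server over its REST API (default port 11434)
import json
import urllib.request

payload = {
    "model": "llama3.1:8b",
    "prompt": "Explain CUDA compute capability in one sentence.",
    "stream": False,
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```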