Create a generative pre-trained transformer (GPT) neural network, following the video by Andrej Karpathy from January 2023. ChatGPT had been released by OpenAI only a few weeks earlier, on November 30, 2022. Andrej is one of the co-founders of OpenAI. The first model, GPT-1, was released in June 2018.
Video: Let’s build GPT: from scratch, in code, spelled out. (January 2023, length 1:56:19), which generates infinite Shakespeare.
While the project can run even on a CPU, for best results you would use a GPU. But not just any GPU: it needs to be from Nvidia to use the CUDA capabilities of torch. The software stack for AMD is still a work in progress, and I have several Nvidia GPUs. To test your own setup under Windows with WSL and Ubuntu 22.04 LTS (don’t use 24.04 LTS, since it ships Python 3.12, which is a little too new), run the following few lines:
pip install torch numpy transformers datasets tiktoken wandb tqdm
git clone https://github.com/karpathy/nanogpt
cd nanogpt
python data/shakespeare_char/prepare.py
python train.py config/train_shakespeare_char.py
python sample.py --out_dir=out-shakespeare-char
On my slightly older GTX 960 I get the following error after the training call:
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
RuntimeError: Found NVIDIA GeForce GTX 960 which is too old to be supported by the triton GPU compiler, which is used
as the backend. Triton only supports devices of CUDA Capability >= 7.0, but your device is of CUDA capability 5.2
Let’s see what I have and what CUDA capabilities these support:
GPU name | CUDA cores | Compute Capability | at | architecture | RAM GB |
---|---|---|---|---|---|
Quadro FX580 | 32 | 1.1 | hp Z600 | Tesla (2006) | 0.5 |
GTX 650 | 384 | 3.0 | hp Z600 | Kepler (2012) | 1 |
Jetson Nano | 128 | 5.3 | | Maxwell (2014) | 4 |
GT750M | 384 | 3.0 | MBPr15 2014 | Kepler (2012) | 0.5 |
M1000M | 512 | 5.0 | Zbook 15 G3 | Maxwell (2014) | 1 |
GTX960 | 1024 | 5.2 | E5-2696 v3 | Maxwell (2014) | 2 |
T4 | 2560 | 7.5 | Google Colab | Turing (2018) | 16 |
RTX3070 Ti | 6144 | 8.6 | i3 10100 | Ampere (2020) | 8 |
Only one of my 7 GPUs (the T4 belongs to Google Colab) is supported by the Triton GPU compiler. How about a newer GPU?
GeForce series | CUDA | Architecture | Process | Year |
---|---|---|---|---|
900 | 5.2 | Maxwell | 28HP | 2014 |
10 | 6.1 | Pascal | 16FF | 2016 |
16 | 7.5 | Turing | 12FFN | 2018 |
20 | 7.5 | Turing | 12FFN | 2018 |
30 | 8.6 | Ampere | 8LPP | 2020 |
40 | 8.9 | Ada Lovelace | 4N | 2022 |
50 | 12.0 | Blackwell | 4NP | 2025 |
Looks like I need at least a series 16 or 20 card, but probably a series 30 to be safe in case future compilers raise the minimum to 8.0.
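To check a card before starting a training run, the comparison from the error message can be done in a few lines. This is a minimal sketch: the 7.0 threshold is taken from the Triton error above, the helper names `supports_triton` and `describe_local_gpu` are my own, and the query uses `torch.cuda.get_device_capability`, which returns a `(major, minor)` tuple.

```python
# Minimum compute capability quoted in the Triton error message above.
MIN_TRITON_CAPABILITY = (7, 0)


def supports_triton(capability):
    """Return True if a (major, minor) compute capability meets the minimum."""
    return tuple(capability) >= MIN_TRITON_CAPABILITY


def describe_local_gpu():
    """Report the first CUDA device, or explain why none could be queried."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return "no CUDA device visible"
    name = torch.cuda.get_device_name(0)
    major, minor = torch.cuda.get_device_capability(0)
    verdict = "ok" if supports_triton((major, minor)) else "too old"
    return f"{name}: compute capability {major}.{minor} ({verdict} for Triton)"


# Values from the tables above: GTX 960, T4, RTX 3070 Ti.
for card, capability in [("GTX 960", (5, 2)), ("T4", (7, 5)), ("RTX 3070 Ti", (8, 6))]:
    print(card, supports_triton(capability))

print(describe_local_gpu())
```

Comparing tuples instead of floats avoids surprises with two-digit minor versions.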
This is just for my own sanity: a brief overview of the development of machine learning (ML), image classification, object detection, self-driving car expectations, the generative pre-trained transformer (GPT), AI, AGI, and general and generative AI. The first GPT was published by OpenAI in 2018, following the 2017 paper Attention Is All You Need.
Fei-Fei Li explains in her TED talk in 2015 that her group at Stanford started in 2007 to classify around 1,000,000,000 pictures and used Amazon Mechanical Turk with 48,940 workers in 167 countries to create the image database ImageNet. By 2009 the database https://www.image-net.org/ had 14,197,122 labeled images in 21,841 categories. And funding back then was a problem!
The initial approach was to look for features manually in the pictures (hard-coded) to classify them correctly. The top-5 error rate in 2010 was 28.2% with this approach. A deep neural network for deep learning would make big progress 2 years later!
To win the ImageNet 2012 challenge, AlexNet used GPUs for its convolutional neural network (CNN). It ran on CUDA. This competition, ILSVRC, started in 2010 and ran until 2017. The creator of ImageNet, Fei-Fei Li, gave an inspiring TED talk in 2015 that includes the AlexNet solution. Self-driving cars are mentioned at minute 1:30, as is how much a three-year-old can outperform a computer. And Elon Musk has been saying since 2014 that autonomous driving is just a year away.
A great talk about Deep Learning by Andrej Karpathy from September 25, 2016 explains the different layers in AlexNet and further developments in detail and with clarity. Here is an overview of the layers of AlexNet, followed by a benchmark of the FLOPs for the model on your current GPU:
layer | size | architecture | memory | parameter |
---|---|---|---|---|
INPUT | 227x227x3 | | 154,587 | 0 |
CONV1 | 55x55x96 | 96 11x11 filters at stride 4, pad 0 | 290,400 | 11,616 |
MAX POOL1 | 27x27x96 | 3x3 filters at stride 2 | 69,984 | 0 |
NORM1 | 27x27x96 | Normalization layer | 69,984 | 0 |
CONV2 | 27x27x256 | 256 5x5 filters at stride 1, pad 2 | 186,624 | 6,400 |
MAX POOL2 | 13x13x256 | 3x3 filters at stride 2 | 43,264 | 0 |
NORM 2 | 13x13x256 | Normalization layer | 43,264 | 0 |
CONV3 | 13x13x384 | 384 3x3 filters at stride 1, pad 1 | 64,896 | 3,456 |
CONV4 | 13x13x384 | 384 3x3 filters at stride 1, pad 1 | 64,896 | 3,456 |
CONV5 | 13x13x256 | 256 3x3 filters at stride 1, pad 1 | 43,264 | 2,304 |
MAX POOL3 | 6x6x256 | 3x3 filters at stride 2 | 9,216 | 0 |
FC6 | 4096 | 4096 neurons | 4,096 | 0 |
FC7 | 4096 | 4096 neurons | 4,096 | 0 |
FC8 | 1000 | 1000 neurons (class scores) | 1,000 | 0 |
Total | | | 1,049,571 | 27,232 |
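The sizes and memory numbers in the table follow from the usual output-size formula for convolution and pooling layers, (W - F + 2P) / S + 1. Here is a small sketch that recomputes them from the filter size, stride and padding listed above. The function and variable names are my own, and the two normalization layers are left out because they do not change the shape.

```python
def output_size(width, kernel, stride, pad=0):
    """Spatial output width of a conv or pooling layer: (W - F + 2P) / S + 1."""
    return (width - kernel + 2 * pad) // stride + 1


# (name, number of filters or None for pooling, kernel, stride, pad)
layers = [
    ("CONV1", 96, 11, 4, 0),
    ("MAX POOL1", None, 3, 2, 0),
    ("CONV2", 256, 5, 1, 2),
    ("MAX POOL2", None, 3, 2, 0),
    ("CONV3", 384, 3, 1, 1),
    ("CONV4", 384, 3, 1, 1),
    ("CONV5", 256, 3, 1, 1),
    ("MAX POOL3", None, 3, 2, 0),
]

width, depth = 227, 3
shapes = {"INPUT": (width, width, depth)}
for name, filters, kernel, stride, pad in layers:
    width = output_size(width, kernel, stride, pad)
    if filters is not None:  # pooling keeps the depth unchanged
        depth = filters
    shapes[name] = (width, width, depth)

for name, (w, h, d) in shapes.items():
    print(f"{name:10s} {w}x{h}x{d:<4d} {w * h * d:>8,d} activations")
```

The activation counts printed per layer are the values in the memory column, for example 55 x 55 x 96 = 290,400 for CONV1.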
Benchmark run with calflops (with this model we are far from the 22 TFLOPS possible):
(venv) mk@i3:~/test$ python pytorch-calflops.py
------------------------------------- Calculate Flops Results -------------------------------------
Notations:
number of parameters (Params), number of multiply-accumulate operations(MACs),
number of floating-point operations (FLOPs), floating-point operations per second (FLOPS),
fwd FLOPs (model forward propagation FLOPs), bwd FLOPs (model backward propagation FLOPs),
default model backpropagation takes 2.00 times as much computation as forward propagation.
Total Training Params: 61.1 M
fwd MACs: 714.188 MMACs
fwd FLOPs: 1.4297 GFLOPS
fwd+bwd MACs: 2.1426 GMACs
fwd+bwd FLOPs: 4.2892 GFLOPS
---------------------------------------------------------------------------------------------------
Alexnet FLOPs:1.4297 GFLOPS MACs:714.188 MMACs Params:61.1008 M
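Note that the numbers above are operation counts for one pass, not a throughput, even though the tool prints them with a "GFLOPS" label. A quick sketch of how the lines relate to each other, using only values from the output and the 21.75 TFLOPS FP32 peak of my 3070 Ti. The variable names are my own, and the passes-per-second figure is an idealized upper bound that assumes perfect utilization.

```python
fwd_macs = 714.188e6   # "fwd MACs" from the output above
fwd_flops = 1.4297e9   # "fwd FLOPs" from the output above
bwd_factor = 2.0       # backward pass costs 2.00x the forward pass (stated in the output)

# Forward plus backward is (1 + 2.0) times the forward pass.
fwd_bwd_macs = fwd_macs * (1 + bwd_factor)
fwd_bwd_flops = fwd_flops * (1 + bwd_factor)

# Idealized bound: forward passes per second at the FP32 peak of the 3070 Ti.
gpu_fp32_peak = 21.75e12
ideal_passes_per_second = gpu_fp32_peak / fwd_flops

print(f"fwd+bwd MACs:  {fwd_bwd_macs / 1e9:.4f} G")
print(f"fwd+bwd FLOPs: {fwd_bwd_flops / 1e9:.4f} G")
print(f"ideal forward passes per second: {ideal_passes_per_second:,.0f}")
```

The small difference to the reported 4.2892 in the last digit comes from rounding in the printed forward value.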
Source material from Stanford:
The TED Talk from August 2017 by Joseph Redmon from the University of Washington inspired ideas and possibilities. In his talk he discussed the application to self-driving cars. It certainly makes it imaginable, and it was just running on his laptop!
OpenAI released its first Generative Pre-trained Transformer 1 (GPT-1) following the transformer architecture published by Google in 2017. https://openai.com/index/language-unsupervised/
A student at HCMUTE used this YOLO (You Only Look Once) software with an Nvidia graphics card in his 2018 master’s thesis as a part of developing an autonomous car. But the GTX 980 with 2048 CUDA cores needs 165 watts of power to work. The power requirement is a challenge for a mobile application.
The crypto boom was over and graphics cards were available again, so I started with an RX 470 in March 2019. In October 2018 I had already successfully installed Darknet on Ubuntu. The new repository is https://github.com/kreier/ml
Nvidia had already created the Jetson TK1 platform in 2014, but it was power hungry and expensive. In 2019 the Jetson Nano was announced for just 100 dollars, affordable for student projects and using just 5-10 watts. We applied for a project at hackster.io. But ultimately we ordered a 4 GB development platform at the end of 2019 for student projects.
The drive base from TAE was delayed in early 2020 because of the starting COVID-19 pandemic and shipments from China. But eventually we got the drive base. Yet the project https://github.com/kreier/jetson-car stalled from 2020 to 2024.
The underlying model GPT-3.5 had already been released in March 2022. But to turn it into an assistant you can talk to, some fine-tuning had to be done. See the blog post https://openai.com/index/chatgpt/
Ever since the web interface became available with ChatGPT in November 2022, the world has taken note. It ran in news outlets and personal conversations worldwide, and within two years NVIDIA became the most valuable company in the world.
It seemed like the combination of object detection and classification, together with latency, was the main problem in making self-driving cars possible. As it turns out, there is much more to do. Additional sensors like LIDAR really help, even Tesla finally gave in. And here are some example videos to show what’s possible and what is still challenging:
- 8.6e19 FLOP
- 1.5e21 FLOP (in 2019 some $100k, possible in July 2024 for $672)
- 3.1e23 FLOP
- 2.1e25 FLOP estimated
- 1.9e18 FLOP - not even enough for GPT-1 (according to test with OpenCL), that would require 45 days!
- 1.75e16 FLOP (see below) or 800 seconds on my 3070 Ti

First run of bigram.py at 0:41:44 with 3000 iterations in just a few seconds:
batch_size = 32
block_size = 8
max_iters = 3000
learning_rate = 1e-2
Then introducing several improvements:
Second run at 1:21:28 with train loss 2.39 and val loss 2.41. Still very fast, just a few seconds. Here are some parameters:
learning_rate = 1e-3
n_embd = 32
Multi-Head Attention at 1:24:24 improves to train loss 2.27 and val loss 2.28.
Improved training: the adjusted parameters (residual connections, skip and layernorm) are in line 144 (introduce n_layer and n_head), line 145 (pull out LayerNorm) and line 113 (include dropout). At 1:39:34 the hyperparameters are batch_size = 64, block_size = 256, learning_rate = 3e-4, n_embd = 384, n_head = 6 (so each head has 384/6 = 64 dimensions) and n_layer = 6, which gives a val loss of 1.48. In overview:
batch_size = 64 # from 32
block_size = 256 # from 8
learning_rate = 3e-4 # from 1e-3
n_embd = 384 # from 32
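To get a feeling for how much bigger the scaled-up run is, here is a small sketch that compares the tokens processed per iteration and derives the head size. The numbers are the hyperparameters listed above; the dictionary layout and the helper name `tokens_per_iteration` are my own.

```python
# Hyperparameters of the first bigram run and of the scaled-up model.
small = {"batch_size": 32, "block_size": 8}
scaled = {"batch_size": 64, "block_size": 256, "n_embd": 384, "n_head": 6}


def tokens_per_iteration(cfg):
    """Each iteration sees batch_size sequences of block_size tokens."""
    return cfg["batch_size"] * cfg["block_size"]


small_tokens = tokens_per_iteration(small)
scaled_tokens = tokens_per_iteration(scaled)
head_size = scaled["n_embd"] // scaled["n_head"]

print(f"small run:  {small_tokens:,} tokens per iteration")
print(f"scaled run: {scaled_tokens:,} tokens per iteration")
print(f"factor:     {scaled_tokens // small_tokens}x")
print(f"head size:  {head_size} dimensions per head")
```

The 16,384 tokens per iteration also show up in the nanoGPT training log further down.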
According to 1:40:46 this takes 15 minutes on an A100 GPU. Let’s assume it’s the 40 GB version with 156 TFLOPS on TF32; then the run needed 156e12 × 900 s = 1.4e17 FLOP. If not constrained by having only 8 GB of GDDR6X RAM, this should finish on my 3070 Ti after about 6100 seconds or 1h40. Or, at a price of $2.00 per hour for an A100, I could just rent one for $0.50 to do the calculation.
Memory won’t be the limiting factor, as tested with bigram.py on both CPU and GPU. And the script does not explicitly use TF32 (only 10 bits of fraction/significand/mantissa) but FP32 (23 bits). TF32 still has three more fraction bits than BF16 (7 bits), while all three (FP32, TF32, BF16) have 1 sign bit and an 8-bit exponent. The FP32 performance of the A100 is 19.49 TFLOPS, so for 15 min = 900 seconds the needed compute is 1.75e16 FLOP. With 21.75 TFLOPS FP32 on my 3070 Ti, the training run of 5000 iterations should take 800 seconds or 13 minutes.
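The estimate is just peak throughput times runtime. A minimal sketch of the arithmetic with the numbers from the text, assuming (optimistically) that both GPUs run at their full peak the whole time; the variable names are my own.

```python
seconds_on_a100 = 15 * 60   # 15 minutes according to the video

a100_tf32 = 156e12          # A100 peak with TF32, in FLOP per second
a100_fp32 = 19.49e12        # A100 peak with FP32
rtx3070ti_fp32 = 21.75e12   # 3070 Ti peak with FP32

# Upper bounds for the compute used by the training run.
flop_if_tf32 = a100_tf32 * seconds_on_a100
flop_if_fp32 = a100_fp32 * seconds_on_a100

# Time for the same amount of FP32 work on the 3070 Ti.
seconds_on_3070ti = flop_if_fp32 / rtx3070ti_fp32

print(f"TF32 assumption: {flop_if_tf32:.2e} FLOP")
print(f"FP32 assumption: {flop_if_fp32:.2e} FLOP")
print(f"3070 Ti estimate: {seconds_on_3070ti:.0f} s = {seconds_on_3070ti / 60:.1f} min")
```

Real runs reach only a fraction of the peak, so these values are ceilings rather than predictions.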
Update 2024/07/24: The training run python train.py config/train_shakespeare_char.py actually needs just 27 seconds to compile and another 24 seconds from the first step to the first iteration. With iteration times of around 45 ms, all 5000 iterations to a final loss of 0.82 (step 5000: train loss 0.62, val loss 1.69) need 6:51 minutes or 411 seconds. Dedicated GPU RAM usage increased from 0.6 GB to 2.7 GB, so some 2.1 GB are needed for this exercise. Output from prepare:
mk@i3:~/nanogpt$ python data/shakespeare_char/prepare.py
length of dataset in characters: 1,115,394
all the unique characters:
!$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
vocab size: 65
train has 1,003,854 tokens
val has 111,540 tokens
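The numbers above fit a character-level tokenizer with a 90/10 split between training and validation data. Here is a minimal sketch of that idea, not a copy of prepare.py: the short sample string stands in for the full dataset, and the names `stoi`, `itos`, `encode` and `decode` are my own choices for illustration.

```python
# Stand-in for the full text; the real dataset has 1,115,394 characters.
text = "First Citizen:\nBefore we proceed any further, hear me speak.\n"

# Every unique character becomes one token.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}   # character -> integer
itos = {i: ch for ch, i in stoi.items()}       # integer -> character


def encode(s):
    """Turn a string into a list of integer token ids."""
    return [stoi[c] for c in s]


def decode(ids):
    """Turn a list of integer token ids back into a string."""
    return "".join(itos[i] for i in ids)


# First 90% for training, the rest for validation.
split = int(len(text) * 0.9)
train_ids = encode(text[:split])
val_ids = encode(text[split:])

print(f"vocab size: {len(chars)}")
print(f"train has {len(train_ids)} tokens, val has {len(val_ids)} tokens")

# The same split applied to the full dataset length reproduces the log above.
full_length = 1_115_394
full_train = int(full_length * 0.9)
print(f"{full_train:,} / {full_length - full_train:,}")
```

With one token per character the vocabulary stays tiny (65 entries for the full text), at the price of long sequences.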
And training:
mk@i3:~/nanogpt$ python train.py config/train_shakespeare_char.py
dataset = 'shakespeare_char'
gradient_accumulation_steps = 1
batch_size = 64
block_size = 256 # context of up to 256 previous characters
# baby GPT model :)
n_layer = 6
n_head = 6
n_embd = 384
dropout = 0.2
learning_rate = 1e-3 # with baby networks can afford to go a bit higher
max_iters = 5000
tokens per iteration will be: 16,384
found vocab_size = 65 (inside data/shakespeare_char/meta.pkl)
Initializing a new model from scratch
number of parameters: 10.65M
num decayed parameter tensors: 26, with 10,740,096 parameters
num non-decayed parameter tensors: 13, with 4,992 parameters
using fused AdamW: True
step 5000: train loss 0.6210, val loss 1.6979
iter 5000: loss 0.8202, time 7435.02ms, mfu 7.53%
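The parameter counts in the log can be reproduced by hand. The sketch below makes a few assumptions that I only checked against the logged numbers: a GPT-2 style block with a fused query/key/value projection and a 4x wide MLP, no bias terms, a token embedding shared with the output layer, and a headline count that leaves out the position embedding. All variable names are my own.

```python
n_layer, n_embd = 6, 384
vocab_size, block_size = 65, 256

# Weight matrices inside one transformer block (assumed layout, no biases).
attn_qkv = n_embd * 3 * n_embd    # fused query, key, value projection
attn_proj = n_embd * n_embd       # attention output projection
mlp_up = n_embd * 4 * n_embd      # MLP expansion to 4x width
mlp_down = 4 * n_embd * n_embd    # MLP projection back
per_block = attn_qkv + attn_proj + mlp_up + mlp_down

token_emb = vocab_size * n_embd   # assumed to be shared with the output layer
pos_emb = block_size * n_embd

# Matrices get weight decay, the LayerNorm weights do not.
decayed = n_layer * per_block + token_emb + pos_emb
decayed_tensors = 4 * n_layer + 2
non_decayed = (2 * n_layer + 1) * n_embd   # two LayerNorms per block plus a final one
non_decayed_tensors = 2 * n_layer + 1

# Headline number, assumed to exclude the position embedding.
reported = decayed + non_decayed - pos_emb

print(f"decayed:     {decayed_tensors} tensors, {decayed:,} parameters")
print(f"non-decayed: {non_decayed_tensors} tensors, {non_decayed:,} parameters")
print(f"reported:    {reported / 1e6:.2f}M")
```

Almost all parameters sit in the six blocks; the embeddings for 65 characters are negligible in comparison.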
Update 2024/07/29: Andrej mentions that he uses an A100 GPU. In the video the time needed is 15 minutes; on the website for nanoGPT it is down to just 3 minutes. My 3070 Ti is only half as fast. But there is a cheaper option: in Google Colab you can use a GPU instance called T4. That’s an Nvidia Tesla T4 data center GPU with 16 GB RAM and the Turing microarchitecture (compute capability 7.5 and therefore compatible!). It does not support bf16 and the Triton compiler throws some errors, but you can just add two lines to the top of train.py and it compiles and trains!
import torch._dynamo
torch._dynamo.config.suppress_errors = True
We don’t get the impressive 130 TOPS INT8 performance, and only a fraction of the 8.1 TFLOPS FP32 performance, since we share the GPU. But iteration times of 520 ms are not that bad, and after one hour (Google provides more than 2 hours of runtime for free) the model is trained and ready to use! See my Jupyter Notebook at colab.google or here at GitHub.
Year | Device | TOPS | Watt | FP16 (GFLOPS) | FP32 (GFLOPS) | Price ($) | TOPS/Watt |
---|---|---|---|---|---|---|---|
2017 | TPU v2 | 45 | 280 | - | - | | 0.16 |
2018 | T4 | 130 | 70 | 64800 | 8100 | 900 | 1.86 |
2019 | Coral TPU | 8 | 4 | - | - | 40 | 2.00 |
2021 | 3070 Ti | 22 | 290 | 21750 | 21750 | 500 | 0.08 |
2023 | L4 | 242 | 72 | 121000 | 30300 | 2500 | 3.36 |
2024 | Grayskull e75 | 221 | 75 | | | 600 | 2.95 |
2024 | Wormhole n300s | 466 | 300 | | | 1400 | 1.55 |
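The last column is simply TOPS divided by Watt. A short sketch that recomputes it from the table and sorts the devices by efficiency; the list layout and variable names are my own.

```python
# (device, TOPS, Watt) taken from the table above.
devices = [
    ("TPU v2", 45, 280),
    ("T4", 130, 70),
    ("Coral TPU", 8, 4),
    ("3070 Ti", 22, 290),
    ("L4", 242, 72),
    ("Grayskull e75", 221, 75),
    ("Wormhole n300s", 466, 300),
]

efficiency = {name: tops / watt for name, tops, watt in devices}

# Most efficient device first.
for name, value in sorted(efficiency.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:15s} {value:5.2f} TOPS/W")
```

The TOPS figures mix different number formats between devices, so the ranking is only a rough guide.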