nano-gpt

nanoGPT

Create a generative pre-trained transformer neural network following the video by Andrej Karpathy from January 2023. ChatGPT had just been released a few weeks earlier on November 30, 2022 by OpenAI. Andrej is one of the co-founders of OpenAI. The first model, GPT-1 was released in June 2018.

Source

Video: Let’s build GPT: from scratch, in code, spelled out. from January 2023 for infinite Shakespeare 1:56:19.

Papers

Hardware requirements

While the project can be compiled even on a CPU for best results you would use a GPU. But not any GPU, it needs to be from Nvidia to use the CUDA capabilities from torch. The software stack for AMD is still a WIP. And I have several Nvidia GPUs. To test your own setup under Windows with WSL and Ubuntu 22.04 LTS (don’t use 24.04 LTS since it has python 3.12 which is a little to new) with the following few lines:

pip install torch numpy transformers datasets tiktoken wandb tqdm
git clone https://github.com/karpathy/nanogpt
cd nanogpt
python data/shakespeare_char/prepare.py
python train.py config/train_shakespeare_char.py
python sample.py --out_dir=out-shakespeare-char

On my slightly older GTX 960 I get the warning after the training call:

torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
RuntimeError: Found NVIDIA GeForce GTX 960 which is too old to be supported by the triton GPU compiler, which is used
as the backend. Triton only supports devices of CUDA Capability >= 7.0, but your device is of CUDA capability 5.2

Let’s see what I have and what CUDA capabilities these support:

GPU name CUDA cores Compute Capability at architecture RAM GB
Quadro FX580 32 1.1 hp Z600 Tesla (2006) 0.5
GTX 650 384 3.0 hp Z600 Kepler (2012) 1
Jetson Nano 128 5.3   Maxwell (2014) 4
GT750M 384 3.0 MBPr15 2014 Kepler (2012) 0.5
M1000M 512 5.0 Zbook 15 G3 Kepler (2012) 1
GTX960 1024 5.2 E5-2696 v3 Maxwell (2014) 2
T4 2560 7.5 Google Colab Turing (2018) 16
RTX3070 Ti 6144 8.6 i3 10100 Ampere (2020) 8

Only one of 7 is supported by the Triton GPU compiler. How about a newer GPU?

GeForce series CUDA Architecture Process Year
900 5.2 Maxwell 28HP 2014
10 6.1 Pascal 16FF 2016
16 7.5 Turing 12FFN 2018
20 7.5 Turing 12FFN 2018
30 8.6 Ampere 8LPP 2020
40 8.9 Ada Lovelace 4N 2022
50 10.0 Blackwell 4NP 2024

Looks like at least series 16 or 20, but probably 30 to be sure when future compilers increase to 8.0.

History

This is just for my own sanity. A brief overview of the development of machine learning ML, image classification, object detection, self-driving car expectations, generative pre-trained transformer GPT, AI, AGI, general and generative AI. The first GPT was published by OpenAI in 2018 following the 2017 paper Attention Is All You Need.

Manual labeling images for Machine Learning - 2007 Image Net

Fei Fei Li explains in her TET talk in 2015 that her group at Stanford started in 2007 to classify around 1,000,000,000 pictures and used Amazon Mechanical Turk with 48,940 workers in 167 countries to create the image database ImageNet. By 2009 the database https://www.image-net.org/ had 14,197,122 labeled images in 21,841 categories. And funding back then was a problem!

ILSVRC competition - 2010

The initial approach was to look for features manually in the pictures (hard-coded) to classify them correctly. The top-5 error rate in 2010 was 28.2% with this approach. A deep neural network for deep learning would make big progress 2 years later!

2015 ILSVRC

Progress in Image Classification - 2012 AlexNet with a convolutional neural network - CNN

To win the ImageNet 2012 challenge AlexNet used GPUs for its concolutional neural network CNN. Its running on CUDA. This competition ILSVRC started in 2010 and ran until 2017. The creator of ImageNet Fei Fei Li gave an inspiring TED talk in 2015 including the AlexNet solution. Self driving cars are mentioned at minute 1:30. And how much a three-year old can outperform a computer. And Elon Musk is talking since 2014 that autonomous driving is just a year away.

AlexNet 2012

Deep Learning for Computer Vision - 2016 Andrej Karpathy

A great talk about Deep Learning by Andrej Karpathy from September 25, 2016 explains in detail and clarity the different layers in AlexNet and further develpments. Here is an overview of the layers for AlexNet, and a benchmark tests the FLOPS for each layer with your current GPU is shown below:

layer size architecture memory parameter
INPUT 227x227x3   154,587 0
CONV1 55x55x96 96 11x11 filters at stride 4, pad 0 290,400 11,616
MAX POOL1 27x27x96 3x3 filters at stride 2 69,984 0
NORM1 27x27x96 Normalization layer 69,984 0
CONV2 27x27x256 256 5x5 filters at stride 1, pad 2 186,624 6,400
MAX POOL2 13x13x256 3x3 filters at stride 2 43,264 0
NORM 2 13x13x256 Normalization layer 43,264 0
CONV3 13x13x384 384 3x3 filters at stride 1, pad 1 64,896 3,456
CONV4 13x13x384 384 3x3 filters at stride 1, pad 1 64,896 3,456
CONV5 13x13x256 256 3x3 filters at stride 1, pad 1 43,264 2,304
MAX POOL3 6x6x256 3x3 filters at stride 2 9,216 0
FC6 4096   4,096 0
FC7 4096   4,096 0
FC8 1000 100 neurons (class scores) 1,000 0
      1,049,571 27,232

Benchmark run calflops (and with this model far from the 22 TFLOPS possible):

(venv) mk@i3:~/test$ python pytorch-calflops.py

------------------------------------- Calculate Flops Results -------------------------------------
Notations:
number of parameters (Params), number of multiply-accumulate operations(MACs),
number of floating-point operations (FLOPs), floating-point operations per second (FLOPS),
fwd FLOPs (model forward propagation FLOPs), bwd FLOPs (model backward propagation FLOPs),
default model backpropagation takes 2.00 times as much computation as forward propagation.

Total Training Params:                                                  61.1 M
fwd MACs:                                                               714.188 MMACs
fwd FLOPs:                                                              1.4297 GFLOPS
fwd+bwd MACs:                                                           2.1426 GMACs
fwd+bwd FLOPs:                                                          4.2892 GFLOPS
---------------------------------------------------------------------------------------------------
Alexnet FLOPs:1.4297 GFLOPS   MACs:714.188 MMACs   Params:61.1008 M

Source material from Stanford:

Object detection - 2017 YOLO

The TED Talk from August 2017 by Joseph Redmon from Washington University inspired ideas and possibilities. In his talk he talked about the application for self driving cars. It certainly makes it imaginable, and it was just running on his laptop!

The transformer model of machine learning and neuronal networks - June 2017

First GPT released by OpenAI - June 2018

OpenAI released its first Generative Pre-trained Transformer 1 (GPT-1) following the transformer architecture published by Google in 2017. https://openai.com/index/language-unsupervised/

Application in diploma thesis in Saigon - 2018

A student at HCMUTE used this YOLO (You Only Look Once) software with an NVidia graphics card in his master thesis 2018 for an part in developing an autonomous car. But the GTX 980 with 2048 CUDA cores need 165 Watt power to work. The power requirement is a challenge for a mobile application.

Tensorflow.lite and posenet on new RX 470 GPU - March 2019

The crypto-boom was over and graphics cards available again. So I started with an RX 470 in March 2019. In October 2018 I had already successfully installed Darknet on Ubuntu. The new repository https://github.com/kreier/ml

Mobile affordable platform - 2019 Jetson Nano

Nvidia had the Jetson TK1 platform already created in 2014, but it was power hungry and expensive. But for just 100 Dollar the Jetson Nano was announced in 2019 to be affordable for student projects with just using 5-10 Watt. We applied for a project at hackster.io. But ultimately we ordered a 4GB development platfrom at the end of 2019 for student projects.

The drive base from TAE was delayed early 2020 because of the starting COVID-19 pandemic and shipments from China. But eventually we got the drive base. Yet the project https://github.com/kreier/jetson-car stalled from 2020-2024.

Jetson and car

OpenAI releases ChatGPT and the world takes note - November 2022

The underlying model GPT-3.5 was already released in March 2022. But in order to make it an assistant you can talk to some finetuning had to be done. See blog post https://openai.com/index/chatgpt/

Finetuning ChatGPT

And ever since the interface was available on the web with ChatGPT since November 2022 the world took note. It ran in news outlets and personal conversations worldwide and within two years NVIDIA became the most valuable company in the world.

Challenges of self driving - 2024

It seemed like the combination of object detection and classification was the main problem to get self driving cars possible, together with latency. As it turns out, there is much more to do. Additinal sensors like LIDAR really help, even Tesla finally gave in. And here are some example videos to show whats possible and what is still challenging:

Compute power needed for GPTs

First run of bigram.py at 0:41:44 with 3000 iterations in just a few seconds

batch_size = 32
block_size = 8
max_iters = 3000
learning_rate = 1e-2

Then introducing several improvements:

Second run at 1:21:28 with train loss 2.39 and val loss 2.41. Still very fast, just a few seconds. Here are some parameters:

learning_rate = 1e-3
n_embd = 32

Multi-Head Attention at 1:24:24 improves to train loss 2.27 and val loss 2.28.

Improved training: The adjusted parameters (residual connections, skip and layernorm) are in the lines 144 introduce n_layer and n_head, 145 pull out LayerNorm, 113 include dropout, 1:39:34 hyperparameters batch_size = 64, block_size = 256, learning_rate = 3e-4, n_embd = 384, n_head = 6 so each head has 384/6 = 64 dimensions, n_layer = 6 to have a val loss 1.48. In overview:

batch_size = 64       # from 32
block_size = 256      # from 8
learning_rate = 3e-4  # from 1e-3
n_embd = 384          # from 32

According to 1:40:46 this leads to 15 minutes on a A100 GPU. Let’s assume its a 40 GB version with 156 TFLOPS on TF32 this needed 1.4e17 FLOP. If not constrained by having only 8GB of GDDR6X RAM then this should finish on my 3070 Ti after 6100 seconds or 1h40. Or with a price of $2.00 per hour for an A100 I could just rent it for $0.50 to do the calculation.

Memory won’t be the limiting factor, as tested with bigram.py on both CPU and GPU. And the script does not explicitly uses TF32 (only 10 bits fraction/significand/mantissa) but FP32 (23 bits). That’s three more bits for fraction than BF16 (7 bits), while all three (FP32, TF32, BF16) have 1 sign bit and 8 bit exponent. The FP32 power of the A100 is 19.49 TFLOPS, so for 15 min = 900 seconds the needed compute is 1.75e16. Having 21.75 TFLOPS FP32 on my 3070 Ti the training run of 5000 iterations should take 800 seconds or 13 minutes.

Update 2024/07/24: The training run python train.py config/train_shakespeare_char.py actually needs just needs 27 seconds to compile, another 24 seconds from the first step to the first iteration. And with iteration times for around 45 ms all 5000 iterations to a final loss of 0.82 (step 5000: train loss 0.62, val loss 1.69) need 6:51 minutes or 411 seconds. GPU dedicated RAM increased from 0.6 GB to 2.7 GB, some 2.1 GB are needed for this exercise. Output from prepare:

mk@i3:~/nanogpt$ python data/shakespeare_char/prepare.py
length of dataset in characters: 1,115,394
all the unique characters:
 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
vocab size: 65
train has 1,003,854 tokens
val has 111,540 tokens

And training:

mk@i3:~/nanogpt$ python train.py config/train_shakespeare_char.py
dataset = 'shakespeare_char'
gradient_accumulation_steps = 1
batch_size = 64
block_size = 256 # context of up to 256 previous characters

# baby GPT model :)
n_layer = 6
n_head = 6
n_embd = 384
dropout = 0.2

learning_rate = 1e-3 # with baby networks can afford to go a bit higher
max_iters = 5000

tokens per iteration will be: 16,384
found vocab_size = 65 (inside data/shakespeare_char/meta.pkl)
Initializing a new model from scratch
number of parameters: 10.65M
num decayed parameter tensors: 26, with 10,740,096 parameters
num non-decayed parameter tensors: 13, with 4,992 parameters
using fused AdamW: True

step 5000: train loss 0.6210, val loss 1.6979
iter 5000: loss 0.8202, time 7435.02ms, mfu 7.53%

Update 2024/07/29: Andrej mentions that he uses a A100 GPU. In the video the time needed is 15 minutes, on the website for nanoGPT it is down to just 3 minutes. My 3070 Ti is only half as fast. But there is a cheaper option: in Google Colab you can use a GPU instance called T4. That’s a Nvidia Tesla T4 data center GPU with 16 GB RAM and Turing Microarchitecture (CUDA 7.5 and therefore compatible!). It does not support bf16 and the CUDA compiler Triton throws some errors, but you can just add two lines to the top of the train.py and it compiles and trains!

import torch._dynamo
torch._dynamo.config.suppress_errors = True

We don’t get the impressive 130 TOPS INT8 performance, and only a fraction of the 8.1 TFLOPS FP32 performance since we share it. But the cycle times of 520 ms are not that bad, and after one hour (Google provides more than 2 hours runtime for free) the model is trained and ready to use! See my Jupyter Notebook at colab.google or here at Github.

Energy efficiency

year Device TOPS Watt FP16 FP32 Price TOPS/Watt
2017 TPU v2 45 280 - -   0.16
2018 T4 130 70 64800 8100 900 1.86
2019 Coral TPU 8 4 - - 40 2.00
2021 3070 Ti 22 290 21750 21750 500 0.07
2023 L4 242 72 121000 30300 2500 3.36
2024 Grayskull e75 221 75     600 2.95
2024 Wormhole n300s 466 300     1400 1.55