Implementation of Karpathy's Neural Networks: Zero to Hero Lecture Series
📘 Introduction
This blog post presents my detailed implementation of Andrej Karpathy’s Neural Networks: Zero to Hero YouTube lecture series and its exercises in Jupyter Notebook. The accompanying articles delve deeply into each topic to ensure a thorough and robust understanding of neural networks. The lecture series covers neural networks (NNs) and demonstrates how to build them from scratch in code. It begins with the basics of backpropagation, then moves on to multi-layer perceptrons (MLPs) and convolutional neural networks (CNNs), and ultimately builds up to modern deep neural networks such as large language models (LLMs), including generative pre-trained transformers (GPTs), as well as LLM tokenization via Byte Pair Encoding (BPE). The course also introduces and explains diagnostic tools for understanding neural network dynamics and performance. The primary focus is on language modeling (LM), as language models provide an excellent foundation for learning deep learning concepts, and most of the skills acquired here are immediately transferable to other areas of deep learning, such as computer vision (CV). The full project can be found on GitHub.
Four engines are built and leveraged in this lecture series: `micrograd`, `makemore`, `gpt`, and `minBPE`. The first two are deliberately not heavyweight libraries with a billion switches and knobs: each exists as a single hackable file and is intended mostly for educational purposes. Python and PyTorch are the only requirements.
- `micrograd`: A tiny autograd (automatic gradient) engine that implements backpropagation (reverse-mode autodiff) over a dynamically built DAG (directed acyclic graph), with a small neural-network library on top of it exposing a PyTorch-like API. It is a minimalistic, scalar-valued automatic-differentiation (autodiff) engine in Python (a minimal sketch of the core idea follows this list).
- `makemore`: Takes one text file as input, where each line is assumed to be one training example, and generates more things like it. Under the hood, it is an autoregressive character-level language model with a wide choice of models, from bigrams all the way to a Transformer (exactly as seen in GPT). For example, we can feed it a database of names, and makemore will generate cool baby-name ideas that all sound name-like but are not already existing names. If we feed it a database of company names, it generates new ideas for company names; feed it valid Scrabble words and it generates English-like babble. "As the name suggests, makemore makes more." (A bigram sketch follows this list.)
- `gpt`: The Generative Pre-trained Transformer, otherwise known as GPT, is a large language model (LLM) trained on a large corpus of text data to understand and generate human-like text sequentially. The "transformer" part of the name refers to the model's architecture, introduced in the 2017 paper "Attention Is All You Need". GPTs are based on the transformer architecture, pre-trained on large datasets of unlabelled text, and able to generate novel human-like content. (A self-attention sketch follows this list.)
- `minBPE`: A minimal, clean implementation of the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization, based on the paper "Neural Machine Translation of Rare Words with Subword Units". This tokenizer engine handles the crucial preprocessing step that converts raw text into tokens that language models can understand. BPE works at the byte level, processing UTF-8-encoded strings to efficiently handle a wide array of human languages and symbols. The `minBPE` tokenizer can train a vocabulary and merge rules on text data, encode text to tokens, and decode tokens back to text, making it an essential component of the modern LLM pipeline used by models like GPT, Llama, and Mistral. (A BPE merge-loop sketch follows this list.)
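To make the micrograd idea concrete, here is a minimal sketch of a scalar-valued reverse-mode autodiff engine: each `Value` remembers its children and a small closure that propagates gradients, and `backward()` walks the DAG in reverse topological order applying the chain rule. This is a simplified illustration, not micrograd's exact code; the real engine supports more operations and a small neural-network library on top.

```python
# Minimal sketch of a scalar reverse-mode autodiff engine (micrograd-style).
class Value:
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None   # closure that propagates grad to children
        self._prev = set(_children)

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # topologically sort the DAG, then apply the chain rule node by node
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

# usage: d = a*b + c, so dd/da = b, dd/db = a, dd/dc = 1
a, b, c = Value(2.0), Value(-3.0), Value(10.0)
d = a * b + c
d.backward()
print(a.grad, b.grad, c.grad)  # -3.0 2.0 1.0
```

Accumulating into `.grad` with `+=` matters because a node can feed into multiple downstream expressions, and each path contributes to its gradient.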
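The simplest model in the makemore progression is a count-based bigram character-level language model. The sketch below assumes a `names.txt` dataset with one lowercase name per line and is illustrative rather than a copy of the repository's code:

```python
import torch

# Sketch of a count-based bigram character model (the simplest makemore model).
# Assumes names.txt contains one lowercase name per line (illustrative).
words = open('names.txt').read().splitlines()
chars = sorted(set(''.join(words)))
stoi = {s: i + 1 for i, s in enumerate(chars)}
stoi['.'] = 0                       # '.' marks both start and end of a name
itos = {i: s for s, i in stoi.items()}

# count every bigram (ch1 -> ch2) over the training set
N = torch.zeros((len(stoi), len(stoi)), dtype=torch.int32)
for w in words:
    cs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(cs, cs[1:]):
        N[stoi[ch1], stoi[ch2]] += 1

# normalize rows into probabilities and sample new "names"
P = (N + 1).float()                 # add-one smoothing avoids zero probabilities
P /= P.sum(dim=1, keepdim=True)
g = torch.Generator().manual_seed(2147483647)
for _ in range(5):
    ix, out = 0, []
    while True:
        ix = torch.multinomial(P[ix], num_samples=1, generator=g).item()
        if ix == 0:                 # sampled the end-of-name token
            break
        out.append(itos[ix])
    print(''.join(out))
```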
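At the heart of the GPT lecture is scaled dot-product self-attention with a causal mask. A minimal single-head sketch in PyTorch (dimensions and variable names are illustrative, not the notebook's exact code) looks roughly like this:

```python
import torch
import torch.nn.functional as F

# Minimal single-head causal self-attention sketch (illustrative dimensions).
torch.manual_seed(1337)
B, T, C = 4, 8, 32            # batch size, sequence length, embedding size
head_size = 16
x = torch.randn(B, T, C)

key   = torch.nn.Linear(C, head_size, bias=False)
query = torch.nn.Linear(C, head_size, bias=False)
value = torch.nn.Linear(C, head_size, bias=False)

k, q, v = key(x), query(x), value(x)             # each (B, T, head_size)
wei = q @ k.transpose(-2, -1) * head_size**-0.5   # (B, T, T) scaled affinities
tril = torch.tril(torch.ones(T, T))
wei = wei.masked_fill(tril == 0, float('-inf'))   # causal mask: no peeking ahead
wei = F.softmax(wei, dim=-1)                      # each row sums to 1
out = wei @ v                                     # weighted sum of values
print(out.shape)  # torch.Size([4, 8, 16])
```

The scaling by `head_size**-0.5` keeps the pre-softmax affinities from growing with the head dimension, which would otherwise make the softmax overly peaky.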
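Finally, the essence of BPE training in minBPE is a short loop: count the most frequent adjacent pair of tokens and merge it into a new token, repeating until the desired vocabulary size is reached. The following sketch works over raw UTF-8 bytes and is illustrative rather than the repository's exact API:

```python
from collections import Counter

# Sketch of the core BPE training loop over raw UTF-8 bytes (illustrative).
def get_pair_counts(ids):
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    # replace every occurrence of `pair` in `ids` with the token `new_id`
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

text = "aaabdaaabac"                  # toy training text
ids = list(text.encode("utf-8"))      # start from raw bytes (tokens 0..255)
merges = {}
num_merges = 3
for i in range(num_merges):
    pair = get_pair_counts(ids).most_common(1)[0][0]
    new_id = 256 + i                  # new tokens start after the 256 byte values
    ids = merge(ids, pair, new_id)
    merges[pair] = new_id
    print(f"merged {pair} -> {new_id}")
print(ids)
```

Decoding reverses the process: each learned token expands back into the pair it replaced, all the way down to raw bytes, which are then decoded as UTF-8.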
📘 Lecture Notebooks
The implementation of each lecture can be found below:
- Lecture 1: micrograd - View Notebook →
- Lecture 2: makemore 1 (Bigram model) - View Notebook →
- Lecture 3: makemore 2 (Multi-Layer Perceptron) - View Notebook →
- Lecture 4: makemore 3 (Batch Normalization) - View Notebook →
- Lecture 5: makemore 4 (Backprop Ninja) - View Notebook →
- Lecture 6: makemore 5 (WaveNet) - View Notebook →
- Lecture 7: GPT (Transformer from scratch) - View Notebook →
- Lecture 8: minBPE (GPT Tokenizer with Byte Pair Encoding [BPE]) - View Notebook →