Module 5: Building Your Own ChatGPT (Mini Version!)
Duration: Weeks 10-11
Difficulty: Advanced
Prerequisites: Module 4 completed
—
What You’ll Learn
Ever wondered how ChatGPT works? You’ll build a mini version that can write stories, complete sentences, and even chat (sort of)!
—
Part 1: Tokenization – Teaching Computers to Read
The Problem
Computers only understand numbers, but we want to work with words!
Solution: Tokenization (Breaking text into pieces)
Three approaches:
1. Word-level: “Hello world” → [“Hello”, “world”]
– Problem: Too many unique words (millions!)
2. Character-level: “Hello” → [“H”, “e”, “l”, “l”, “o”]
– Problem: Sequences get very long
3. Subword (BPE): “Hello” → [“Hel”, “lo”]
– Best of both worlds! (This is what GPT uses; a quick comparison of the levels follows below.)
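To see the trade-off concretely, here is a tiny plain-Python comparison of the first two approaches (subword tokenization needs learned merge rules, which the BPE project below builds):

```python
# Word-level vs character-level tokenization of the same string.
text = "Hello world"

word_tokens = text.split()  # ['Hello', 'world'] -> short sequence, but millions of possible words
char_tokens = list(text)    # ['H', 'e', 'l', 'l', 'o', ' ', ...] -> tiny vocabulary, long sequence

print(word_tokens)
print(char_tokens)
```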
Project: Build a BPE Tokenizer
- Start with characters
- Find most common pairs
- Merge them into new tokens
- Code it from scratch in ~50 lines (starter sketch below)!
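A minimal sketch of that training loop, assuming a toy whitespace-split corpus and merges that never cross word boundaries (real tokenizers such as GPT-2’s add byte-level handling and special tokens):

```python
from collections import Counter

def get_pair_counts(words):
    """Count how often each adjacent pair of symbols appears, weighted by word frequency."""
    pairs = Counter()
    for word, freq in words.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` inside every word with one merged symbol."""
    merged = {}
    for word, freq in words.items():
        new_word, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                new_word.append(word[i] + word[i + 1])
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        key = tuple(new_word)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus, num_merges=10):
    # Start at character level: each word is a tuple of single-character symbols.
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair wins
        words = merge_pair(words, best)
        merges.append(best)
    return merges

print(train_bpe("low lower lowest low low newer newest new new"))
```

Each merge adds one new token to the vocabulary; run enough merges on real text and you get subwords like “Hel” + “lo”.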
—
Part 2: Word Embeddings – Word Meanings as Numbers
The Idea
Words with similar meanings should have similar numbers!
King - Man + Woman ≈ Queen
(In number space, this actually works!)
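You can check the famous analogy yourself with pretrained vectors; a minimal sketch assuming gensim and its downloadable glove-wiki-gigaword-50 vectors (any small pretrained embedding works):

```python
import gensim.downloader as api

# Downloads a small set of pretrained 50-dimensional GloVe vectors on first run.
wv = api.load("glove-wiki-gigaword-50")

# "king" - "man" + "woman": rank all words by similarity to the combined vector.
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```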
Word2Vec – Learning Word Meanings
- Idea: Words appearing together have related meanings
- Example: “cat” often appears with “meow”, “fur”, “pet”
- Algorithm: Predict surrounding words
Project: Train Word2Vec
- Dataset: Wikipedia articles (10MB subset)
- Training: 30 minutes
- Demo: Find similar words
– Input: “king” → Output: “queen”, “prince”, “monarch” (training sketch below)
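A minimal sketch of the training step with gensim’s Word2Vec, using a few hand-written sentences as stand-ins for the Wikipedia subset (with so little data the neighbors are noise, but the calls are the same):

```python
from gensim.models import Word2Vec

# Each "sentence" is a list of lowercase tokens; in the project this comes from the Wikipedia subset.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "cat", "chased", "the", "mouse"],
    ["dogs", "and", "cats", "are", "pets"],
]

model = Word2Vec(
    sentences,
    vector_size=100,   # embedding dimension
    window=5,          # context window around each word
    min_count=1,       # keep every word in this toy corpus
    sg=1,              # skip-gram: predict the surrounding words
    epochs=20,
)

print(model.wv.most_similar("cat", topn=3))
```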
—
Part 3: Language Models – Predicting Next Words
What is a Language Model?
Given: “The cat sat on the”
Predict: “mat” (or “floor”, “chair”, etc.)
How GPT Works (Simplified)
1. Convert words to numbers (tokenization + embeddings)
2. Use transformer to understand context
3. Predict probability for each possible next word
4. Pick the most likely one, or sample from the probabilities for variety (the four steps are sketched below)
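The same four steps, sketched with the pretrained GPT-2 from Hugging Face transformers standing in for the mini model you’ll build (assumes torch and transformers are installed):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "The cat sat on the"
ids = tokenizer(text, return_tensors="pt").input_ids   # step 1: words -> token ids

with torch.no_grad():
    logits = model(ids).logits                         # step 2: transformer reads the context
probs = torch.softmax(logits[0, -1], dim=-1)           # step 3: probability of every next token
next_id = torch.argmax(probs)                          # step 4: pick the most likely one

print(tokenizer.decode([int(next_id)]))
```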
Project: Build Mini-GPT
- Architecture: Tiny GPT (10-20M parameters)
- Dataset: TinyStories (simple children’s stories)
- Training: 3-4 hours on CPU
- Result: Generates short coherent stories!
Example output:
Prompt: "Once upon a time, there was a"
Generated: "little girl named Lily. She loved to play..."
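A sketch of what that tiny architecture might look like in PyTorch, using nn.TransformerEncoder with a causal mask as a decoder-only stack (nanoGPT writes the attention blocks by hand; the sizes here land roughly in the 10-20M parameter range):

```python
import torch
import torch.nn as nn

class MiniGPT(nn.Module):
    """A small decoder-only transformer in the nanoGPT spirit, sized for a laptop."""

    def __init__(self, vocab_size=5000, block_size=256, d_model=384, n_head=6, n_layer=6):
        super().__init__()
        self.block_size = block_size
        self.tok_emb = nn.Embedding(vocab_size, d_model)     # token embeddings
        self.pos_emb = nn.Embedding(block_size, d_model)     # learned position embeddings
        layer = nn.TransformerEncoderLayer(
            d_model, n_head, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True, activation="gelu",
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layer)
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, idx):                                   # idx: (batch, seq) of token ids
        b, t = idx.shape
        pos = torch.arange(t, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)             # token + position embeddings
        mask = nn.Transformer.generate_square_subsequent_mask(t).to(idx.device)
        x = self.blocks(x, mask=mask)                         # causal mask: no peeking at future tokens
        return self.head(self.ln_f(x))                        # logits over the vocabulary

model = MiniGPT()
print(sum(p.numel() for p in model.parameters()) / 1e6, "M parameters")
```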
—
ISL Optimization – Making It Fit in 16GB
1. Vocabulary Pruning
- Full GPT: 50,000 tokens
- Your model: 5,000 tokens (only common words)
- Result: The embedding and output layers shrink about 10x!
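A sketch of the pruning step, assuming you count token frequencies over your training text and map everything rare to an &lt;unk&gt; token (the helper names here are illustrative):

```python
from collections import Counter

def build_small_vocab(token_stream, vocab_size=5000):
    """Keep only the most frequent tokens; everything else maps to <unk>."""
    counts = Counter(token_stream)
    keep = [tok for tok, _ in counts.most_common(vocab_size - 2)]
    vocab = {"<unk>": 0, "<pad>": 1}
    for tok in keep:
        vocab[tok] = len(vocab)
    return vocab

def encode(tokens, vocab):
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]

tokens = "the cat sat on the mat and the cat slept".split()
vocab = build_small_vocab(tokens, vocab_size=8)
print(encode(tokens, vocab))
```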
2. Gradient Accumulation
- Problem: Want batch size 64, but only 8 fits in RAM
- Solution: Process 8 at a time and add up the gradients across 8 steps before updating
- Result: Effectively the same as batch size 64 (see the sketch below)
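A minimal sketch of the accumulation loop; the tiny linear model and random batches are stand-ins so it runs on its own (in the project, these would be your MiniGPT, AdamW, and the TinyStories loader):

```python
import torch
import torch.nn as nn

# Toy stand-ins so the loop runs by itself.
model = nn.Linear(16, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loader = [(torch.randn(8, 16), torch.randint(0, 4, (8,))) for _ in range(32)]

accum_steps = 8                              # 8 micro-batches of size 8 ~ one batch of 64

optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = nn.functional.cross_entropy(model(x), y)
    (loss / accum_steps).backward()          # scale so the summed gradients average over the big batch
    if (step + 1) % accum_steps == 0:
        optimizer.step()                     # one weight update per 8 micro-batches
        optimizer.zero_grad()
```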
3. KV-Cache (For Fast Generation)
- Don’t recalculate attention for words already processed
- Save “key” and “value” matrices
- Result: 10x faster text generation!
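Hugging Face’s GPT-2 exposes its KV-cache as past_key_values, which makes the idea easy to see; your own mini-GPT would store per-layer key/value tensors the same way (sketch assumes transformers is installed):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

ids = tokenizer("Once upon a time, there was a", return_tensors="pt").input_ids
past = None                                      # the KV-cache starts empty

with torch.no_grad():
    for _ in range(20):
        # With a cache we only feed the newest token; keys/values of earlier tokens are reused.
        inputs = ids if past is None else ids[:, -1:]
        out = model(inputs, past_key_values=past, use_cache=True)
        past = out.past_key_values               # updated cache with this step's keys/values
        next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=1)

print(tokenizer.decode(ids[0]))
```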
4. 8-bit Quantization
- Store weights as 8-bit instead of 32-bit
- Result: 4x smaller model
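The core idea in a few lines of PyTorch: replace each float32 weight tensor with int8 values plus one scale factor (libraries like bitsandbytes do this per-channel and handle the matrix multiplies for you):

```python
import torch

def quantize_int8(w):
    """Symmetric per-tensor quantization: store int8 values plus one float scale."""
    scale = w.abs().max() / 127.0
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return q.float() * scale

w = torch.randn(384, 384)                 # a weight matrix about the size of one MiniGPT layer
q, scale = quantize_int8(w)

print("fp32 bytes:", w.numel() * 4, "| int8 bytes:", q.numel())
print("max rounding error:", (w - dequantize(q, scale)).abs().max().item())
```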
—
Resources
- Andrej Karpathy’s “Let’s build GPT”
- TinyStories dataset
- nanoGPT code
- WikiText for word embeddings
—
Learning Checklist
- [ ] Build BPE tokenizer from scratch
- [ ] Train Word2Vec embeddings
- [ ] Implement mini-GPT architecture
- [ ] Generate coherent text
- [ ] Apply vocabulary pruning
- [ ] Use gradient accumulation
- [ ] Implement KV-cache optimization
—
Next Steps
Module 6: Advanced Computer Vision
Build ResNets, object detectors, and vision transformers!
References & Further Reading
Dive deeper with these carefully selected resources:
- GPT-3 paper: “Language Models are Few-Shot Learners” (Brown et al.)
- “Let’s build GPT” by Andrej Karpathy
- Hugging Face NLP Course
Related Topics
- How GPT Models Generate Text
- Tokenization: BPE vs WordPiece
- Word Embeddings: Word2Vec and GloVe