Module 5: Building Your Own ChatGPT (Mini Version!)
Duration: Weeks 10-11
Difficulty: Advanced
Prerequisites: Module 4 completed
—
What You’ll Learn
Ever wondered how ChatGPT works? You’ll build a mini version that can write stories, complete sentences, and even chat (sort of)!
—
Part 1: Tokenization – Teaching Computers to Read
The Problem
Computers only understand numbers, but we want to work with words!
Solution: Tokenization (Breaking text into pieces)
Three approaches:
1. Word-level: “Hello world” → [“Hello”, “world”]
– Problem: Too many unique words (millions!)
2. Character-level: “Hello” → [“H”, “e”, “l”, “l”, “o”]
– Problem: Sequences get very long
3. Subword (BPE): “Hello” → [“Hel”, “lo”]
– Best of both worlds! (This is what GPT uses; a quick comparison of the levels follows below.)
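To see the trade-off concretely, here is a tiny plain-Python comparison of the first two approaches (subword tokenization needs learned merge rules, which the BPE project below builds):

```python
# Word-level vs character-level tokenization of the same string.
text = "Hello world"

word_tokens = text.split()  # ['Hello', 'world'] -> short sequence, but millions of possible words
char_tokens = list(text)    # ['H', 'e', 'l', 'l', 'o', ' ', ...] -> tiny vocabulary, long sequence

print(word_tokens)
print(char_tokens)
```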
Project: Build a BPE Tokenizer
- Start with characters
- Find most common pairs
- Merge them into new tokens
- Code it from scratch in ~50 lines (starter sketch below)!
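A minimal sketch of that training loop, assuming a toy whitespace-split corpus and merges that never cross word boundaries (real tokenizers such as GPT-2’s add byte-level handling and special tokens):

```python
from collections import Counter

def get_pair_counts(words):
    """Count how often each adjacent pair of symbols appears, weighted by word frequency."""
    pairs = Counter()
    for word, freq in words.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` inside every word with one merged symbol."""
    merged = {}
    for word, freq in words.items():
        new_word, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                new_word.append(word[i] + word[i + 1])
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        key = tuple(new_word)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus, num_merges=10):
    # Start at character level: each word is a tuple of single-character symbols.
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair wins
        words = merge_pair(words, best)
        merges.append(best)
    return merges

print(train_bpe("low lower lowest low low newer newest new new"))
```

Each merge adds one new token to the vocabulary; run enough merges on real text and you get subwords like “Hel” + “lo”.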
—
Part 2: Word Embeddings – Word Meanings as Numbers
The Idea
Words with similar meanings should have similar numbers!
King - Man + Woman ≈ Queen
(In number space, this actually works!)
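You can check the famous analogy yourself with pretrained vectors; a minimal sketch assuming gensim and its downloadable glove-wiki-gigaword-50 vectors (any small pretrained embedding works):

```python
import gensim.downloader as api

# Downloads a small set of pretrained 50-dimensional GloVe vectors on first run.
wv = api.load("glove-wiki-gigaword-50")

# "king" - "man" + "woman": rank all words by similarity to the combined vector.
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```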
Word2Vec – Learning Word Meanings
- Idea: Words appearing together have related meanings
- Example: “cat” often appears with “meow”, “fur”, “pet”
- Algorithm: Predict surrounding words
Project: Train Word2Vec
- Dataset: Wikipedia articles (10MB subset)
- Training: 30 minutes
- Demo: Find similar words
– Input: “king” → Output: “queen”, “prince”, “monarch” (training sketch below)
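A minimal sketch of the training step with gensim’s Word2Vec, using a few hand-written sentences as stand-ins for the Wikipedia subset (with so little data the neighbors are noise, but the calls are the same):

```python
from gensim.models import Word2Vec

# Each "sentence" is a list of lowercase tokens; in the project this comes from the Wikipedia subset.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "cat", "chased", "the", "mouse"],
    ["dogs", "and", "cats", "are", "pets"],
]

model = Word2Vec(
    sentences,
    vector_size=100,   # embedding dimension
    window=5,          # context window around each word
    min_count=1,       # keep every word in this toy corpus
    sg=1,              # skip-gram: predict the surrounding words
    epochs=20,
)

print(model.wv.most_similar("cat", topn=3))
```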
—
Part 3: Language Models – Predicting Next Words
What is a Language Model?
Given: “The cat sat on the”
Predict: “mat” (or “floor”, “chair”, etc.)
How GPT Works (Simplified)
1. Convert words to numbers (tokenization + embeddings)
2. Use transformer to understand context
3. Predict probability for each possible next word
4. Pick the most likely one, or sample from the probabilities for variety (the four steps are sketched below)
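The same four steps, sketched with the pretrained GPT-2 from Hugging Face transformers standing in for the mini model you’ll build (assumes torch and transformers are installed):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "The cat sat on the"
ids = tokenizer(text, return_tensors="pt").input_ids   # step 1: words -> token ids

with torch.no_grad():
    logits = model(ids).logits                         # step 2: transformer reads the context
probs = torch.softmax(logits[0, -1], dim=-1)           # step 3: probability of every next token
next_id = torch.argmax(probs)                          # step 4: pick the most likely one

print(tokenizer.decode([int(next_id)]))
```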
Project: Build Mini-GPT
- Architecture: Tiny GPT (10-20M parameters)
- Dataset: TinyStories (simple children’s stories)
- Training: 3-4 hours on CPU
- Result: Generates short coherent stories!
Example output:
Prompt: "Once upon a time, there was a"
Generated: "little girl named Lily. She loved to play..."
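A sketch of what that tiny architecture might look like in PyTorch, using nn.TransformerEncoder with a causal mask as a decoder-only stack (nanoGPT writes the attention blocks by hand; the sizes here land roughly in the 10-20M parameter range):

```python
import torch
import torch.nn as nn

class MiniGPT(nn.Module):
    """A small decoder-only transformer in the nanoGPT spirit, sized for a laptop."""

    def __init__(self, vocab_size=5000, block_size=256, d_model=384, n_head=6, n_layer=6):
        super().__init__()
        self.block_size = block_size
        self.tok_emb = nn.Embedding(vocab_size, d_model)     # token embeddings
        self.pos_emb = nn.Embedding(block_size, d_model)     # learned position embeddings
        layer = nn.TransformerEncoderLayer(
            d_model, n_head, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True, activation="gelu",
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layer)
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, idx):                                   # idx: (batch, seq) of token ids
        b, t = idx.shape
        pos = torch.arange(t, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)             # token + position embeddings
        mask = nn.Transformer.generate_square_subsequent_mask(t).to(idx.device)
        x = self.blocks(x, mask=mask)                         # causal mask: no peeking at future tokens
        return self.head(self.ln_f(x))                        # logits over the vocabulary

model = MiniGPT()
print(sum(p.numel() for p in model.parameters()) / 1e6, "M parameters")
```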
—
ISL Optimization – Making It Fit in 16GB
1. Vocabulary Pruning
- Full GPT: 50,000 tokens
- Your model: 5,000 tokens (only common words)
- Result: The embedding and output layers shrink about 10x!
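A sketch of the pruning step, assuming you count token frequencies over your training text and map everything rare to an &lt;unk&gt; token (the helper names here are illustrative):

```python
from collections import Counter

def build_small_vocab(token_stream, vocab_size=5000):
    """Keep only the most frequent tokens; everything else maps to <unk>."""
    counts = Counter(token_stream)
    keep = [tok for tok, _ in counts.most_common(vocab_size - 2)]
    vocab = {"<unk>": 0, "<pad>": 1}
    for tok in keep:
        vocab[tok] = len(vocab)
    return vocab

def encode(tokens, vocab):
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]

tokens = "the cat sat on the mat and the cat slept".split()
vocab = build_small_vocab(tokens, vocab_size=8)
print(encode(tokens, vocab))
```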
2. Gradient Accumulation
- Problem: Want batch size 64, but only 8 fits in RAM
- Solution: Process 8 at a time and add up the gradients across 8 steps before updating
- Result: Effectively the same as batch size 64 (see the sketch below)
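A minimal sketch of the accumulation loop; the tiny linear model and random batches are stand-ins so it runs on its own (in the project, these would be your MiniGPT, AdamW, and the TinyStories loader):

```python
import torch
import torch.nn as nn

# Toy stand-ins so the loop runs by itself.
model = nn.Linear(16, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loader = [(torch.randn(8, 16), torch.randint(0, 4, (8,))) for _ in range(32)]

accum_steps = 8                              # 8 micro-batches of size 8 ~ one batch of 64

optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = nn.functional.cross_entropy(model(x), y)
    (loss / accum_steps).backward()          # scale so the summed gradients average over the big batch
    if (step + 1) % accum_steps == 0:
        optimizer.step()                     # one weight update per 8 micro-batches
        optimizer.zero_grad()
```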
3. KV-Cache (For Fast Generation)
- Don’t recalculate attention for words already processed
- Save “key” and “value” matrices
- Result: 10x faster text generation!
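Hugging Face’s GPT-2 exposes its KV-cache as past_key_values, which makes the idea easy to see; your own mini-GPT would store per-layer key/value tensors the same way (sketch assumes transformers is installed):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

ids = tokenizer("Once upon a time, there was a", return_tensors="pt").input_ids
past = None                                      # the KV-cache starts empty

with torch.no_grad():
    for _ in range(20):
        # With a cache we only feed the newest token; keys/values of earlier tokens are reused.
        inputs = ids if past is None else ids[:, -1:]
        out = model(inputs, past_key_values=past, use_cache=True)
        past = out.past_key_values               # updated cache with this step's keys/values
        next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=1)

print(tokenizer.decode(ids[0]))
```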
4. 8-bit Quantization
- Store weights as 8-bit instead of 32-bit
- Result: 4x smaller model
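The core idea in a few lines of PyTorch: replace each float32 weight tensor with int8 values plus one scale factor (libraries like bitsandbytes do this per-channel and handle the matrix multiplies for you):

```python
import torch

def quantize_int8(w):
    """Symmetric per-tensor quantization: store int8 values plus one float scale."""
    scale = w.abs().max() / 127.0
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return q.float() * scale

w = torch.randn(384, 384)                 # a weight matrix about the size of one MiniGPT layer
q, scale = quantize_int8(w)

print("fp32 bytes:", w.numel() * 4, "| int8 bytes:", q.numel())
print("max rounding error:", (w - dequantize(q, scale)).abs().max().item())
```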
—
Resources
- Andrej Karpathy’s “Let’s build GPT”
- TinyStories dataset
- nanoGPT code
- WikiText for word embeddings
—
Learning Checklist
- [ ] Build BPE tokenizer from scratch
- [ ] Train Word2Vec embeddings
- [ ] Implement mini-GPT architecture
- [ ] Generate coherent text
- [ ] Apply vocabulary pruning
- [ ] Use gradient accumulation
- [ ] Implement KV-cache optimization
—
Next Steps
Module 6: Advanced Computer Vision
Build ResNets, object detectors, and vision transformers!
References & Further Reading
Dive deeper with these carefully selected resources:
- GPT-3 paper: “Language Models are Few-Shot Learners” (Brown et al.)
- “Let’s build GPT” by Andrej Karpathy
- Hugging Face NLP Course
Related Topics
- How GPT Models Generate Text
- Tokenization: BPE vs WordPiece
- Word Embeddings: Word2Vec and GloVe