Module 5: Building Your Own ChatGPT (Mini Version!)

Duration: Weeks 10-11
Difficulty: Advanced
Prerequisites: Module 4 completed

🎯 What You’ll Learn

Ever wondered how ChatGPT works? You’ll build a mini version that can write stories, complete sentences, and even chat (sort of).

📝 Part 1: Tokenization – Teaching Computers to Read

The Problem

Computers only understand numbers, but we want to work with words!

Solution: Tokenization (Breaking text into pieces)

Three approaches:
1. Word-level: “Hello world” → [“Hello”, “world”]
– Problem: Too many unique words (millions!)

2. Character-level: “Hello” → [“H”, “e”, “l”, “l”, “o”]
– Problem: Sequences get very long

3. Subword (BPE): “Hello” → [“Hel”, “lo”]
– Best of both worlds! (This is what GPT uses)

Project: Build a BPE Tokenizer

  • Start with a vocabulary of single characters
  • Find the most frequent adjacent pair of tokens
  • Merge that pair into a new token, and repeat
  • Code it from scratch in ~50 lines (a starter sketch follows below)!
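
Here is a minimal sketch of that merge loop, assuming a tiny toy corpus and a fixed number of merges (both are illustrative choices, not the module's reference solution):

```python
from collections import Counter

def get_pair_counts(corpus):
    """Count how often each adjacent pair of symbols appears across the corpus."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(corpus, pair):
    """Rewrite every word, replacing occurrences of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus: each word split into characters, mapped to how often it occurs
corpus = {tuple("hello"): 5, tuple("help"): 3, tuple("hero"): 2}

merges = []
for _ in range(10):                      # number of merges controls the final vocab size
    pairs = get_pair_counts(corpus)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)     # most frequent adjacent pair
    corpus = merge_pair(corpus, best)
    merges.append(best)

print(merges)   # learned merge rules, e.g. ('h', 'e') then ('he', 'l') ...
```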

💬 Part 2: Word Embeddings – Word Meanings as Numbers

The Idea

Words with similar meanings should have similar numbers!

King - Man + Woman ≈ Queen
(In the learned vector space, this arithmetic approximately holds!)
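
A toy illustration of that arithmetic, using invented 3-dimensional vectors (real embeddings are learned and have hundreds of dimensions; these numbers exist only to show the idea):

```python
import numpy as np

# Hand-made toy vectors; the dimensions loosely encode (royalty, maleness, person-ness)
vectors = {
    "king":  np.array([0.9, 0.9, 0.8]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.9]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

target = vectors["king"] - vectors["man"] + vectors["woman"]
closest = max(vectors, key=lambda word: cosine(vectors[word], target))
print(closest)   # "queen" is the nearest word to king - man + woman
```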

Word2Vec – Learning Word Meanings

  • Idea: Words that appear together tend to have related meanings
  • Example: “cat” often appears near “meow”, “fur”, and “pet”
  • Algorithm: Train the model to predict each word’s surrounding words (see the pair-building sketch below)
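
For example, with a window of 2 words on each side, the skip-gram variant of Word2Vec turns a sentence into (word, neighbour) training pairs like this; the sentence and window size are just illustrative:

```python
sentence = "the cat chased the mouse".split()
window = 2   # how many neighbours on each side count as context

pairs = []
for i, center in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((center, sentence[j]))

print(pairs[:4])
# [('the', 'cat'), ('the', 'chased'), ('cat', 'the'), ('cat', 'chased')]
```

The model is then trained so that, given the first word of each pair, it assigns high probability to the second.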

Project: Train Word2Vec

  • Dataset: Wikipedia articles (10MB subset)
  • Training: ~30 minutes
  • Demo: Find similar words (a training sketch follows below)
    – Input: “king” → Output: “queen”, “prince”, “monarch”
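
One way to run this experiment is with the gensim library, sketched below. `load_wiki_subset()` is a hypothetical placeholder for however you load and tokenize the Wikipedia subset, and the hyperparameters are reasonable defaults rather than prescribed settings:

```python
from gensim.models import Word2Vec

# `sentences` must be a list of tokenized sentences, e.g.
# [["the", "king", "ruled", ...], ["cats", "purr", ...], ...]
sentences = load_wiki_subset()   # hypothetical helper (replace with your own loader)

model = Word2Vec(
    sentences,
    vector_size=100,   # embedding dimension
    window=5,          # context words on each side
    min_count=5,       # drop very rare words
    sg=1,              # 1 = skip-gram, 0 = CBOW
    workers=4,
)

print(model.wv.most_similar("king", topn=3))
# expected to return words like 'queen', 'prince', 'monarch' (as in the demo above)
```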

🤖 Part 3: Language Models – Predicting Next Words

What is a Language Model?

A language model assigns a probability to each possible next word, given the words so far.

Given: “The cat sat on the”
Predict: “mat” (or “floor”, “chair”, etc.)

How GPT Works (Simplified)

1. Convert words to numbers (tokenization + embeddings)
2. Use a transformer to understand the context
3. Predict a probability for each possible next word
4. Pick one: either the most likely word, or a random sample from those probabilities (see the generation sketch below)
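
A sketch of that loop in PyTorch, assuming `model` maps a batch of token ids to next-token logits of shape (batch, length, vocab_size); the temperature knob and the choice to sample are illustrative:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, token_ids, max_new_tokens=20, temperature=1.0):
    """Repeatedly predict the next token and append it to the running sequence."""
    for _ in range(max_new_tokens):
        logits = model(token_ids)                            # (1, seq_len, vocab_size)
        next_logits = logits[:, -1, :] / temperature
        probs = F.softmax(next_logits, dim=-1)               # step 3: a probability for every token
        next_id = torch.multinomial(probs, num_samples=1)    # step 4: sample (or argmax for greedy)
        token_ids = torch.cat([token_ids, next_id], dim=1)   # feed it back in and repeat
    return token_ids
```

Sampling rather than always taking the argmax keeps the output varied; pure greedy decoding tends to loop and repeat itself.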

Project: Build Mini-GPT

  • Architecture: Tiny GPT (10-20M parameters)
  • Dataset: TinyStories (simple children’s stories)
  • Training: 3-4 hours on CPU
  • Result: Generates short, coherent stories!

Example output:

Prompt: "Once upon a time, there was a"
Generated: "little girl named Lily. She loved to play..."
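
One way to reach that size range is to lean on PyTorch's built-in transformer layers with a causal mask, as in the sketch below. The layer sizes are illustrative guesses that land near the 10-20M-parameter target; the module's reference implementation may build the attention blocks by hand instead:

```python
import torch
import torch.nn as nn

class TinyGPT(nn.Module):
    """A small GPT-style decoder-only model; sizes chosen to land near 10-20M parameters."""

    def __init__(self, vocab_size=5000, d_model=384, n_heads=6, n_layers=6, max_len=256):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)     # token embeddings
        self.pos_emb = nn.Embedding(max_len, d_model)        # learned positional embeddings
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True,
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, idx):                                   # idx: (batch, seq_len) token ids
        T = idx.size(1)
        pos = torch.arange(T, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)             # (batch, seq_len, d_model)
        causal = nn.Transformer.generate_square_subsequent_mask(T).to(idx.device)
        x = self.blocks(x, mask=causal)                       # each token attends only to earlier ones
        return self.head(self.ln_f(x))                        # (batch, seq_len, vocab_size) logits

model = TinyGPT()
print(sum(p.numel() for p in model.parameters()) / 1e6, "M parameters")
```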

⚡ ISL Optimization – Making It Fit in 16GB

1. Vocabulary Pruning

  • Full GPT: ~50,000 tokens
  • Your model: 5,000 tokens (only the most common words; see the pruning sketch below)
  • Result: ~10x smaller embedding and output layers!
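
A sketch of the pruning step, assuming `corpus_tokens` is the flat list of tokens from your training text and that out-of-vocabulary tokens map to an <unk> id (both conventions are illustrative):

```python
from collections import Counter

def build_pruned_vocab(corpus_tokens, vocab_size=5000):
    """Keep only the most frequent tokens; everything else maps to <unk>."""
    counts = Counter(corpus_tokens)
    vocab = {"<unk>": 0}
    for token, _ in counts.most_common(vocab_size - 1):   # reserve one slot for <unk>
        vocab[token] = len(vocab)
    return vocab

def encode(tokens, vocab):
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]

# Example with a toy "corpus"
vocab = build_pruned_vocab("the cat sat on the mat the end".split(), vocab_size=5)
print(encode("the dog sat".split(), vocab))   # unknown words become 0 (<unk>)
```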

2. Gradient Accumulation

  • Problem: You want a batch size of 64, but only 8 examples fit in RAM
  • Solution: Process 8 at a time and accumulate the gradients (see the sketch below)
  • Result: Weight updates equivalent to batch size 64!
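
A sketch of that loop in PyTorch; `model`, `optimizer`, and `train_loader` stand in for whatever you built earlier, and the loader is assumed to yield micro-batches of 8 (input, target) tensors:

```python
import torch.nn.functional as F

def train_with_accumulation(model, optimizer, train_loader, accum_steps=8):
    """Micro-batches of 8, weight updates as if the batch size were 8 x 8 = 64."""
    optimizer.zero_grad()
    for step, (x, y) in enumerate(train_loader):        # x, y: (8, seq_len) token ids
        logits = model(x)                                # (8, seq_len, vocab_size)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
        (loss / accum_steps).backward()                  # scale so the 8 micro-batches average out
        if (step + 1) % accum_steps == 0:
            optimizer.step()                             # one update per 64 examples
            optimizer.zero_grad()
```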

3. KV-Cache (For Fast Generation)

  • Don’t recalculate attention for tokens you have already processed
  • Cache their “key” and “value” matrices instead (a single-head sketch follows below)
  • Result: Much faster text generation, often around 10x!
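
A single-head sketch of the idea, assuming tokens are generated one at a time; a real implementation keeps a cache like this per layer and per head, and also handles batching:

```python
import torch
import torch.nn.functional as F

class CachedSelfAttention(torch.nn.Module):
    """Single-head attention that remembers K and V for tokens it has already seen."""

    def __init__(self, d_model=384):
        super().__init__()
        self.q_proj = torch.nn.Linear(d_model, d_model)
        self.k_proj = torch.nn.Linear(d_model, d_model)
        self.v_proj = torch.nn.Linear(d_model, d_model)
        self.cache_k = None   # (1, tokens_so_far, d_model)
        self.cache_v = None

    def forward(self, x_new):                        # x_new: (1, 1, d_model), just the newest token
        q = self.q_proj(x_new)
        k_new, v_new = self.k_proj(x_new), self.v_proj(x_new)
        if self.cache_k is None:                     # first token: start the cache
            self.cache_k, self.cache_v = k_new, v_new
        else:                                        # later tokens: append instead of recomputing
            self.cache_k = torch.cat([self.cache_k, k_new], dim=1)
            self.cache_v = torch.cat([self.cache_v, v_new], dim=1)
        scores = q @ self.cache_k.transpose(-2, -1) / self.cache_k.size(-1) ** 0.5
        return F.softmax(scores, dim=-1) @ self.cache_v    # (1, 1, d_model)
```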

4. 8-bit Quantization

  • Store weights as 8-bit integers instead of 32-bit floats (a sketch follows below)
  • Result: ~4x smaller model
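
A minimal sketch of symmetric per-tensor quantization; production libraries add per-channel or block-wise scales and calibration, but the memory arithmetic is the same:

```python
import torch

def quantize_8bit(weight):
    """Symmetric per-tensor quantization: float32 tensor -> int8 tensor plus one float scale."""
    scale = weight.abs().max() / 127.0
    q = torch.clamp(torch.round(weight / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return q.to(torch.float32) * scale               # approximate weights for computation

w = torch.randn(384, 384)                            # a fake layer's weight matrix
q, scale = quantize_8bit(w)
print(w.element_size(), "->", q.element_size(), "bytes per weight")   # 4 -> 1 (4x smaller)
print("max rounding error:", (w - dequantize(q, scale)).abs().max().item())
```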

✅ Learning Checklist

  • [ ] Build BPE tokenizer from scratch
  • [ ] Train Word2Vec embeddings
  • [ ] Implement mini-GPT architecture
  • [ ] Generate coherent text
  • [ ] Apply vocabulary pruning
  • [ ] Use gradient accumulation
  • [ ] Implement KV-cache optimization

🚀 Next Steps

Module 6: Advanced Computer Vision

Build ResNets, object detectors, and vision transformers!

📚 References & Further Reading

Dive deeper with these carefully selected resources:

📝 Related Topics

  • How GPT Models Generate Text
  • Tokenization: BPE vs WordPiece
  • Word Embeddings: Word2Vec and GloVe