Module 6: Advanced Computer Vision – See Like AI

Duration: Weeks 12-13
Difficulty: Advanced
Prerequisites: Module 4 completed

🎯 What You’ll Learn

Build AI that can identify objects, detect faces, and understand images like self-driving cars do!

πŸ—οΈ Part 1: ResNet – Going Deeper Without Breaking

The Problem

Deeper networks should be better, but beyond a certain depth they often perform WORSE!
Why? Gradients vanish as they flow backward through many layers, becoming too small for the early layers to learn.

ResNet’s Solution: Skip Connections

Regular: Input → Layer1 → Layer2 → Layer3 → Output
ResNet:  Input → Layer1 → Layer2 → Layer3 → Output
                  ↓_____________↑ (skip connection!)

Analogy: Like having shortcuts in a maze

  • Information can flow directly
  • Easier to learn
  • Can build 50-100 layer networks!
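
A toy NumPy sketch (not a real ResNet layer) of why the shortcut helps: even in the worst case where the learned layers contribute nothing, the input still flows through unchanged.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def plain_block(x, w1, w2):
    # Regular block: the signal must pass through every layer.
    return relu(w2 @ relu(w1 @ x))

def residual_block(x, w1, w2):
    # ResNet block: the input is added back onto the layers' output,
    # giving information (and gradients) a direct shortcut path.
    return relu(w2 @ relu(w1 @ x)) + x

x = np.random.default_rng(0).standard_normal(8)
zeros = np.zeros((8, 8))

# Worst case: the layers learn nothing (all-zero weights).
# The plain block outputs zeros, but the residual block still returns x.
print(np.allclose(residual_block(x, zeros, zeros), x))  # → True
```

This is also why gradients survive: the "+ x" path contributes a clean identity term to the gradient, no matter how badly scaled the layer weights are.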

Project: Build ResNet-18

  • 18 layers deep
  • Dataset: CIFAR-10 or your own photos
  • Goal: 80%+ accuracy
  • Training: 1-2 hours

🎨 Part 2: Data Augmentation – Getting More from Less

The Problem

Need 1000s of images, but only have 100?

Solution: Transform existing images!

Transformations:
1. Flip: Mirror image horizontally
2. Rotate: Turn 10-15 degrees
3. Crop: Random sections
4. Color: Adjust brightness, contrast
5. Noise: Add small random changes

Result: 100 images → 1000s of variations!
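
Four of the five transformations can be sketched in plain NumPy (rotation is omitted here because it needs interpolation; in practice a library such as Albumentations handles all five):

```python
import numpy as np

rng = np.random.default_rng(42)

def hflip(img):
    # 1. Flip: mirror the image horizontally (reverse the width axis).
    return img[:, ::-1]

def random_crop(img, size):
    # 3. Crop: keep a random square section of the image.
    h, w = img.shape[:2]
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    return img[top:top + size, left:left + size]

def adjust_brightness(img, factor):
    # 4. Color: scale pixel values, clipping to the valid 0-255 range.
    return np.clip(img.astype(np.float32) * factor, 0, 255).astype(img.dtype)

def add_noise(img, std=5.0):
    # 5. Noise: add small Gaussian perturbations to each pixel.
    noisy = img.astype(np.float32) + rng.normal(0.0, std, img.shape)
    return np.clip(noisy, 0, 255).astype(img.dtype)

img = rng.integers(0, 256, (32, 32, 3)).astype(np.uint8)
variants = [hflip(img), random_crop(img, 24),
            adjust_brightness(img, 1.2), add_noise(img)]
```

Each function returns a new image, so chaining them (flip then crop then noise) multiplies the number of distinct variations.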

Project: Augmentation Pipeline

  • Use Albumentations library
  • Create 10 variations of each image
  • Compare: with vs. without augmentation
  • Typical result: 10-20% better accuracy!

🎯 Part 3: Object Detection – Finding Things in Images

Classification vs Detection

  • Classification: “This is a cat” ✓
  • Detection: “There’s a cat at position (100, 200)” ✓✓

How YOLO Works (You Only Look Once)

1. Divide image into grid (e.g., 7×7)
2. Each cell predicts:
   – What object? (cat, dog, car)
   – Where is it? (x, y, width, height)
   – How confident? (0-100%)
3. Combine predictions
4. Remove duplicates

Speed: 30+ images per second (real-time!)
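
Step 4 (remove duplicates) is called non-maximum suppression. A toy sketch using intersection-over-union (IoU) as the overlap measure; the boxes and threshold below are purely illustrative:

```python
def iou(a, b):
    # Boxes as (x1, y1, x2, y2); IoU = overlap area / union area.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    # Keep the highest-scoring box, then drop every remaining box
    # that overlaps it too much; repeat until no boxes are left.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep

# Two near-duplicate cat detections plus one separate dog detection.
boxes = [(100, 200, 180, 260), (105, 205, 185, 265), (300, 50, 360, 110)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # → [0, 2]: the weaker duplicate is suppressed
```

Real YOLO implementations do the same thing per class, on the GPU, but the logic is this loop.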

Project: Simple Object Detector

  • Simplified YOLO
  • Dataset: COCO subset
  • Task: Draw boxes around objects

πŸ‘οΈ Part 4: Vision Transformers – Attention for Images

New Idea

Remember transformers from Module 4? Use them for images!

How:
1. Split image into patches (16×16 pixels each)
2. Treat each patch like a “word”
3. Use transformer attention
4. Predict image class
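
Step 1 is just array reshaping. A NumPy sketch using the standard ViT setup (224×224 image, 16×16 patches); the image contents here are dummy values:

```python
import numpy as np

def image_to_patches(img, patch=16):
    # Split an (H, W, C) image into non-overlapping patch x patch tiles,
    # then flatten each tile into one vector -- the transformer's "word".
    h, w, c = img.shape
    assert h % patch == 0 and w % patch == 0
    tiles = img.reshape(h // patch, patch, w // patch, patch, c)
    tiles = tiles.transpose(0, 2, 1, 3, 4)       # (rows, cols, patch, patch, C)
    return tiles.reshape(-1, patch * patch * c)  # (num_patches, patch_dim)

img = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
patches = image_to_patches(img)
print(patches.shape)  # → (196, 768): a "sentence" of 196 patch-words
```

Each of the 196 vectors then gets a linear projection plus a position embedding before entering the transformer, exactly as word embeddings do in Module 4.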

Project: Mini Vision Transformer

  • Small ViT (10M parameters)
  • Dataset: CIFAR-10
  • Training: 2-3 hours
  • Compare: CNN vs Transformer

⚡ ISL Optimization

1. MobileNet Architecture

  • Regular convolution: expensive
  • Depthwise separable convolution: ~9x fewer multiplications, nearly the same accuracy!

2. Progressive Resizing

Start: 64x64 images (fast!)
Middle: 128x128 images
End: 256x256 images (best quality)

Result: 2x faster training!
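
A minimal sketch of the schedule as a function of epoch, assuming a 30-epoch run split into thirds (the split points are a choice, not a rule):

```python
def image_size_for_epoch(epoch, total_epochs=30):
    # Progressive resizing: small images early (fast epochs),
    # full resolution only for the final third of training.
    if epoch < total_epochs // 3:
        return 64
    if epoch < 2 * total_epochs // 3:
        return 128
    return 256

sizes = [image_size_for_epoch(e) for e in range(30)]
print(sizes[0], sizes[15], sizes[29])  # → 64 128 256
```

In a real training loop you would feed this size into your resize transform at the start of each epoch.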

3. INT8 Quantization for Inference

  • After training, convert to 8-bit
  • Result: 4x faster predictions
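
A toy sketch of the core idea (symmetric per-tensor quantization) in NumPy; real toolchains such as PyTorch quantization or TensorRT also calibrate activations, but the weight math looks like this:

```python
import numpy as np

def quantize_int8(w):
    # Map float weights onto the int8 range [-127, 127] with one shared scale.
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover approximate float weights for comparison.
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).standard_normal(1000).astype(np.float32)
q, scale = quantize_int8(w)

# int8 storage is 4x smaller than float32, and int8 math runs much faster
# on hardware with INT8 support; the rounding error stays below scale/2.
print(q.nbytes, w.nbytes)  # → 1000 4000
print(bool(np.abs(dequantize(q, scale) - w).max() <= scale / 2 + 1e-6))  # → True
```

The 4x size reduction is exact (8 bits vs 32 bits); the speedup depends on the hardware's integer throughput.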

4. ONNX Runtime

  • Export model to ONNX format
  • Use optimized runtime
  • Result: 2-3x faster inference on CPU

📚 Resources

  • CIFAR-10, ImageNet subsets
  • COCO dataset (object detection)
  • Albumentations library
  • OpenCV
  • Roboflow (create custom datasets)

✅ Learning Checklist

  • [ ] Build ResNet with skip connections
  • [ ] Create data augmentation pipeline
  • [ ] Implement object detector
  • [ ] Build vision transformer
  • [ ] Apply MobileNet optimizations
  • [ ] Use progressive resizing
  • [ ] Deploy with ONNX Runtime

🚀 Next Steps

Module 7: Training Tricks

Learn professional techniques to train faster and better!

πŸ“ Related Topics

  • β†’
    ResNets: Skip Connections Explained
  • β†’
    Object Detection: YOLO vs R-CNN
  • β†’
    Vision Transformers vs CNNs