ISL Optimization Guide
Inverse Scaling Law: As modular capability (C) increases, existential cost (T) decreases.
This guide shows you how to maximize your AI models’ capabilities while minimizing resource usage.
---
🧠 Memory Management
1. Profiling Memory Usage
```python
import psutil
from memory_profiler import profile

@profile
def train_model():
    # Your training code
    pass

# Or use psutil for real-time monitoring
process = psutil.Process()
print(f"RAM usage: {process.memory_info().rss / 1024**3:.2f} GB")
```
2. Gradient Checkpointing
Problem: Deep networks store all intermediate activations → lots of memory
Solution: Recompute activations during backward pass
```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class MyModel(nn.Module):
    def forward(self, x):
        # Checkpoint the expensive layer: its activations are recomputed
        # during the backward pass instead of being stored
        x = checkpoint(self.expensive_layer, x)
        return x
```
Result: up to ~10x less activation memory, typically ~30% slower (worth it!)
3. Memory-Mapped Files
Problem: Dataset larger than RAM
Solution: Load data from disk as needed
```python
import numpy as np

# Create a memory-mapped array backed by a file on disk
data = np.memmap('huge_data.npy', dtype='float32', mode='r', shape=(1000000, 784))

# Access it like a normal array; only the slice you touch is read into RAM
batch = data[0:32]
```
4. Garbage Collection
```python
import gc
import torch

# After a training epoch, drop references to large tensors
del loss, outputs
torch.cuda.empty_cache()  # If using GPU
gc.collect()
```
---
⚡ Computation Optimization
1. Vectorization with NumPy
Slow (Python loops)
result = []
for i in range(1000000):
result.append(i * 2)Fast (NumPy vectorization)
result = np.arange(1000000) * 2
Speed difference: 10-100x faster!
2. JIT Compilation with Numba
from numba import jit@jit(nopython=True)
def fast_function(x):
total = 0
for i in range(len(x)):
total += x[i] 2
return total
First call: Compiles (slow)
Subsequent calls: 10-100x faster!
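To see the compile-then-fast behaviour on your own machine, here is a minimal timing sketch that reuses `fast_function` from above (the 1,000,000-element random array is an arbitrary choice):

```python
import time
import numpy as np

x = np.random.rand(1_000_000)

start = time.perf_counter()
fast_function(x)  # First call: pays the compilation cost
print(f"First call:  {time.perf_counter() - start:.3f} s")

start = time.perf_counter()
fast_function(x)  # Later calls: run the already-compiled machine code
print(f"Second call: {time.perf_counter() - start:.3f} s")
```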
3. CPU Optimization
Intel MKL (Math Kernel Library):
```bash
# Install
conda install mkl mkl-service

# Verify which BLAS/LAPACK backend NumPy is linked against
python -c "import numpy as np; np.show_config()"
```
OpenBLAS:
NumPy wheels installed with pip are typically linked against OpenBLAS already; the same np.show_config() check shows which backend is active.
Result: an optimized BLAS backend often makes CPU linear algebra 2-3x faster!
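As a rough sanity check that an optimized BLAS backend is actually being used, here is a minimal timing sketch of a large matrix multiply (the 2000x2000 size and 10 repetitions are arbitrary choices, not a standard benchmark):

```python
import time
import numpy as np

# Large matmuls are dispatched to the BLAS backend (MKL/OpenBLAS)
a = np.random.rand(2000, 2000).astype(np.float32)
b = np.random.rand(2000, 2000).astype(np.float32)

start = time.perf_counter()
for _ in range(10):
    a @ b
print(f"10 matmuls of 2000x2000: {time.perf_counter() - start:.2f} s")
```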
---
📊 Data Optimization
1. Streaming Data
```python
import numpy as np

def data_generator(filename, batch_size=32):
    with open(filename) as f:
        batch = []
        for line in f:
            batch.append(process(line))  # process() is your own per-line preprocessing
            if len(batch) == batch_size:
                yield np.array(batch)
                batch = []

# Use in training
for batch in data_generator('huge_file.txt'):
    model.train_on_batch(batch)
```
2. Efficient Data Loading
```python
from torch.utils.data import DataLoader

# Use multiple workers for parallel loading
dataloader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=4,      # Load data in background worker processes
    pin_memory=True,    # Faster host-to-GPU transfer
    prefetch_factor=2,  # Batches prefetched per worker
)
```
3. Data Caching
```python
from functools import lru_cache

@lru_cache(maxsize=1000)
def load_and_process_image(path):
    img = load_image(path)
    return preprocess(img)
```
4. Sparse Matrices
```python
import numpy as np
from scipy.sparse import csr_matrix

# Dense: [0, 0, 5, 0, 0, 3, 0, 0]
dense = np.array([0, 0, 5, 0, 0, 3, 0, 0])
# Sparse: only the non-zero values (plus their indices) are stored
sparse = csr_matrix(dense)

print(f"Dense size: {dense.nbytes} bytes")
print(f"Sparse size: {sparse.data.nbytes + sparse.indices.nbytes} bytes")
```
---
🤖 Model Optimization
1. Quantization
Dynamic Quantization (Easiest):
```python
import torch

model = YourModel()
quantized_model = torch.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},  # Layer types to quantize
    dtype=torch.qint8,
)
```
Result: ~4x smaller weights, often 2-3x faster on CPU!
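To check the size reduction on your own model, here is a small sketch; the `model_size_mb` helper and the temporary file name are illustrative, not part of PyTorch:

```python
import os
import torch

def model_size_mb(m, path="tmp_model.pt"):
    # Serialize the state dict and measure the resulting file size
    torch.save(m.state_dict(), path)
    size = os.path.getsize(path) / 1e6
    os.remove(path)
    return size

print(f"FP32 model: {model_size_mb(model):.1f} MB")
print(f"INT8 model: {model_size_mb(quantized_model):.1f} MB")
```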
Static Quantization (Better):
```python
model.qconfig = torch.quantization.get_default_qconfig('fbgemm')
torch.quantization.prepare(model, inplace=True)

# Calibrate with representative data
for data in calibration_data:
    model(data)

torch.quantization.convert(model, inplace=True)
```
2. Pruning
```python
import torch.nn.utils.prune as prune

# Prune 30% of the weights in a linear layer (smallest L1 magnitude first)
prune.l1_unstructured(model.linear, name='weight', amount=0.3)

# Make the pruning permanent (remove the reparametrization)
prune.remove(model.linear, 'weight')
```
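A quick hedged check that roughly the expected fraction of weights was zeroed out, reusing `model.linear` from the snippet above:

```python
import torch

# Fraction of weights that are exactly zero after pruning
weight = model.linear.weight
sparsity = (weight == 0).float().mean().item()
print(f"Sparsity of model.linear.weight: {sparsity:.0%}")  # Expect about 30%
```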
3. Knowledge Distillation
```python
import torch.nn as nn
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets from the teacher, scaled by T^2 to keep gradient magnitudes comparable
    soft_loss = nn.KLDivLoss(reduction='batchmean')(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
    ) * (T * T)
    # Hard targets (actual labels)
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss
```
4. Architecture Search
Depthwise Separable Convolutions (MobileNet):
```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        # Depthwise: convolve each input channel separately
        self.depthwise = nn.Conv2d(in_channels, in_channels,
                                   kernel_size=3, groups=in_channels)
        # Pointwise: 1x1 conv to combine channels
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):
        x = self.depthwise(x)
        x = self.pointwise(x)
        return x
```
Result: roughly 8-9x fewer parameters than a standard 3x3 convolution, with comparable accuracy!
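To verify the savings, here is a short sketch comparing a standard 3x3 convolution with the block above; the 256-channel sizes are an arbitrary illustration, and the ratio works out to roughly 8-9x:

```python
import torch.nn as nn

standard = nn.Conv2d(256, 256, kernel_size=3, padding=1)
separable = DepthwiseSeparableConv(256, 256)

def count_params(m):
    return sum(p.numel() for p in m.parameters())

print(f"Standard 3x3 conv:   {count_params(standard):,} params")   # ~590K
print(f"Depthwise separable: {count_params(separable):,} params")  # ~68K
```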
---
🏋️ Training Optimization
1. Mixed Precision Training
```python
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

for data, target in dataloader:
    optimizer.zero_grad()

    # Forward pass in mixed precision (FP16 where it is safe)
    with autocast():
        output = model(data)
        loss = criterion(output, target)

    # Backward pass with gradient scaling to avoid FP16 underflow
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```
Result: often up to 2x faster and roughly half the memory on GPUs with Tensor Cores!
2. Gradient Accumulation
```python
accumulation_steps = 4

for i, (data, target) in enumerate(dataloader):
    output = model(data)
    loss = criterion(output, target)
    loss = loss / accumulation_steps  # Normalize so gradients average over the effective batch
    loss.backward()

    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```
Result: effective batch size = batch_size × accumulation_steps (e.g., 32 × 4 = 128)
3. Learning Rate Scheduling
```python
from torch.optim.lr_scheduler import CosineAnnealingLR

optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
scheduler = CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):
    train(...)
    scheduler.step()
```
4. Automatic Batch Size Finder
```python
import torch

def find_max_batch_size(model, input_shape, max_batch=1024):
    batch_size = 1
    while batch_size <= max_batch:
        try:
            test_input = torch.randn(batch_size, *input_shape)
            output = model(test_input)
            loss = output.sum()
            loss.backward()
            batch_size *= 2
        except RuntimeError:  # Typically an out-of-memory error
            return batch_size // 2
    return max_batch
```
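A hedged usage sketch; the (1, 28, 28) input shape is just an illustrative assumption for 28x28 grayscale inputs:

```python
max_bs = find_max_batch_size(model, input_shape=(1, 28, 28))
print(f"Largest batch size that fit in memory: {max_bs}")
```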
---
🚀 Inference Optimization
1. ONNX Runtime
```python
import torch
import onnxruntime as ort

# Export to ONNX (name the input so it can be referenced at inference time)
torch.onnx.export(model, dummy_input, "model.onnx", input_names=["input"])

# Load with ONNX Runtime
session = ort.InferenceSession("model.onnx")

# Inference (often 2-3x faster than eager PyTorch on CPU)
outputs = session.run(None, {"input": input_data})
```
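It is worth confirming that the exported graph reproduces the original model's outputs. A minimal sketch reusing `model`, `dummy_input`, and `session` from above (the 1e-4 tolerance is an arbitrary choice):

```python
import numpy as np
import torch

with torch.no_grad():
    torch_out = model(dummy_input).numpy()

ort_out = session.run(None, {"input": dummy_input.numpy()})[0]
print("Outputs match:", np.allclose(torch_out, ort_out, atol=1e-4))
```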
2. Batch Inference
```python
# Slow: process one image at a time
for img in images:
    result = model(img)

# Fast: process in batches
batch_size = 32
for i in range(0, len(images), batch_size):
    batch = images[i:i+batch_size]
    results = model(batch)
```
3. Model Caching
```python
from functools import lru_cache

@lru_cache(maxsize=1000)
def predict_cached(image_hash):
    return model(load_image(image_hash))
```
4. CPU-Specific Optimizations
```python
import torch

# Control CPU threading and flush denormal numbers
torch.set_num_threads(4)        # Use 4 CPU cores
torch.set_flush_denormal(True)  # Avoid slow denormal-number arithmetic

# For inference only, disable gradient tracking
with torch.no_grad():
    output = model(input)
```
---
📊 Benchmarking
```python
import time
import torch

def benchmark(model, input_shape, num_runs=100):
    model.eval()
    input_data = torch.randn(input_shape)

    # Warmup (fills caches, triggers any lazy initialization)
    with torch.no_grad():
        for _ in range(10):
            _ = model(input_data)

    # Benchmark
    start = time.time()
    with torch.no_grad():
        for _ in range(num_runs):
            _ = model(input_data)
    end = time.time()

    avg_time = (end - start) / num_runs
    print(f"Average inference time: {avg_time*1000:.2f} ms")

    # Parameter count as a rough proxy for model size
    print(f"Model size: {sum(p.numel() for p in model.parameters())/1e6:.2f}M params")
```
---
🎯 ISL Optimization Checklist
For every project, aim to apply:
Memory:
- [ ] Gradient checkpointing for deep models
- [ ] Memory profiling to find leaks
- [ ] Streaming for large datasets
- [ ] Sparse matrices where applicable
Computation:
- [ ] Vectorization (NumPy)
- [ ] CPU optimization (MKL/OpenBLAS)
- [ ] JIT compilation (Numba) for bottlenecks
Data:
- [ ] Efficient data loading (num_workers)
- [ ] Caching frequently used data
- [ ] Data augmentation instead of more data
Model:
- [ ] Quantization (INT8)
- [ ] Pruning (remove 30-50% weights)
- [ ] Efficient architectures (MobileNet, etc.)
Training:
- [ ] Mixed precision (FP16)
- [ ] Gradient accumulation
- [ ] Learning rate scheduling
Inference:
- [ ] ONNX export
- [ ] Batch inference
- [ ] Model caching
---
Goal: Build models that are 10x more efficient while maintaining 95%+ of the original performance!