ISL Optimization Guide
Inverse Scaling Law: As modular capability (C) increases, existential cost (T) decreases.
This guide shows you how to maximize your AI models’ capabilities while minimizing resource usage.
---
🧠 Memory Management
1. Profiling Memory Usage
```python
import psutil
from memory_profiler import profile

@profile
def train_model():
    # Your training code
    pass

# Or use psutil for real-time monitoring
process = psutil.Process()
print(f"RAM usage: {process.memory_info().rss / 1024**3:.2f} GB")
```
2. Gradient Checkpointing
Problem: Deep networks store all intermediate activations → lots of memory
Solution: Recompute activations during backward pass
```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class MyModel(nn.Module):
    def forward(self, x):
        # Checkpoint the expensive layer: its activations are recomputed
        # during the backward pass instead of being stored
        x = checkpoint(self.expensive_layer, x)
        return x
```
Result: up to ~10x less activation memory, typically ~30% slower (worth it!)
3. Memory-Mapped Files
Problem: Dataset larger than RAM
Solution: Load data from disk as needed
```python
import numpy as np

# Create a memory-mapped array backed by a file on disk
data = np.memmap('huge_data.npy', dtype='float32', mode='r', shape=(1000000, 784))

# Access it like a normal array; only the slice you touch is read into RAM
batch = data[0:32]
```
4. Garbage Collection
```python
import gc
import torch

# After a training epoch, drop references to large tensors
del loss, outputs
torch.cuda.empty_cache()  # If using GPU
gc.collect()
```
---
⚡ Computation Optimization
1. Vectorization with NumPy
Slow (Python loops)
result = []
for i in range(1000000):
result.append(i * 2)Fast (NumPy vectorization)
result = np.arange(1000000) * 2
Speed difference: 10-100x faster!
2. JIT Compilation with Numba
from numba import jit@jit(nopython=True)
def fast_function(x):
total = 0
for i in range(len(x)):
total += x[i] 2
return total
First call: Compiles (slow)
Subsequent calls: 10-100x faster!
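To see the compile-then-fast behaviour on your own machine, here is a minimal timing sketch that reuses `fast_function` from above (the 1,000,000-element random array is an arbitrary choice):

```python
import time
import numpy as np

x = np.random.rand(1_000_000)

start = time.perf_counter()
fast_function(x)  # First call: pays the compilation cost
print(f"First call:  {time.perf_counter() - start:.3f} s")

start = time.perf_counter()
fast_function(x)  # Later calls: run the already-compiled machine code
print(f"Second call: {time.perf_counter() - start:.3f} s")
```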
3. CPU Optimization
Intel MKL (Math Kernel Library):
```bash
# Install
conda install mkl mkl-service

# Verify which BLAS/LAPACK backend NumPy is linked against
python -c "import numpy as np; np.show_config()"
```
OpenBLAS:
NumPy wheels installed with pip are typically linked against OpenBLAS already; the same np.show_config() check shows which backend is active.
Result: an optimized BLAS backend often makes CPU linear algebra 2-3x faster!
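As a rough sanity check that an optimized BLAS backend is actually being used, here is a minimal timing sketch of a large matrix multiply (the 2000x2000 size and 10 repetitions are arbitrary choices, not a standard benchmark):

```python
import time
import numpy as np

# Large matmuls are dispatched to the BLAS backend (MKL/OpenBLAS)
a = np.random.rand(2000, 2000).astype(np.float32)
b = np.random.rand(2000, 2000).astype(np.float32)

start = time.perf_counter()
for _ in range(10):
    a @ b
print(f"10 matmuls of 2000x2000: {time.perf_counter() - start:.2f} s")
```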
---
📊 Data Optimization
1. Streaming Data
```python
import numpy as np

def data_generator(filename, batch_size=32):
    with open(filename) as f:
        batch = []
        for line in f:
            batch.append(process(line))  # process() is your own per-line preprocessing
            if len(batch) == batch_size:
                yield np.array(batch)
                batch = []

# Use in training
for batch in data_generator('huge_file.txt'):
    model.train_on_batch(batch)
```
2. Efficient Data Loading
```python
from torch.utils.data import DataLoader

# Use multiple workers for parallel loading
dataloader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=4,      # Load data in background worker processes
    pin_memory=True,    # Faster host-to-GPU transfer
    prefetch_factor=2,  # Batches prefetched per worker
)
```
3. Data Caching
```python
from functools import lru_cache

@lru_cache(maxsize=1000)
def load_and_process_image(path):
    img = load_image(path)
    return preprocess(img)
```
4. Sparse Matrices
```python
import numpy as np
from scipy.sparse import csr_matrix

# Dense: [0, 0, 5, 0, 0, 3, 0, 0]
dense = np.array([0, 0, 5, 0, 0, 3, 0, 0])
# Sparse: only the non-zero values (plus their indices) are stored
sparse = csr_matrix(dense)

print(f"Dense size: {dense.nbytes} bytes")
print(f"Sparse size: {sparse.data.nbytes + sparse.indices.nbytes} bytes")
```
---
🤖 Model Optimization
1. Quantization
Dynamic Quantization (Easiest):
```python
import torch

model = YourModel()
quantized_model = torch.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},  # Layer types to quantize
    dtype=torch.qint8,
)
```
Result: ~4x smaller weights, often 2-3x faster on CPU!
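To check the size reduction on your own model, here is a small sketch; the `model_size_mb` helper and the temporary file name are illustrative, not part of PyTorch:

```python
import os
import torch

def model_size_mb(m, path="tmp_model.pt"):
    # Serialize the state dict and measure the resulting file size
    torch.save(m.state_dict(), path)
    size = os.path.getsize(path) / 1e6
    os.remove(path)
    return size

print(f"FP32 model: {model_size_mb(model):.1f} MB")
print(f"INT8 model: {model_size_mb(quantized_model):.1f} MB")
```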
Static Quantization (Better):
```python
model.qconfig = torch.quantization.get_default_qconfig('fbgemm')
torch.quantization.prepare(model, inplace=True)

# Calibrate with representative data
for data in calibration_data:
    model(data)

torch.quantization.convert(model, inplace=True)
```
2. Pruning
```python
import torch.nn.utils.prune as prune

# Prune 30% of the weights in a linear layer (smallest L1 magnitude first)
prune.l1_unstructured(model.linear, name='weight', amount=0.3)

# Make the pruning permanent (remove the reparametrization)
prune.remove(model.linear, 'weight')
```
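A quick hedged check that roughly the expected fraction of weights was zeroed out, reusing `model.linear` from the snippet above:

```python
import torch

# Fraction of weights that are exactly zero after pruning
weight = model.linear.weight
sparsity = (weight == 0).float().mean().item()
print(f"Sparsity of model.linear.weight: {sparsity:.0%}")  # Expect about 30%
```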
3. Knowledge Distillation
```python
import torch.nn as nn
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets from the teacher, scaled by T^2 to keep gradient magnitudes comparable
    soft_loss = nn.KLDivLoss(reduction='batchmean')(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
    ) * (T * T)
    # Hard targets (actual labels)
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss
```
4. Architecture Search
Depthwise Separable Convolutions (MobileNet):
```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        # Depthwise: convolve each input channel separately
        self.depthwise = nn.Conv2d(in_channels, in_channels,
                                   kernel_size=3, groups=in_channels)
        # Pointwise: 1x1 conv to combine channels
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):
        x = self.depthwise(x)
        x = self.pointwise(x)
        return x
```
Result: roughly 8-9x fewer parameters than a standard 3x3 convolution, with comparable accuracy!
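To verify the savings, here is a short sketch comparing a standard 3x3 convolution with the block above; the 256-channel sizes are an arbitrary illustration, and the ratio works out to roughly 8-9x:

```python
import torch.nn as nn

standard = nn.Conv2d(256, 256, kernel_size=3, padding=1)
separable = DepthwiseSeparableConv(256, 256)

def count_params(m):
    return sum(p.numel() for p in m.parameters())

print(f"Standard 3x3 conv:   {count_params(standard):,} params")   # ~590K
print(f"Depthwise separable: {count_params(separable):,} params")  # ~68K
```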
---
🏋️ Training Optimization
1. Mixed Precision Training
```python
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

for data, target in dataloader:
    optimizer.zero_grad()

    # Forward pass in mixed precision (FP16 where it is safe)
    with autocast():
        output = model(data)
        loss = criterion(output, target)

    # Backward pass with gradient scaling to avoid FP16 underflow
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```
Result: often up to 2x faster and roughly half the memory on GPUs with Tensor Cores!
2. Gradient Accumulation
```python
accumulation_steps = 4

for i, (data, target) in enumerate(dataloader):
    output = model(data)
    loss = criterion(output, target)
    loss = loss / accumulation_steps  # Normalize so gradients average over the effective batch
    loss.backward()

    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```
Result: effective batch size = batch_size × accumulation_steps (e.g., 32 × 4 = 128)
3. Learning Rate Scheduling
```python
from torch.optim.lr_scheduler import CosineAnnealingLR

optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
scheduler = CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):
    train(...)
    scheduler.step()
```
4. Automatic Batch Size Finder
```python
import torch

def find_max_batch_size(model, input_shape, max_batch=1024):
    batch_size = 1
    while batch_size <= max_batch:
        try:
            test_input = torch.randn(batch_size, *input_shape)
            output = model(test_input)
            loss = output.sum()
            loss.backward()
            batch_size *= 2
        except RuntimeError:  # Typically an out-of-memory error
            return batch_size // 2
    return max_batch
```
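A hedged usage sketch; the (1, 28, 28) input shape is just an illustrative assumption for 28x28 grayscale inputs:

```python
max_bs = find_max_batch_size(model, input_shape=(1, 28, 28))
print(f"Largest batch size that fit in memory: {max_bs}")
```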
---
🚀 Inference Optimization
1. ONNX Runtime
```python
import torch
import onnxruntime as ort

# Export to ONNX (name the input so it can be referenced at inference time)
torch.onnx.export(model, dummy_input, "model.onnx", input_names=["input"])

# Load with ONNX Runtime
session = ort.InferenceSession("model.onnx")

# Inference (often 2-3x faster than eager PyTorch on CPU)
outputs = session.run(None, {"input": input_data})
```
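It is worth confirming that the exported graph reproduces the original model's outputs. A minimal sketch reusing `model`, `dummy_input`, and `session` from above (the 1e-4 tolerance is an arbitrary choice):

```python
import numpy as np
import torch

with torch.no_grad():
    torch_out = model(dummy_input).numpy()

ort_out = session.run(None, {"input": dummy_input.numpy()})[0]
print("Outputs match:", np.allclose(torch_out, ort_out, atol=1e-4))
```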
2. Batch Inference
```python
# Slow: process one image at a time
for img in images:
    result = model(img)

# Fast: process in batches
batch_size = 32
for i in range(0, len(images), batch_size):
    batch = images[i:i+batch_size]
    results = model(batch)
```
3. Model Caching
```python
from functools import lru_cache

@lru_cache(maxsize=1000)
def predict_cached(image_hash):
    return model(load_image(image_hash))
```
4. CPU-Specific Optimizations
```python
import torch

# Control CPU threading and flush denormal numbers
torch.set_num_threads(4)        # Use 4 CPU cores
torch.set_flush_denormal(True)  # Avoid slow denormal-number arithmetic

# For inference only, disable gradient tracking
with torch.no_grad():
    output = model(input)
```
---
📊 Benchmarking
```python
import time
import torch

def benchmark(model, input_shape, num_runs=100):
    model.eval()
    input_data = torch.randn(input_shape)

    # Warmup (fills caches, triggers any lazy initialization)
    with torch.no_grad():
        for _ in range(10):
            _ = model(input_data)

    # Benchmark
    start = time.time()
    with torch.no_grad():
        for _ in range(num_runs):
            _ = model(input_data)
    end = time.time()

    avg_time = (end - start) / num_runs
    print(f"Average inference time: {avg_time*1000:.2f} ms")

    # Parameter count as a rough proxy for model size
    print(f"Model size: {sum(p.numel() for p in model.parameters())/1e6:.2f}M params")
```
---
🎯 ISL Optimization Checklist
For every project, aim to apply:
Memory:
- [ ] Gradient checkpointing for deep models
- [ ] Memory profiling to find leaks
- [ ] Streaming for large datasets
- [ ] Sparse matrices where applicable
Computation:
- [ ] Vectorization (NumPy)
- [ ] CPU optimization (MKL/OpenBLAS)
- [ ] JIT compilation (Numba) for bottlenecks
Data:
- [ ] Efficient data loading (num_workers)
- [ ] Caching frequently used data
- [ ] Data augmentation instead of more data
Model:
- [ ] Quantization (INT8)
- [ ] Pruning (remove 30-50% weights)
- [ ] Efficient architectures (MobileNet, etc.)
Training:
- [ ] Mixed precision (FP16)
- [ ] Gradient accumulation
- [ ] Learning rate scheduling
Inference:
- [ ] ONNX export
- [ ] Batch inference
- [ ] Model caching
---
Goal: Build models that are 10x more efficient while maintaining 95%+ of the original performance!