ISL Optimization Guide

Inverse Scaling Law: As modular capability (C) increases, existential cost (T) decreases.

This guide shows you how to maximize your AI models’ capabilities while minimizing resource usage.

🧠 Memory Management

1. Profiling Memory Usage

from memory_profiler import profile
import psutil

@profile
def train_model():
    # Your training code
    pass

# Or use psutil for real-time monitoring
process = psutil.Process()
print(f"RAM usage: {process.memory_info().rss / 1024**3:.2f} GB")

2. Gradient Checkpointing

Problem: Deep networks store all intermediate activations → lots of memory

Solution: Recompute activations during backward pass

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class MyModel(nn.Module):
    def forward(self, x):
        # Checkpoint expensive layers: activations are recomputed during backward
        x = checkpoint(self.expensive_layer, x)
        return x

Result: up to ~10x less activation memory, ~30% slower (usually worth it!)

3. Memory-Mapped Files

Problem: Dataset larger than RAM

Solution: Load data from disk as needed

import numpy as np

# Create a memory-mapped array over a raw binary file
data = np.memmap('huge_data.dat', dtype='float32', mode='r', shape=(1000000, 784))

# For arrays saved with np.save, use np.load('huge_data.npy', mmap_mode='r') instead

# Access like a normal array; only the slices you touch are read from disk
batch = data[0:32]

4. Garbage Collection

import gc
import torch

# After a training epoch, drop references and free cached memory
del loss, outputs
torch.cuda.empty_cache()  # If using GPU
gc.collect()

⚡ Computation Optimization

1. Vectorization with NumPy

import numpy as np

# Slow (Python loop)
result = []
for i in range(1000000):
    result.append(i * 2)

# Fast (NumPy vectorization)
result = np.arange(1000000) * 2

Speed difference: 10-100x faster!

2. JIT Compilation with Numba

from numba import jit

@jit(nopython=True)
def fast_function(x):
    total = 0
    for i in range(len(x)):
        total += x[i] ** 2
    return total

First call: Compiles (slow)
Subsequent calls: 10-100x faster!
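
To see the compile-then-cache behaviour, time the same call twice. This is a minimal sketch reusing fast_function and the NumPy import from above; exact timings depend on your machine.

import time

x = np.random.rand(1_000_000)

start = time.perf_counter()
fast_function(x)   # First call triggers JIT compilation
print(f"First call:  {time.perf_counter() - start:.3f}s")

start = time.perf_counter()
fast_function(x)   # Second call runs the cached machine code
print(f"Second call: {time.perf_counter() - start:.4f}s")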

3. CPU Optimization

Intel MKL (Math Kernel Library):

# Install
conda install mkl mkl-service

# Verify which BLAS backend NumPy is linked against
python -c "import numpy as np; np.show_config()"

OpenBLAS:

NumPy wheels from PyPI are already built against OpenBLAS, so no separate install is usually needed; confirm which backend is in use with np.show_config() as above.

Result: 2-3x faster linear algebra on CPU!
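
Both MKL and OpenBLAS honour the standard thread-count environment variables. One way to pin them is from Python, before NumPy (and its BLAS) is first imported; a sketch where the value 4 is purely illustrative:

import os

# Must be set before NumPy is imported
os.environ["OMP_NUM_THREADS"] = "4"        # Generic OpenMP cap
os.environ["MKL_NUM_THREADS"] = "4"        # MKL builds
os.environ["OPENBLAS_NUM_THREADS"] = "4"   # OpenBLAS builds

import numpy as np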

📊 Data Optimization

1. Streaming Data

def data_generator(filename, batch_size=32):
    with open(filename) as f:
        batch = []
        for line in f:
            batch.append(process(line))
            if len(batch) == batch_size:
                yield np.array(batch)
                batch = []

# Use in training
for batch in data_generator('huge_file.txt'):
    model.train_on_batch(batch)

2. Efficient Data Loading

from torch.utils.data import DataLoader

# Use multiple workers for parallel loading
dataloader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=4,      # Load data in background processes
    pin_memory=True,    # Faster host-to-GPU transfer
    prefetch_factor=2   # Batches prefetched per worker
)

3. Data Caching

from functools import lru_cache

@lru_cache(maxsize=1000)
def load_and_process_image(path):
    img = load_image(path)
    return preprocess(img)

4. Sparse Matrices

import numpy as np
from scipy.sparse import csr_matrix

# Dense: [0, 0, 5, 0, 0, 3, 0, 0]
dense = np.array([0, 0, 5, 0, 0, 3, 0, 0])

# Sparse: only store the non-zero values (plus index arrays)
sparse = csr_matrix(dense)

print(f"Dense size: {dense.nbytes} bytes")
print(f"Sparse size: {sparse.data.nbytes + sparse.indices.nbytes} bytes")

🤖 Model Optimization

1. Quantization

Dynamic Quantization (Easiest):

import torch

model = YourModel()
quantized_model = torch.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},  # Layers to quantize
    dtype=torch.qint8
)

Result: 4x smaller, 2-3x faster!
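
To sanity-check the size reduction, serialize both models and compare the files. A small sketch reusing model and quantized_model from above:

import os

torch.save(model.state_dict(), "fp32.pt")
torch.save(quantized_model.state_dict(), "int8.pt")

print(f"FP32 checkpoint: {os.path.getsize('fp32.pt') / 1e6:.1f} MB")
print(f"INT8 checkpoint: {os.path.getsize('int8.pt') / 1e6:.1f} MB")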

Static Quantization (Better):

model.qconfig = torch.quantization.get_default_qconfig('fbgemm')
torch.quantization.prepare(model, inplace=True)

# Calibrate with representative data
for data in calibration_data:
    model(data)

torch.quantization.convert(model, inplace=True)

2. Pruning

import torch.nn.utils.prune as prune

# Prune 30% of the weights in a linear layer (by L1 magnitude)
prune.l1_unstructured(model.linear, name='weight', amount=0.3)

# Make pruning permanent (bakes the mask into the weight tensor)
prune.remove(model.linear, 'weight')

3. Knowledge Distillation

import torch.nn as nn
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets from teacher, scaled by T^2 to keep gradient magnitudes comparable
    soft_loss = nn.KLDivLoss(reduction='batchmean')(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1)
    ) * (T * T)

    # Hard targets (actual labels)
    hard_loss = F.cross_entropy(student_logits, labels)

    return alpha * soft_loss + (1 - alpha) * hard_loss
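
In the training loop the teacher only provides targets, so it stays frozen under no_grad. A sketch; teacher, student, optimizer, and dataloader are assumed to already exist:

teacher.eval()
for data, labels in dataloader:
    with torch.no_grad():
        teacher_logits = teacher(data)   # Frozen teacher produces soft targets
    student_logits = student(data)

    loss = distillation_loss(student_logits, teacher_logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()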

4. Efficient Architectures

Depthwise Separable Convolutions (MobileNet):

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        # Depthwise: Each channel separately
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3,
                                   padding=1, groups=in_channels)
        # Pointwise: 1x1 conv to combine
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1)
    
    def forward(self, x):
        x = self.depthwise(x)
        x = self.pointwise(x)
        return x

Result: roughly 8-9x fewer parameters for 3x3 kernels, with near-identical accuracy!
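
The 8-9x figure follows directly from the parameter counts. A quick check with illustrative channel sizes, reusing the DepthwiseSeparableConv class defined above:

import torch.nn as nn

standard = nn.Conv2d(256, 256, kernel_size=3, padding=1)
separable = DepthwiseSeparableConv(256, 256)

count = lambda m: sum(p.numel() for p in m.parameters())
print(f"Standard 3x3 conv:   {count(standard):,} params")   # ~590k
print(f"Depthwise separable: {count(separable):,} params")  # ~68k, roughly 8.6x fewer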

🏋️ Training Optimization

1. Mixed Precision Training

from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

for data, target in dataloader:
    optimizer.zero_grad()

    # Forward pass in FP16 where safe
    with autocast():
        output = model(data)
        loss = criterion(output, target)

    # Backward pass with gradient scaling to avoid FP16 underflow
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

Result: up to 2x faster, ~2x less memory (on GPUs with Tensor Cores)!

2. Gradient Accumulation

accumulation_steps = 4

for i, (data, target) in enumerate(dataloader):
    output = model(data)
    loss = criterion(output, target)
    loss = loss / accumulation_steps  # Normalize so gradients match the large batch
    loss.backward()

    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

Result: Effective batch size = batch_size × accumulation_steps

3. Learning Rate Scheduling

from torch.optim.lr_scheduler import CosineAnnealingLR

optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
scheduler = CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):
    train(...)
    scheduler.step()

4. Automatic Batch Size Finder

def find_max_batch_size(model, input_shape, max_batch=1024):
    batch_size = 1
    while batch_size <= max_batch:
        try:
            # Run one forward/backward pass at this batch size
            test_input = torch.randn(batch_size, *input_shape)
            output = model(test_input)
            loss = output.sum()
            loss.backward()
            model.zero_grad(set_to_none=True)  # Drop the test gradients
            batch_size *= 2                    # It fit, so try doubling
        except RuntimeError:  # Out of memory
            return batch_size // 2             # Last size that fit
    return max_batch
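
Note that input_shape here is the shape of a single sample, without the batch dimension. An illustrative call:

# e.g. 224x224 RGB images
max_bs = find_max_batch_size(model, input_shape=(3, 224, 224))
print(f"Largest batch size that fits in memory: {max_bs}")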

---

🚀 Inference Optimization

1. ONNX Runtime

import onnx
import onnxruntime as ort

# Export to ONNX (name the graph inputs/outputs so we can address them below)
torch.onnx.export(model, dummy_input, "model.onnx",
                  input_names=["input"], output_names=["output"])

# Load with ONNX Runtime
session = ort.InferenceSession("model.onnx")

# Inference (often 2-3x faster; input_data is a NumPy array)
outputs = session.run(None, {"input": input_data})

2. Batch Inference

# Slow: process one sample at a time
for img in images:
    result = model(img)

# Fast: process in batches
batch_size = 32
for i in range(0, len(images), batch_size):
    batch = images[i:i+batch_size]
    results = model(batch)

3. Model Caching

from functools import lru_cache

@lru_cache(maxsize=1000)
def predict_cached(image_hash):
    return model(load_image(image_hash))

4. CPU-Specific Optimizations

# Control CPU threading and denormal handling
torch.set_num_threads(4)        # Use 4 CPU cores
torch.set_flush_denormal(True)  # Avoid slow denormal arithmetic

# For inference only: disable autograd bookkeeping
with torch.no_grad():
    output = model(input)

---

📊 Benchmarking

import time
import torch

def benchmark(model, input_shape, num_runs=100):
    model.eval()
    input_data = torch.randn(input_shape)

    # Warmup
    for _ in range(10):
        _ = model(input_data)

    # Benchmark
    start = time.time()
    for _ in range(num_runs):
        with torch.no_grad():
            _ = model(input_data)
    end = time.time()

    avg_time = (end - start) / num_runs
    print(f"Average inference time: {avg_time*1000:.2f}ms")

    # Memory
    print(f"Model size: {sum(p.numel() for p in model.parameters())/1e6:.2f}M params")

---

🎯 ISL Optimization Checklist

For every project, aim to apply:

Memory:

  • [ ] Gradient checkpointing for deep models
  • [ ] Memory profiling to find leaks
  • [ ] Streaming for large datasets
  • [ ] Sparse matrices where applicable

Computation:

  • [ ] Vectorization (NumPy)
  • [ ] CPU optimization (MKL/OpenBLAS)
  • [ ] JIT compilation (Numba) for bottlenecks

Data:

  • [ ] Efficient data loading (num_workers)
  • [ ] Caching frequently used data
  • [ ] Data augmentation instead of more data

Model:

  • [ ] Quantization (INT8)
  • [ ] Pruning (remove 30-50% of weights)
  • [ ] Efficient architectures (MobileNet, etc.)

Training:

  • [ ] Mixed precision (FP16)
  • [ ] Gradient accumulation
  • [ ] Learning rate scheduling

Inference:

  • [ ] ONNX export
  • [ ] Batch inference
  • [ ] Model caching

---

Goal: Build models that are 10x more efficient while maintaining 95%+ of the original performance!