BEASTBULLET: The Complete Journey
From Naive Scaling to ISL Discipline
By Shrikant Bhosale and his team | December 20, 2024
Complete Research Narrative
What This Article Is
This is a complete, honest research narrative documenting our journey building BEASTBULLET – a privacy-first distributed AI system. You’ll learn:
- How we achieved 99% accuracy… then discovered it was completely meaningless
- The scientific mistakes we made (and how to avoid them)
- The Inverse Scaling Law (ISL) framework that saved our project
- Real metrics, real failures, real lessons
Part I: The Vision – Why We Built BEASTBULLET
In late 2024, my team and I faced a dilemma: we needed
powerful AI capabilities, but sending data to cloud providers felt wrong.
The Privacy-Power Dilemma
The options were unsatisfying:
- Claude 3.5 Sonnet: Powerful, but $20/month and zero privacy
- ChatGPT: Same issues, plus rate limits
- Local models: Privacy ✓, but performance lagged
We wanted both – so we built BEASTBULLET.
What Is BEASTBULLET?
A distributed AI system using consumer hardware:
- PC: Knowledge retrieval (Wikipedia FAISS index)
- Mobile devices: 18 specialized micro-experts
- Total cost: Under $500
- Privacy: 100% local processing
Key insight: Separate knowledge retrieval (PC) from reasoning
(mobile experts).
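To make the split concrete, here’s a minimal sketch of the PC-side retrieval step. It assumes a prebuilt FAISS index over Wikipedia passages plus the same sentence-transformers encoder used later in this article; the file names, passage store, and top_k value are illustrative, not our production code.

# Minimal sketch of the PC-side retrieval step (illustrative, not production code)
import faiss
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer('all-MiniLM-L6-v2')               # 384-dim embeddings
index = faiss.read_index('wikipedia_passages.faiss')            # hypothetical prebuilt index
passages = open('wikipedia_passages.txt').read().splitlines()   # hypothetical passage store

def retrieve(query: str, top_k: int = 5):
    """Embed the query locally and return the top_k closest Wikipedia passages."""
    q = encoder.encode([query])              # shape (1, 384), stays on the PC
    _, ids = index.search(q, top_k)          # nearest neighbours in the FAISS index
    return [passages[i] for i in ids[0]]

The retrieved passages are what get handed to the mobile micro-experts for reasoning; nothing leaves the local network.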
Part II: The 99% Accuracy Trap
We achieved 99% accuracy on our first validator training.
We celebrated. We were completely wrong.
The Setup
Training configuration:
- Dataset: 200 SNLI examples
- Model: 79,000 parameters
- Embeddings: Random noise (torch.randn)
- Result: 99% training accuracy
What Actually Happened
Our model learned to memorize random noise, not understand language.
Parameters: 79,000
Examples: 200
Ratio: 395 parameters per example
With 395 parameters per example, the model trivially memorized everything – including noise.
Lesson 1: High accuracy on small data is suspicious
When parameters >> examples, perfect accuracy means overfitting, not learning.
Always validate on held-out data.
The Random Embeddings Disaster
# What we did (WRONG)
embedding = torch.randn(1, 256)  # Random noise!

# What we should have done
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
embedding = model.encode(text)  # Actual semantics
When we tested on real data: 35% accuracy.
Barely better than random guessing (≈33% for three classes).
Part III: The 63% Reality Check
We fixed the embeddings, expanded to 738 examples, ran
5-fold cross-validation.
Result: 63.3% ± 3.0% validation accuracy
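For reference, the 5-fold protocol looks roughly like the sketch below. The classifier here is a simple stand-in (logistic regression), not our actual validator, and the data-loading line is a placeholder for the 738 premise/hypothesis pairs; the point is how the mean ± std figure is computed across folds.

import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

encoder = SentenceTransformer('all-MiniLM-L6-v2')

texts, labels = load_your_dataset()              # hypothetical loader: 738 examples, 3-class labels
X = encoder.encode(texts)                        # real semantic embeddings, shape (738, 384)
y = np.array(labels)

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5, scoring='accuracy')
print(f"{scores.mean():.1%} ± {scores.std():.1%}")  # reported as mean ± std across the 5 folds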
The Overfitting Evidence
Fold 1:
Train Accuracy: 98.3%
Val Accuracy: 60.5%
GAP: 37.8% ← MASSIVE OVERFITTING
Why this happened:
- 79,000 params / 738 examples = 107 params/example
- Rule of thumb: Need 10-100 examples per parameter
- We had 0.009 examples per parameter
Lesson 2: Parameters vs Examples ratio matters
Aim for 10-100 examples per parameter. Below that, you’re in overfitting
territory.
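A quick way to run this check before training is to count parameters directly and divide by the dataset size. The model below is a rough stand-in following the layer sizes shown later in Part V (it lands near, not exactly at, the 79K figure, since details like the dropout rate are guesses here):

import torch.nn as nn

# Rough stand-in for the original validator's layer sizes (dropout rate is a guess)
model = nn.Sequential(
    nn.Linear(384, 128), nn.LayerNorm(128), nn.GELU(), nn.Dropout(0.2),
    nn.Linear(128, 128), nn.LayerNorm(128), nn.GELU(), nn.Dropout(0.2),
    nn.Linear(128, 64), nn.GELU(),
    nn.Linear(64, 3),
)

n_params = sum(p.numel() for p in model.parameters())
n_examples = 738
print(f"{n_params:,} params / {n_examples} examples = {n_params / n_examples:.0f} params per example")
# Rule of thumb from above: aim for 10-100 examples per parameter, i.e. a ratio far below 1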
Part IV: Everything We Did Wrong
A brutally honest list of scientific failures:
1. No Hyperparameter Search
lr = 1e-3            # ← HARDCODED, never tuned
weight_decay = 1e-2  # ← "100x stronger sounds good!"
Never tested: 1e-4, 5e-4, 1e-2 for learning rate
Never tested: 0, 1e-5, 1e-4 for weight decay
2. No Architecture Justification
hidden_dim = 128  # Why 128? Why not 64 or 256?
                  # Answer: No reason. Just picked it.
3. Dataset Too Small
- 738 examples for 79K parameters
- Should have: 79K × 10 = 790,000 examples minimum
- Actually had: 0.09% of the required data
4. Early Stopping Disabled
patience = 999 # Effectively disabled
Model kept training even when validation loss increased.
Lesson 3: Early stopping isn’t optional
It’s essential for small data. Set patience=5-10 and monitor val_loss religiously.
5. No Baseline Comparison
Never compared to:
- Logistic Regression
- Random Forest
- Linear SVM
This was the biggest mistake.
Lesson 4: Always compare to simple baselines
If your neural network can’t beat logistic regression, your architecture choice
needs rethinking.
Scientific Rigor Score: 1/10
Part V: The ISL Framework – Scientific Redemption
We adopted Inverse Scaling Law (ISL) principles:
“When data is scarce and signal is structured, smaller, constrained, well-tested models outperform
large, lazy ones.”
Learn more about ISL: codeberg.org/ishrikantbhosale/isl
What is Inverse Scaling Law?
Traditional Scaling (Forward)
- More parameters = better performance
- Works when you have massive data
- Example: GPT-4 (trillions of tokens)
Inverse Scaling (ISL)
- Fewer parameters = better generalization (when data is limited)
- Works when you have structured, small datasets
- Example: Our 738 examples
When to scale down vs scale up:
- Scale UP when: Data > 100K examples, unstructured tasks, compute available
- Scale DOWN when: Data < 10K examples, structured tasks, overfitting detected
ISL Principle 1: Baseline First
Logistic Regression: 52.7%
Linear SVM: 56.1%
Random Forest: 58.1% ← Best baseline
ISL Threshold: 66.1% (best baseline + 8%)
Rule: If neural model doesn’t beat this, it’s INVALID.
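Here’s a minimal sketch of that baseline-first step, assuming X and y are the precomputed sentence embeddings and labels from earlier; the specific classifier settings (e.g. n_estimators=200) are illustrative, and the +8% margin is the threshold rule stated above.

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# X, y: precomputed sentence embeddings (n_examples, 384) and integer labels
baselines = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Linear SVM': LinearSVC(),
    'Random Forest': RandomForestClassifier(n_estimators=200, random_state=42),
}

scores = {name: cross_val_score(clf, X, y, cv=5).mean() for name, clf in baselines.items()}
best_baseline = max(scores.values())
isl_threshold = best_baseline + 0.08            # ISL rule: best baseline + 8 points

for name, acc in scores.items():
    print(f"{name}: {acc:.1%}")
print(f"ISL threshold: {isl_threshold:.1%} (a neural model that can't beat this is invalid)")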
ISL Principle 2: Brutal Capacity Reduction
# OLD (79K params)
Linear(384 → 128) → LayerNorm → GELU → Dropout
Linear(128 → 128) → LayerNorm → GELU → Dropout
Linear(128 → 64) → GELU
Linear(64 → 3)

# NEW ISL-MLP (12.4K params)
Linear(384 → 32)   # 12,320 params
ReLU
Dropout(0.1)
Linear(32 → 3)     # 99 params
Reduction: 6.3× smaller (79K → 12.4K)
ISL Principle 3: Early Stopping is LAW
patience = 5 # Non-negotiable
monitor = 'val_loss'
restore_best_weights = True
Result: Stopped at epoch 14 (prevented overfitting)
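A minimal PyTorch loop implementing this rule might look like the sketch below. train_loader, val_loader, and max_epochs are placeholders; the optimizer settings match the values reported in Part VII, and "restore best weights" is done manually by keeping a copy of the best state_dict.

import copy
import torch

def train_with_early_stopping(model, train_loader, val_loader, max_epochs=100, patience=5):
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-4)
    criterion = torch.nn.CrossEntropyLoss()
    best_val_loss, best_state, epochs_without_improvement = float('inf'), None, 0

    for epoch in range(max_epochs):
        model.train()
        for x, y in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()

        # Monitor val_loss after every epoch
        model.eval()
        with torch.no_grad():
            val_loss = sum(criterion(model(x), y).item() for x, y in val_loader) / len(val_loader)

        if val_loss < best_val_loss:
            best_val_loss = val_loss
            best_state = copy.deepcopy(model.state_dict())   # remember the best weights
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                                        # patience exhausted: stop training

    model.load_state_dict(best_state)                        # restore_best_weights
    return model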
Part VI: Results & Lessons
Experiment Comparison Table
Random-embedding validator (79K params, 200 examples): 99% train → 35% on real data
Real-embedding validator (79K params, 738 examples): 98.3% train (fold 1), 63.3% ± 3.0% val (5-fold CV)
ISL-MLP (12.4K params, 738 examples): 74.9% train, 56.8% val
Random Forest baseline (738 examples): 58.1% val ← deployed
The Honest Result
ISL-MLP Performance:
- Validation Accuracy: 56.8%
- ISL Threshold: 66.1%
- Status: ❌ FAILED – Below threshold
ISL Decision:
Use Random Forest baseline (58.1%)
Why This is a Success
ISL correctly identified that the neural model doesn’t justify its complexity.
This is exactly what ISL is designed to do – prevent wasted effort on unjustified models.
Lesson 5: Knowing when NOT to use neural networks is valuable
ISL saved us from deploying an overfitted, underperforming model. That’s a win.
Lessons Learned
- Baseline-first prevents waste – Saved hours of neural tuning
- Small datasets need simple models – 738 examples insufficient for 12.4K params
- Early stopping catches overfitting – Train: 74.9%, Val: 56.8% (18% gap)
- Error analysis > hyperparameter tuning – Found length gap, mislabeled examples
- High accuracy on small data is suspicious – Always validate
- Parameters >> examples = overfitting – Need 10-100 examples per parameter
Part VII: Reproducibility
To ensure full reproducibility, here are the exact configurations:
Environment Setup
# Python version
python==3.10.12

# Core dependencies
torch==2.1.0
sentence-transformers==2.2.2
scikit-learn==1.3.2
numpy==1.24.3
pandas==2.0.3

# Install
pip install torch sentence-transformers scikit-learn numpy pandas
Random Seed Control
import random
import numpy as np
import torch

def set_all_seeds(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_all_seeds(42)
Exact Model Configuration
# ISL-MLP Architecture
model = nn.Sequential(
    nn.Linear(384, 32),   # 12,320 params
    nn.ReLU(),
    nn.Dropout(0.1),
    nn.Linear(32, 3)      # 99 params
)
# Total: 12,419 params

# Training hyperparameters
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,             # Best from ISL sweep
    weight_decay=1e-4    # Best from ISL sweep
)

# Early stopping
patience = 5
monitor = 'val_loss'
restore_best_weights = True
Part VIII: Next Steps
For This Project
Immediate (Week 1):
- ✅ Deploy Random Forest baseline (58.1%)
- 🔄 Fix 10 mislabeled examples identified in error analysis
- 🔄 Add 200 more long examples (address 19.7% gap)
Short-term (Month 1):
- 📋 Expand to 2,500 examples
- 📋 Retrain all baselines
- 📋 Re-evaluate if neural becomes viable (target: 70%+)
Long-term (Quarter 1):
- 📋 Production monitoring dashboard
- 📋 Active learning pipeline
- 📋 Continuous improvement loop
For You (The Reader)
If you’re building ML models on small data:
1. Clone our approach:
# Repository coming soon on Codeberg
# https://codeberg.org/ishrikantbhosale/beastbullet
# Check back at potatobullet.com for updates
2. Try with your own data:
- Replace SNLI with your dataset
- Run baseline comparison first
- Apply ISL principles
3. Compare with your own baselines:
- Always test Logistic Regression, Random Forest, SVM
- Set ISL threshold = best_baseline + 8%
- Only use neural if it beats threshold
4. Read Part 2 in series:
- Coming soon: “BEASTBULLET Expert Training”
- How we trained 18 specialized micro-experts
- Mobile deployment strategies
Questions or want to collaborate?
📧 Email: bhosale@potatobullet.com
💻 Codeberg: codeberg.org/ishrikantbhosale (Repository coming soon)
🌐 Follow our journey: potatobullet.com