BEASTBULLET: The Complete Journey
From Naive Scaling to ISL Discipline
By Shrikant Bhosale and his team | December 20, 2024
Complete Research Narrative
What This Article Is
This is a complete, honest research narrative documenting our journey building BEASTBULLET – a privacy-first distributed AI system. You’ll learn:
- How we achieved 99% accuracy… then discovered it was completely meaningless
- The scientific mistakes we made (and how to avoid them)
- The Inverse Scaling Law (ISL) framework that saved our project
- Real metrics, real failures, real lessons
Part I: The Vision – Why We Built BEASTBULLET
In late 2024, my team and I faced a dilemma: we needed
powerful AI capabilities, but sending data to cloud providers felt wrong.
The Privacy-Power Dilemma
The options were unsatisfying:
- Claude 3.5 Sonnet: Powerful, but $20/month and zero privacy
- ChatGPT: Same issues, plus rate limits
- Local models: Privacy ✓, but performance lagged
We wanted both – so we built BEASTBULLET.
What Is BEASTBULLET?
A distributed AI system using consumer hardware:
- PC: Knowledge retrieval (Wikipedia FAISS index)
- Mobile devices: 18 specialized micro-experts
- Total cost: Under $500
- Privacy: 100% local processing
Key insight: Separate knowledge retrieval (PC) from reasoning
(mobile experts).
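To make the split concrete, here’s a minimal sketch of the PC-side retrieval step. It assumes a prebuilt FAISS index over Wikipedia passages plus the same sentence-transformers encoder used later in this article; the file names, passage store, and top_k value are illustrative, not our production code.

# Minimal sketch of the PC-side retrieval step (illustrative, not production code)
import faiss
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer('all-MiniLM-L6-v2')               # 384-dim embeddings
index = faiss.read_index('wikipedia_passages.faiss')            # hypothetical prebuilt index
passages = open('wikipedia_passages.txt').read().splitlines()   # hypothetical passage store

def retrieve(query: str, top_k: int = 5):
    """Embed the query locally and return the top_k closest Wikipedia passages."""
    q = encoder.encode([query])              # shape (1, 384), stays on the PC
    _, ids = index.search(q, top_k)          # nearest neighbours in the FAISS index
    return [passages[i] for i in ids[0]]

The retrieved passages are what get handed to the mobile micro-experts for reasoning; nothing leaves the local network.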
Part II: The 99% Accuracy Trap
We achieved 99% accuracy on our first validator training.
We celebrated. We were completely wrong.
The Setup
Training configuration:
- Dataset: 200 SNLI examples
- Model: 79,000 parameters
- Embeddings: Random noise (torch.randn)
- Result: 99% training accuracy
What Actually Happened
Our model learned to memorize random noise, not understand language.
Parameters: 79,000
Examples: 200
Ratio: 395 parameters per example
With 395 parameters per example, the model trivially memorized everything – including noise.
Lesson 1: High accuracy on small data is suspicious
When parameters >> examples, perfect accuracy means overfitting, not learning.
Always validate on held-out data.
The Random Embeddings Disaster
# What we did (WRONG)
embedding = torch.randn(1, 256)  # Random noise!

# What we should have done
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
embedding = model.encode(text)  # Actual semantics
When we tested on real data: 35% accuracy.
Barely better than random guessing (≈33% for three classes).
Part III: The 63% Reality Check
We fixed the embeddings, expanded to 738 examples, ran
5-fold cross-validation.
Result: 63.3% ± 3.0% validation accuracy
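For reference, the 5-fold protocol looks roughly like the sketch below. The classifier here is a simple stand-in (logistic regression), not our actual validator, and the data-loading line is a placeholder for the 738 premise/hypothesis pairs; the point is how the mean ± std figure is computed across folds.

import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

encoder = SentenceTransformer('all-MiniLM-L6-v2')

texts, labels = load_your_dataset()              # hypothetical loader: 738 examples, 3-class labels
X = encoder.encode(texts)                        # real semantic embeddings, shape (738, 384)
y = np.array(labels)

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5, scoring='accuracy')
print(f"{scores.mean():.1%} ± {scores.std():.1%}")  # reported as mean ± std across the 5 folds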
The Overfitting Evidence
Fold 1:
Train Accuracy: 98.3%
Val Accuracy: 60.5%
GAP: 37.8% ← MASSIVE OVERFITTING
Why this happened:
- 79,000 params / 738 examples = 107 params/example
- Rule of thumb: Need 10-100 examples per parameter
- We had 0.009 examples per parameter
Lesson 2: Parameters vs Examples ratio matters
Aim for 10-100 examples per parameter. Below that, you’re in overfitting
territory.
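A quick way to run this check before training is to count parameters directly and divide by the dataset size. The model below is a rough stand-in following the layer sizes shown later in Part V (it lands near, not exactly at, the 79K figure, since details like the dropout rate are guesses here):

import torch.nn as nn

# Rough stand-in for the original validator's layer sizes (dropout rate is a guess)
model = nn.Sequential(
    nn.Linear(384, 128), nn.LayerNorm(128), nn.GELU(), nn.Dropout(0.2),
    nn.Linear(128, 128), nn.LayerNorm(128), nn.GELU(), nn.Dropout(0.2),
    nn.Linear(128, 64), nn.GELU(),
    nn.Linear(64, 3),
)

n_params = sum(p.numel() for p in model.parameters())
n_examples = 738
print(f"{n_params:,} params / {n_examples} examples = {n_params / n_examples:.0f} params per example")
# Rule of thumb from above: aim for 10-100 examples per parameter, i.e. a ratio far below 1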
Part IV: Everything We Did Wrong
A brutally honest list of scientific failures:
1. No Hyperparameter Search
lr = 1e-3            # ← HARDCODED, never tuned
weight_decay = 1e-2  # ← "100x stronger sounds good!"
Never tested: 1e-4, 5e-4, 1e-2 for learning rate
Never tested: 0, 1e-5, 1e-4 for weight decay
2. No Architecture Justification
hidden_dim = 128  # Why 128? Why not 64 or 256?
                  # Answer: No reason. Just picked it.
3. Dataset Too Small
- 738 examples for 79K parameters
- Should have: 79K × 10 = 790,000 examples minimum
- Actually had: 0.09% of the required data
4. Early Stopping Disabled
patience = 999 # Effectively disabled
Model kept training even when validation loss increased.
Lesson 3: Early stopping isn’t optional
It’s essential for small data. Set patience=5-10 and monitor val_loss religiously.
5. No Baseline Comparison
Never compared to:
- Logistic Regression
- Random Forest
- Linear SVM
This was the biggest mistake.
Lesson 4: Always compare to simple baselines
If your neural network can’t beat logistic regression, your architecture choice
needs rethinking.
Scientific Rigor Score: 1/10
Part V: The ISL Framework – Scientific Redemption
We adopted Inverse Scaling Law (ISL) principles:
“When data is scarce and signal is structured, smaller, constrained, well-tested models outperform
large, lazy ones.”
Learn more about ISL: codeberg.org/ishrikantbhosale/isl
What is Inverse Scaling Law?
Traditional Scaling (Forward)
- More parameters = better performance
- Works when you have massive data
- Example: GPT-4 (trillions of tokens)
Inverse Scaling (ISL)
- Fewer parameters = better generalization (when data is limited)
- Works when you have structured, small datasets
- Example: Our 738 examples
When to scale down vs scale up:
- Scale UP when: Data > 100K examples, unstructured tasks, compute available
- Scale DOWN when: Data < 10K examples, structured tasks, overfitting detected
ISL Principle 1: Baseline First
Logistic Regression: 52.7%
Linear SVM: 56.1%
Random Forest: 58.1% ← Best baseline
ISL Threshold: 66.1% (best baseline + 8%)
Rule: If neural model doesn’t beat this, it’s INVALID.
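Here’s a minimal sketch of that baseline-first step, assuming X and y are the precomputed sentence embeddings and labels from earlier; the specific classifier settings (e.g. n_estimators=200) are illustrative, and the +8% margin is the threshold rule stated above.

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# X, y: precomputed sentence embeddings (n_examples, 384) and integer labels
baselines = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Linear SVM': LinearSVC(),
    'Random Forest': RandomForestClassifier(n_estimators=200, random_state=42),
}

scores = {name: cross_val_score(clf, X, y, cv=5).mean() for name, clf in baselines.items()}
best_baseline = max(scores.values())
isl_threshold = best_baseline + 0.08            # ISL rule: best baseline + 8 points

for name, acc in scores.items():
    print(f"{name}: {acc:.1%}")
print(f"ISL threshold: {isl_threshold:.1%} (a neural model that can't beat this is invalid)")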
ISL Principle 2: Brutal Capacity Reduction
# OLD (79K params)
Linear(384 → 128) → LayerNorm → GELU → Dropout
Linear(128 → 128) → LayerNorm → GELU → Dropout
Linear(128 → 64) → GELU
Linear(64 → 3)

# NEW ISL-MLP (12.4K params)
Linear(384 → 32)   # 12,320 params
ReLU
Dropout(0.1)
Linear(32 → 3)     # 99 params
Reduction: 6.3× smaller (79K → 12.4K)
ISL Principle 3: Early Stopping is LAW
patience = 5 # Non-negotiable
monitor = 'val_loss'
restore_best_weights = True
Result: Stopped at epoch 14 (prevented overfitting)
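A minimal PyTorch loop implementing this rule might look like the sketch below. train_loader, val_loader, and max_epochs are placeholders; the optimizer settings match the values reported in Part VII, and "restore best weights" is done manually by keeping a copy of the best state_dict.

import copy
import torch

def train_with_early_stopping(model, train_loader, val_loader, max_epochs=100, patience=5):
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-4)
    criterion = torch.nn.CrossEntropyLoss()
    best_val_loss, best_state, epochs_without_improvement = float('inf'), None, 0

    for epoch in range(max_epochs):
        model.train()
        for x, y in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()

        # Monitor val_loss after every epoch
        model.eval()
        with torch.no_grad():
            val_loss = sum(criterion(model(x), y).item() for x, y in val_loader) / len(val_loader)

        if val_loss < best_val_loss:
            best_val_loss = val_loss
            best_state = copy.deepcopy(model.state_dict())   # remember the best weights
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                                        # patience exhausted: stop training

    model.load_state_dict(best_state)                        # restore_best_weights
    return model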
Part VI: Results & Lessons
Experiment Comparison Table
Random-embedding validator (79K params, 200 examples): 99% train → 35% on real data
Real-embedding validator (79K params, 738 examples): 98.3% train (fold 1), 63.3% ± 3.0% val (5-fold CV)
ISL-MLP (12.4K params, 738 examples): 74.9% train, 56.8% val
Random Forest baseline (738 examples): 58.1% val ← deployed
The Honest Result
ISL-MLP Performance:
- Validation Accuracy: 56.8%
- ISL Threshold: 66.1%
- Status: ❌ FAILED – Below threshold
ISL Decision:
Use Random Forest baseline (58.1%)
Why This is a Success
ISL correctly identified that the neural model doesn’t justify its complexity.
This is exactly what ISL is designed to do – prevent wasted effort on unjustified models.
Lesson 5: Knowing when NOT to use neural networks is valuable
ISL saved us from deploying an overfitted, underperforming model. That’s a win.
Lessons Learned
- Baseline-first prevents waste – Saved hours of neural tuning
- Small datasets need simple models – 738 examples insufficient for 12.4K params
- Early stopping catches overfitting – Train: 74.9%, Val: 56.8% (18% gap)
- Error analysis > hyperparameter tuning – Found length gap, mislabeled examples
- High accuracy on small data is suspicious – Always validate
- Parameters >> examples = overfitting – Need 10-100 examples per parameter
Part VII: Reproducibility
To ensure full reproducibility, here are the exact configurations:
Environment Setup
# Python version
python==3.10.12

# Core dependencies
torch==2.1.0
sentence-transformers==2.2.2
scikit-learn==1.3.2
numpy==1.24.3
pandas==2.0.3

# Install
pip install torch sentence-transformers scikit-learn numpy pandas
Random Seed Control
import random
import numpy as np
import torch

def set_all_seeds(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_all_seeds(42)
Exact Model Configuration
# ISL-MLP Architecture
model = nn.Sequential(
    nn.Linear(384, 32),   # 12,320 params
    nn.ReLU(),
    nn.Dropout(0.1),
    nn.Linear(32, 3)      # 99 params
)
# Total: 12,419 params

# Training hyperparameters
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,             # Best from ISL sweep
    weight_decay=1e-4    # Best from ISL sweep
)

# Early stopping
patience = 5
monitor = 'val_loss'
restore_best_weights = True
Part VIII: Next Steps
For This Project
Immediate (Week 1):
- ✅ Deploy Random Forest baseline (58.1%)
- 🔄 Fix 10 mislabeled examples identified in error analysis
- 🔄 Add 200 more long examples (address 19.7% gap)
Short-term (Month 1):
- 📋 Expand to 2,500 examples
- 📋 Retrain all baselines
- 📋 Re-evaluate if neural becomes viable (target: 70%+)
Long-term (Quarter 1):
- 📋 Production monitoring dashboard
- 📋 Active learning pipeline
- 📋 Continuous improvement loop
For You (The Reader)
If you’re building ML models on small data:
1. Clone our approach:
# Repository coming soon on Codeberg
# https://codeberg.org/ishrikantbhosale/beastbullet
# Check back at potatobullet.com for updates
2. Try with your own data:
- Replace SNLI with your dataset
- Run baseline comparison first
- Apply ISL principles
3. Compare with your own baselines:
- Always test Logistic Regression, Random Forest, SVM
- Set ISL threshold = best_baseline + 8%
- Only use neural if it beats threshold
4. Read Part 2 in series:
- Coming soon: “BEASTBULLET Expert Training”
- How we trained 18 specialized micro-experts
- Mobile deployment strategies
Questions or want to collaborate?
📧 Email: bhosale@potatobullet.com
💻 Codeberg: codeberg.org/ishrikantbhosale (Repository coming soon)
🌐 Follow our journey: potatobullet.com