BEASTBULLET: The Complete Journey

From Naive Scaling to ISL Discipline

By Shrikant Bhosale and his team | December 20, 2024 |
Complete Research Narrative

💡

What This Article Is

This is a complete, honest research narrative documenting our journey building BEASTBULLET – a privacy-first distributed AI system. You’ll learn:

  • How we achieved 99% accuracy… then discovered it was completely
    meaningless
  • The scientific mistakes we made (and how to avoid them)
  • The Inverse Scaling Law (ISL) framework that saved our project
  • Real metrics, real failures, real lessons

Audience: Researchers, engineers, ML practitioners, students
Reading time: 15 minutes
Takeaway: Scientific rigor beats intuition every time


Part I: The Vision – Why We Built BEASTBULLET

In late 2024, my team and I faced a dilemma: we needed
powerful AI capabilities, but sending data to cloud providers felt wrong.

The Privacy-Power Dilemma

The options were unsatisfying:

  • Claude 3.5 Sonnet: Powerful, but $20/month and zero privacy
  • ChatGPT: Same issues, plus rate limits
  • Local models: Privacy ✓, but performance lagged

We wanted both – so we built BEASTBULLET.

What Is BEASTBULLET?

A distributed AI system using consumer hardware:

  • PC: Knowledge retrieval (Wikipedia FAISS index)
  • Mobile devices: 18 specialized micro-experts
  • Total cost: Under $500
  • Privacy: 100% local processing

Key insight: Separate knowledge retrieval (PC) from reasoning
(mobile experts).
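To make the retrieval half concrete, here is a minimal sketch of how a PC-side semantic index can work with FAISS and sentence-transformers. The passages, model choice, and normalization are illustrative assumptions, not our production pipeline:

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Illustrative stand-ins for the Wikipedia passages in the real index.
passages = ["Paris is the capital of France.",
            "FAISS performs fast similarity search over dense vectors."]

encoder = SentenceTransformer('all-MiniLM-L6-v2')
emb = encoder.encode(passages, normalize_embeddings=True).astype(np.float32)

index = faiss.IndexFlatIP(emb.shape[1])  # inner product = cosine on normalized vectors
index.add(emb)

query = encoder.encode(["What is the capital of France?"],
                       normalize_embeddings=True).astype(np.float32)
scores, ids = index.search(query, 1)     # top-1 passage for the query
print(passages[ids[0][0]], scores[0][0])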


Part II: The 99% Accuracy Trap

We achieved 99% accuracy on our first validator training.
We celebrated. We were completely wrong.

The Setup

Training configuration:

  • Dataset: 200 SNLI examples
  • Model: 79,000 parameters
  • Embeddings: Random noise (torch.randn)
  • Result: 99% training accuracy

What Actually Happened

Our model learned to memorize random noise, not understand language.

Parameters: 79,000

Examples: 200

Ratio: 395 parameters per example

With 395 parameters per example, the model trivially memorized everything – including noise.

Lesson 1: High accuracy on small data is suspicious

When parameters >> examples, perfect accuracy means overfitting, not learning.
Always validate on held-out data.
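A minimal sketch of the habit we skipped, assuming X holds your embeddings and y your labels: carve out a held-out split before you trust any accuracy number.

from sklearn.model_selection import train_test_split

# Stratified 80/20 split: report accuracy only on data the model never saw.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)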

The Random Embeddings Disaster

import torch
from sentence_transformers import SentenceTransformer

text = "A man is playing a guitar."  # any input sentence

# What we did (WRONG)
embedding = torch.randn(1, 256)  # Random noise, unrelated to the text!

# What we should have done
model = SentenceTransformer('all-MiniLM-L6-v2')
embedding = model.encode(text)  # Actual semantics

When we tested on real data: 35% accuracy –
barely above random guessing (33.3% for three classes).


Part III: The 63% Reality Check

We fixed the embeddings, expanded the dataset to 738 examples, and ran 5-fold cross-validation.

Result: 63.3% ± 3.0% validation accuracy
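For reference, a sketch of the cross-validation loop, assuming X are the fixed sentence embeddings, y the labels, and train_mlp / accuracy are stand-ins for our training and evaluation code:

import numpy as np
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_accs = []
for train_idx, val_idx in skf.split(X, y):
    model = train_mlp(X[train_idx], y[train_idx])               # stand-in trainer
    fold_accs.append(accuracy(model, X[val_idx], y[val_idx]))   # stand-in eval
print(f"val acc: {np.mean(fold_accs):.1%} ± {np.std(fold_accs):.1%}")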

The Overfitting Evidence

Fold 1:
  Train Accuracy: 98.3%
  Val Accuracy:   60.5%
  GAP:            37.8%  ← MASSIVE OVERFITTING

Why this happened:

  • 79,000 params / 738 examples = 107 params/example
  • Rule of thumb: Need 10-100 examples per parameter
  • We had 0.009 examples per parameter

⚠️

Lesson 2: The parameters-to-examples ratio matters

Aim for 10-100 examples per parameter. Below that, you’re in overfitting territory.
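A cheap guard against this is to check the ratio before training. A sketch – capacity_check is our own hypothetical helper, not a library function:

def capacity_check(n_params, n_examples, min_ratio=10):
    """Warn when examples-per-parameter falls below a rule-of-thumb floor."""
    ratio = n_examples / n_params
    if ratio < min_ratio:
        print(f"WARNING: {ratio:.4f} examples/param (< {min_ratio}) – expect overfitting")

capacity_check(79_000, 738)  # 0.0093 examples/param -> warning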


Part IV: Everything We Did Wrong

A brutally honest list of scientific failures:

1. No Hyperparameter Search

lr = 1e-3              # ← HARDCODED, never tuned
weight_decay = 1e-2    # ← "100x stronger sounds good!"

Never tested: 1e-4, 5e-4, 1e-2 for learning rate

Never tested: 0, 1e-5, 1e-4 for weight decay
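What a minimal sweep could have looked like – a sketch, with build_model and train_and_validate as hypothetical helpers wrapping our existing training code:

import torch
from itertools import product

# Sweep the values we never tested; re-initialize the model for each run.
for lr, wd in product([1e-4, 5e-4, 1e-3, 1e-2], [0.0, 1e-5, 1e-4, 1e-2]):
    model = build_model()                          # hypothetical model factory
    opt = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=wd)
    val_acc = train_and_validate(model, opt)       # hypothetical train/eval helper
    print(f"lr={lr:g}  wd={wd:g}  ->  val_acc={val_acc:.3f}")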

2. No Architecture Justification

hidden_dim = 128  # Why 128? Why not 64 or 256?
# Answer: No reason. Just picked it.

3. Dataset Too Small

  • 738 examples for 79K parameters
  • Should have: 79K × 10 = 790,000 examples minimum
  • Actually had: 0.09% of required
    data

4. Early Stopping Disabled

patience = 999  # Effectively disabled

Model kept training even when validation loss increased.

💡

Lesson 3: Early stopping isn’t optional

It’s essential for small data. Set patience=5-10 and monitor val_loss religiously.

5. No Baseline Comparison

Never compared to:

  • Logistic Regression
  • Random Forest
  • Linear SVM

This was the biggest mistake.

🚨

Lesson 4: Always compare to simple baselines

If your neural network can’t beat logistic regression, your architecture choice needs rethinking.

Scientific Rigor Score: 1/10


Part V: The ISL Framework – Scientific Redemption

We adopted Inverse Scaling Law (ISL) principles:


“When data is scarce and signal is structured, smaller, constrained, well-tested models outperform
large, lazy ones.”

Learn more about ISL: codeberg.org/ishrikantbhosale/isl

What is Inverse Scaling Law?

Traditional Scaling (Forward)

  • More parameters = better performance
  • Works when you have massive data
  • Example: GPT-4 (trillions of tokens)

Inverse Scaling (ISL)

  • Fewer parameters = better generalization (when
    data is limited)
  • Works when you have structured, small datasets
  • Example: Our 738 examples

When to scale down vs scale up:

  • Scale UP when: Data > 100K
    examples, unstructured tasks, compute available
  • Scale DOWN when: Data < 10K examples, structured tasks, overfitting detected

ISL Principle 1: Baseline First

Logistic Regression: 52.7%
Linear SVM:          56.1%
Random Forest:       58.1% ← Best baseline

ISL Threshold: 66.1% (best baseline + 8%)

Rule: If the neural model doesn’t beat this, it’s INVALID.
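In code, baseline-first is only a few lines – a sketch assuming X are the 384-dim embeddings and y the three-way labels:

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

baselines = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Linear SVM': LinearSVC(),
    'Random Forest': RandomForestClassifier(n_estimators=200, random_state=42),
}
scores = {name: cross_val_score(clf, X, y, cv=5).mean()
          for name, clf in baselines.items()}
threshold = max(scores.values()) + 0.08  # best baseline + 8 points
print(scores, f"ISL threshold: {threshold:.1%}")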

ISL Principle 2: Brutal Capacity Reduction

# OLD (79K params)
Linear(384 → 128) → LayerNorm → GELU → Dropout
Linear(128 → 128) → LayerNorm → GELU → Dropout
Linear(128 → 64) → GELU
Linear(64 → 3)

# NEW ISL-MLP (12.4K params)
Linear(384 → 32)  # 12,320 params
ReLU
Dropout(0.1)
Linear(32 → 3)    # 99 params

Reduction: 6.3× smaller (79K → 12.4K)
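The counts in the comments are easy to verify with a quick PyTorch check:

import torch.nn as nn

isl_mlp = nn.Sequential(
    nn.Linear(384, 32),   # 384*32 weights + 32 biases = 12,320
    nn.ReLU(),
    nn.Dropout(0.1),
    nn.Linear(32, 3),     # 32*3 weights + 3 biases = 99
)
print(sum(p.numel() for p in isl_mlp.parameters()))  # 12419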

ISL Principle 3: Early Stopping is LAW

patience = 5  # Non-negotiable
monitor = 'val_loss'
restore_best_weights = True

Result: Stopped at epoch 14 (prevented overfitting)
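Those three settings amount to a small loop. A sketch of how they can be implemented in plain PyTorch, assuming standard DataLoaders and a loss function:

import copy
import torch

def train_with_early_stopping(model, train_loader, val_loader,
                              optimizer, loss_fn,
                              max_epochs=100, patience=5):
    """Stop once val_loss hasn't improved for `patience` epochs,
    then restore the best weights seen so far."""
    best_loss, best_state, stale = float('inf'), None, 0
    for epoch in range(max_epochs):
        model.train()
        for xb, yb in train_loader:
            optimizer.zero_grad()
            loss_fn(model(xb), yb).backward()
            optimizer.step()
        model.eval()
        with torch.no_grad():
            val_loss = sum(loss_fn(model(xb), yb).item()
                           for xb, yb in val_loader)
        if val_loss < best_loss:
            best_loss, stale = val_loss, 0
            best_state = copy.deepcopy(model.state_dict())
        else:
            stale += 1
            if stale >= patience:
                break
    if best_state is not None:
        model.load_state_dict(best_state)
    return model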


Part VI: Results & Lessons

Experiment Comparison Table

Model            Params   Dataset Size   Val Accuracy   Train/Val Gap   Early Stopping   Baselines
Naive MLP (v1)   79,000   200            35%            64%             No               No
Naive MLP (v2)   79,000   738            63.3%          35%             No               No
ISL-MLP          12,419   738            56.8%          18%             Yes              Yes
Random Forest    –        738            58.1%          6%              N/A              (is baseline)
Logistic Reg     –        738            52.7%          4%              N/A              (is baseline)

The Honest Result

ISL-MLP Performance:

  • Validation Accuracy: 56.8%
  • ISL Threshold: 66.1%
  • Status: ❌ FAILED – Below
    threshold

ISL Decision:
Use Random Forest baseline (58.1%)

Why This is a Success

ISL correctly identified that the neural model doesn’t justify its complexity.

This is exactly what ISL is designed to do – prevent wasted effort on unjustified models.

Lesson 5: Knowing when NOT to use neural networks is valuable

ISL saved us from deploying an overfitted, underperforming model. That’s a win.

Lessons Learned

  1. Baseline-first prevents waste – Saved hours of neural
    tuning
  2. Small datasets need simple models – 738 examples
    insufficient for 12.4K params
  3. Early stopping catches overfitting – Train: 74.9%,
    Val: 56.8% (18% gap)
  4. Error analysis > hyperparameter tuning – Found length
    gap, mislabeled examples
  5. High accuracy on small data is suspicious – Always
    validate
  6. Parameters >> examples = overfitting – Need 10-100
    examples per parameter


Part VII: Reproducibility

To ensure full reproducibility, here are the exact
configurations:

Environment Setup

# Python version
python==3.10.12

# Core dependencies
torch==2.1.0
sentence-transformers==2.2.2
scikit-learn==1.3.2
numpy==1.24.3
pandas==2.0.3

# Install (pinned for reproducibility)
pip install torch==2.1.0 sentence-transformers==2.2.2 scikit-learn==1.3.2 numpy==1.24.3 pandas==2.0.3

Random Seed Control

import random
import numpy as np
import torch

def set_all_seeds(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_all_seeds(42)

Exact Model Configuration

import torch
import torch.nn as nn

# ISL-MLP Architecture
model = nn.Sequential(
    nn.Linear(384, 32),      # 12,320 params
    nn.ReLU(),
    nn.Dropout(0.1),
    nn.Linear(32, 3)         # 99 params
)
# Total: 12,419 params

# Training hyperparameters
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,                 # Best from ISL sweep
    weight_decay=1e-4        # Best from ISL sweep
)

# Early stopping
patience = 5
monitor = 'val_loss'
restore_best_weights = True


Part VIII: Next Steps

For This Project

Immediate (Week 1):

  1. ✅ Deploy Random Forest baseline (58.1%)
  2. 🔄 Fix 10 mislabeled examples identified in error analysis
  3. 🔄 Add 200 more long examples (address the 19.7% length gap found in error analysis)

Short-term (Month 1):

  1. 📋 Expand to 2,500 examples
  2. 📋 Retrain all baselines
  3. 📋 Re-evaluate if neural becomes viable (target: 70%+)

Long-term (Quarter 1):

  1. 📋 Production monitoring dashboard
  2. 📋 Active learning pipeline
  3. 📋 Continuous improvement loop

For You (The Reader)

If you’re building ML models on small
data:

1. Clone our approach:

# Repository coming soon on Codeberg
# https://codeberg.org/ishrikantbhosale/beastbullet
# Check back at potatobullet.com for updates

2. Try with your own data:

  • Replace SNLI with your dataset
  • Run baseline comparison first
  • Apply ISL principles

3. Compare with your own baselines:

  • Always test Logistic Regression, Random Forest, SVM
  • Set ISL threshold = best_baseline + 8%
  • Only use neural if it beats threshold

4. Read Part 2 in series:

  • Coming soon: “BEASTBULLET Expert Training”
  • How we trained 18 specialized micro-experts
  • Mobile deployment strategies

Questions or want to
collaborate?

📧 Email: bhosale@potatobullet.com

💻 Codeberg: codeberg.org/ishrikantbhosale (Repository coming
soon)

🌐 Follow our journey: potatobullet.com


Conclusion

BEASTBULLET taught us:

  • Privacy and power aren’t mutually exclusive
  • Consumer hardware is underrated
  • Failures teach more than successes
  • Scientific rigor beats intuition
  • ISL principles prevent wasted effort

💡

Final Lesson: This is not a product. This is a discipline.

The real value isn’t the 58.1% accuracy. It’s the framework
that prevented us from deploying a 63.3% overfitted disaster.

“If your model can’t
beat logistic regression, your architecture choice needs rethinking.”

This is a research
narrative, not content marketing. We document failures, not just successes.

About the authors:
Shrikant Bhosale and his team are building privacy-first AI systems using consumer hardware and ISL
principles.
