BEASTBULLET Validator: From 35% to 99% Accuracy in 6 Hours
By Shrikant Bhosale | December 20, 2024 | Technical Deep Dive
| Part 2 of BEASTBULLET Series
What This Article Is
This is the complete technical journey
of training the BEASTBULLET validator expert – from initial failure at 35% accuracy to breakthrough at
99%. You’ll learn:
- How we debugged the “Flat Earth Bug” (model said Earth is flat was
VALID) - Why 200 examples with semantic embeddings beat 500 with random noise
- The critical mistake that wasted 4 hours of training
- Exact code you can copy-paste to avoid our mistakes
- How sentence-transformers saved the project
about embeddings
hours)
🎯 Tactical Takeaways (Read This First)
If you’re training a small model on limited
data:
1. Use semantic embeddings from day 1
Don’t waste time with random vectors
# ✅ DO THIS (saves 4 hours): from sentence_transformers import SentenceTransformer model = SentenceTransformer('all-MiniLM-L6-v2') embedding = model.encode(text) # ❌ NOT THIS: embedding = torch.randn(256) # Random noise
2. Use professional datasets
SNLI, GLUE, SQuAD beat synthetic generation
- SNLI: 30 seconds to download vs 16 minutes to generate
- Pre-labeled, balanced, high-quality
3. Test on edge cases immediately
We caught the “Flat Earth Bug” early
test_cases = [
("The Earth is flat", "INVALID"),
("Paris is the capital of France", "VALID"),
("The sky might be blue", "UNCERTAIN")
]
4. Monitor both accuracy AND predictions
35% accuracy + wrong predictions = broken embeddings
Table of Contents
Part I: The Mission
What is the Validator Expert?
The validator is a critical component of BEASTBULLET – a 314KB micro-model that judges
whether an AI-generated answer is:
- VALID – Factually correct
- INVALID – Factually wrong
- UNCERTAIN – Needs more context
Why it matters:
- Runs on mobile devices (no cloud needed)
- Catches hallucinations before they reach users
- Enables privacy-first AI
The Challenge
Train a model that can:
- Distinguish truth from falsehood
- Run in under 200ms on CPU
- Fit in under 500KB
- Achieve 90%+ accuracy
Timeline: 6 hours
Starting accuracy: 35% (random
guessing)
Target accuracy: 90%+
Part II: Phase 1 – The 35% Disaster
Attempt 1.1: Ollama Synthetic Generation
# Generate training data using Ollama Mistral for i in range(500): prompt = "Generate a fact-checking example..." response = ollama.generate(prompt, timeout=30)
❌ Problem 1: Timeouts
Ollama responses took 40-60 seconds. Our 30-second timeout was too aggressive.
Solution: Increased timeout to 90 seconds
❌ Problem 2: Slow Generation
1 example per call = 500 seconds for 500 examples. Too slow for iteration.
Result: Generated only 20 examples before abandoning
approach.
Attempt 1.2: Training with Random Embeddings
# What we did (WRONG): embedding = torch.randn(1, 256) # Random noise! # Every input gets a different random vector # No relationship to actual text content
Training Results:
Epoch 1: 35% accuracy
Epoch 5: 38% accuracy
Epoch 10: 42% accuracy
Epoch 20: 35% accuracy ← Back to random guessing
🚨 The “Flat Earth Bug”
Input: “The Earth is flat”
Model prediction: VALID
Confidence: +0.34
Expected: INVALID with negative
confidence
The model couldn’t distinguish truth from falsehood because embeddings
were random noise.
Root Cause: Random embeddings provide zero semantic information.
Part III: Phase 2 – Scaling Didn’t Help
Attempt 2.1: Batch Generation (Failed)
# Generate 50 examples per Ollama call
prompt = "Generate 50 fact-checking examples in JSON format..."
response = ollama.generate(prompt, timeout=180)
❌ Problem: Still Timing Out
Even at 240 seconds, Ollama couldn’t handle batch complexity. JSON parsing also
failed frequently.
Result: Abandoned Ollama entirely.
Attempt 2.2: SNLI Dataset (Success!)
The Breakthrough:
Instead of generating synthetic data, we used the Stanford Natural Language Inference
(SNLI) dataset:
from datasets import load_dataset
# Download professional-grade dataset
dataset = load_dataset("snli", split="train", streaming=True)
# Convert labels:
# entailment (0) → VALID
# contradiction (2) → INVALID
# neutral (1) → UNCERTAIN
✅ Results:
- 200 examples in < 30 seconds (vs 16 minutes with Ollama)
- Perfect distribution: 33% VALID, 33% INVALID, 34% UNCERTAIN
- Professional quality, pre-labeled data
Attempt 2.3: SNLI + Random Embeddings
The Test:
- Dataset: 200 SNLI examples (10x larger than before)
- Embeddings: Still random (torch.randn(256))
- Epochs: 50 (increased from 20)
Epoch 1: 29% accuracy
Epoch 10: 31% accuracy
Epoch 20: 34% accuracy
Epoch 50: 30% accuracy ← Still random guessing
⚠️ Critical Realization
Dataset quality doesn’t matter if embeddings are
random!
200 professional examples with random embeddings = 20 synthetic examples
with random embeddings = complete failure
Lesson: The problem wasn’t the data. It was the embeddings.
Part IV: Phase 3 – The Semantic Breakthrough
The Fix: Sentence-Transformers
./venv/bin/pip install sentence-transformers
Model: all-MiniLM-L6-v2
- Size: 80MB
- Output: 384-dimensional semantic vectors
- Speed: Fast on CPU
Pre-computing Embeddings
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
for example in dataset:
# Combine draft answer + context
text = f"{example['draft_answer']} [SEP] {example['context']}"
# Generate REAL semantic embedding
embedding = model.encode(text, convert_to_numpy=True)
example['embedding'] = embedding.tolist()
✅ Results:
- 200 embeddings in ~30 seconds
- File size: 2.3 MB (vs 66.8 KB without embeddings)
- Real semantic representations
Training with Semantic Embeddings
# BEFORE (Random): embedding = torch.randn(1, 256).to(device) # AFTER (Semantic): embedding = torch.tensor( ex['embedding'], dtype=torch.float32 ).unsqueeze(0).to(device)
The Results – BREAKTHROUGH
Epoch 1: 31.0% accuracy (starting from random) Epoch 5: 67.0% accuracy (learning begins!) Epoch 7: 75.0% accuracy (target exceeded!) Epoch 10: 83.5% accuracy (excellent) Epoch 20: 91.0% accuracy (near-perfect) Epoch 33: 97.0% accuracy (production-ready) Epoch 47: 99.0% accuracy (BEST) ✅
✅ Success Metrics
- Accuracy: 99% (vs 35% with random embeddings)
- Training time: 2 minutes on CPU
- Model size: 314.5 KB
- Parameters: 78,916 (~79K)
Test Predictions
Before (Random Embeddings)
Input: “The Earth is flat”
Prediction: VALID ❌
Confidence: +0.34
After (Semantic Embeddings)
Input: “The Earth is flat”
Prediction: INVALID ✅
Confidence: -0.89
Perfect!
Part V: Lessons Learned
1. Embeddings Matter More Than Data
Key Insight: 200 examples with semantic embeddings > 500
examples with random noise
Quality of representation > Quantity of data
2. ISL Framework Validation
The Architecture:
Large Model (80MB) → Generates Embeddings (one-time on PC)
↓
Small Model (314KB) → Learns Judgment (runs on mobile)
Perfect separation of concerns:
- Heavy lifting (embedding generation): PC with sentence-transformers
- Lightweight inference (validation): Mobile with 314KB model
3. Use Professional Datasets
Lesson: Don’t reinvent the wheel. Use established datasets.
4. Identify Root Causes
Symptom: Low accuracy (35%)
Wrong diagnosis: Dataset too small
Correct diagnosis: Random embeddings
Evidence:
- Scaling from 20 → 200 examples: No improvement
- Switching to semantic embeddings: 35% → 99%
Warning: Don’t fix symptoms. Fix root
causes.
5. Rapid Learning with Good Representations
Timeline:
- Epoch 7: 75% accuracy (7 minutes)
- Epoch 47: 99% accuracy (47 minutes)
Proof: Small models can excel with good input
representations.
Part VI: Technical Specifications
Final Model
class ValidatorExpert(nn.Module):
def __init__(self):
super().__init__()
self.network = nn.Sequential(
nn.Linear(384, 256), # Semantic input
nn.GELU(),
nn.LayerNorm(256),
nn.Dropout(0.1),
nn.Linear(256, 128),
nn.GELU(),
nn.LayerNorm(128),
nn.Linear(128, 64),
nn.GELU(),
nn.Linear(64, 3) # VALID, INVALID, UNCERTAIN
)
self.confidence_head = nn.Linear(64, 1)
Parameters: 78,916 (~79K)
Size: 314.5 KB
Input: 384-dim semantic vectors
Output: 3-class verdict + confidence
Accuracy: 99%
Training time: 2 minutes (CPU)
Comparison: Before vs After
Timeline Summary
Conclusion
Journey Summary:
- Started: 35% accuracy with random embeddings
- Struggled: Ollama timeouts, slow generation, wrong
diagnosis - Breakthrough: Semantic embeddings from
sentence-transformers - Achieved: 99% accuracy, production-ready model
Final Lesson
Embedding quality matters more than dataset
size.
200 examples with semantic embeddings beat 500 with random noise. Always
fix the representation first.
Status: ✅
PRODUCTION READY
Next: Hardening with adversarial examples (coming in Part 3)
Resources
Code:
- Validator Training (Coming soon on Codeberg)
- SNLI Converter (Coming soon on Codeberg)
- Embedding Generator (Coming soon on Codeberg)
Related Posts:
Questions or want to
collaborate?
📧 Email: bhosale@potatobullet.com
💻 Codeberg: codeberg.org/ishrikantbhosale (Repository coming
soon)
🌐 Follow our journey: potatobullet.com
This is a technical
deep dive, not content marketing. We document failures and breakthroughs.
About the author: Shrikant Bhosale is
building privacy-first AI systems using consumer hardware and ISL principles.