nanoLearn

Frozen pretrained encoders + Mahalanobis prototype beat every algorithmic sophistication we tried, across 4 datasets and 50+ experiments.

Honest headline on BANKING77: 94.22% on the official split / 93.33% +/- 0.58pp on 5-fold CV with zero forgetting, zero gradient training, 5MB state, 530ms per new class. Validated on CLINC150, HWU64, AG News. Every "smarter" learning algorithm we tried lost to this simple recipe.

Abstract (citation-friendly summary)

nanoLearn is an empirical study of class-incremental intent classification. We ran 50+ experiments testing feature-based and algorithmic approaches on four datasets: BANKING77 (77 classes), CLINC150 (150 classes), HWU64 (64 classes), and AG News (4-class topic classification).

Main finding: a frozen pretrained sentence encoder combined with a Mahalanobis prototype classifier (running class means + shrunken pooled covariance, updated via Welford's online algorithm) matches the frozen-encoder linear-head joint-training oracle (92.83%) while being perfectly class-incremental, order-invariant, granularity-invariant, and deployable in 5MB. Concatenating three encoders (MiniLM-L6 384d + mpnet-base 768d + e5-large 1024d = 2176d) reaches 94.22% on the official BANKING77 split, 93.33% on 5-fold cross-validation.
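The whole recipe fits in a short sketch. The class and variable names below are ours, and the shrinkage default mirrors the 1e-4 value tuned later in this README; treat it as a minimal numpy illustration of running class means plus a shrunken pooled covariance updated via Welford's algorithm, not the repo's exact implementation.

```python
import numpy as np

class MahalanobisPrototypes:
    """Class-incremental prototype classifier: running per-class means plus a
    shrunken pooled covariance, both updated online via Welford's algorithm."""

    def __init__(self, dim, shrinkage=1e-4):
        self.dim = dim
        self.shrinkage = shrinkage
        self.counts = {}                 # class -> number of examples seen
        self.means = {}                  # class -> running mean embedding
        self.m2 = np.zeros((dim, dim))   # pooled sum of outer deviations
        self.total = 0

    def partial_fit(self, x, label):
        """Add one embedding x (shape (dim,)) for `label`; O(dim^2) per update."""
        if label not in self.counts:
            self.counts[label] = 0
            self.means[label] = np.zeros(self.dim)
        self.counts[label] += 1
        self.total += 1
        delta = x - self.means[label]            # deviation from old mean
        self.means[label] += delta / self.counts[label]
        delta2 = x - self.means[label]           # deviation from new mean
        self.m2 += np.outer(delta, delta2)       # Welford cross-term

    def predict(self, X):
        """Nearest class by Mahalanobis distance under the shared covariance."""
        dof = max(self.total - len(self.counts), 1)
        cov = self.m2 / dof + self.shrinkage * np.eye(self.dim)
        prec = np.linalg.inv(cov)
        labels = list(self.means)
        mus = np.stack([self.means[c] for c in labels])   # (C, dim)
        diffs = X[:, None, :] - mus[None, :, :]           # (N, C, dim)
        d2 = np.einsum('ncd,de,nce->nc', diffs, prec, diffs)
        return [labels[i] for i in d2.argmin(axis=1)]
```

State is just the per-class counts, means, and one shared scatter matrix, which is why the whole classifier fits in megabytes and learns a new class in one streaming pass.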

Twelve algorithmic sophistications failed and are documented with numbers: LoRA fine-tuning (-24.5pp at best of 9 configs), multi-prototype Mahalanobis (-0.7pp), hierarchical 2-stage classification (-1.8pp), weighted encoder fusion (-2.2pp), cross-encoder reranking (-21pp), kNN-Mahalanobis (-2.6pp), hard-margin replay (-3.6pp vs random reservoir), label noise cleanup (-0.2pp), LLM cascade with Qwen 2.5 3B (-3.8pp), PCA/LDA projection (-3.1pp), bigger encoders alone (bge-large -0.6pp). Five feature-based improvements worked: better encoder (+2.24pp), encoder diversity (+0.94pp), shrinkage tuning (+0.13pp), ICDA augmentation (+0.36pp), multi-label top-K (+3.5pp at set-size 2).

We do not reach fine-tuned SOTA (~95-96%); frozen features cannot encode task-specific word discriminations (e.g., "declined_transfer" vs "declined_card_payment" differ only in the object word). The 2pp gap reflects the frozen-feature ceiling, not a training bug. Published SOTA requires gradient fine-tuning of the encoder, which costs 700 hours of CPU time in our setup.

Production characterization: MiniLM + Mahalanobis gives 455 QPS at 90.5% accuracy with 87MB memory; 3-encoder concat gives 9 QPS at 94.22% with 700MB. Adding a new intent takes 530ms for 100 training examples. Mahalanobis distance itself is microseconds; the encoder forward pass dominates all latency.

Honest caveat: the official BANKING77 test set is 4.36pp easier than its training set under role swap. Both official-split and 5-fold CV numbers should always be reported.

Results Across 4 Datasets

| Dataset | Classes | Task | Catastrophic | Replay | Prototype (best) |
|---|---|---|---|---|---|
| BANKING77 | 77 | intent | 32.5% | 68.9% | 94.22% |
| CLINC150 | 150 | intent | 60.9% | 80.1% | 95.56% |
| HWU64 | 64 | intent | 62.8% | 83.3% | 90.61% |
| AG News | 4 | topic | 66.7% | 82.0% | 90.22% |

The ranking prototype >> replay >> catastrophic holds on all 4 datasets. The gap narrows as the number of classes decreases: AG News, with only 4 classes, shows an 8pp gap between prototype and replay, vs 22pp on BANKING77.

BANKING77 Progression

| Configuration | Accuracy | Delta |
|---|---|---|
| Frozen MiniLM + Mahalanobis (baseline) | 90.39% | — |
| + shrinkage tuning (1e-4) | 90.52% | +0.13pp |
| + ICDA augmentation | 90.88% | +0.36pp |
| Swap to mpnet | 92.76% | +2.24pp |
| Concat MiniLM+mpnet | 93.28% | +0.52pp |
| Concat MiniLM+mpnet+e5-large | 94.22% | +0.94pp |
| Merge top-30 confused pairs (77→49 groups) | 96.10% | +1.88pp* |
| Multi-label top-2 (T=0.05, set size ~2) | 97.73% | —** |

* Redefines task (49 groups vs 77 classes). ** Strict top-2 accuracy with prediction sets.
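The multi-label row scores prediction sets rather than a single argmax. The exact set construction isn't spelled out here, so the sketch below makes one plausible choice: softmax the negative class distances at temperature `T` and keep the `k` most probable labels, counting a query correct when the true label lands in the set.

```python
import numpy as np

def prediction_sets(d2, labels, k=2, temperature=0.05):
    """Turn per-class squared distances (N, C) into top-k prediction sets.
    Hypothetical construction: softmax of negative distances at the given
    temperature, then keep the k most probable labels per query."""
    logits = -d2 / temperature
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    top = np.argsort(-probs, axis=1)[:, :k]       # k best classes per row
    return [[labels[j] for j in row] for row in top]

def set_accuracy(sets, y_true):
    """Strict set accuracy: a query is correct iff its true label is in the set."""
    return float(np.mean([y in s for s, y in zip(sets, y_true)]))
```

Under this scoring, a confusable pair like two near-identical intents can both appear in the set, which is where the +3.5pp at set size 2 comes from.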

Honest Variance: the split matters

| Methodology | Accuracy | Note |
|---|---|---|
| Official BANKING77 split (point) | 94.22% | headline |
| Bootstrap over test (100 resamples) | 94.18% +/- 0.47pp | 95% CI: [93.28, 95.06] |
| Train/test role swap | 89.86% | -4.36pp gap; split is not symmetric |
| 5-fold CV (honest number) | 93.33% +/- 0.58pp | report both with the headline |

The official BANKING77 test set is systematically easier than its training set. The honest, methodology-independent number is 93.33% via 5-fold CV.
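The bootstrap row can be reproduced in a few lines: resample the per-example correctness vector with replacement and take percentile quantiles. A sketch, not the repo's audit script:

```python
import numpy as np

def bootstrap_ci(correct, n_resamples=100, alpha=0.05, seed=0):
    """Percentile bootstrap over a test set. `correct` is a 0/1 vector of
    per-example correctness; returns (mean accuracy, (lo, hi) CI)."""
    rng = np.random.default_rng(seed)
    correct = np.asarray(correct, dtype=float)
    accs = np.array([
        rng.choice(correct, size=correct.size, replace=True).mean()
        for _ in range(n_resamples)
    ])
    lo, hi = np.quantile(accs, [alpha / 2, 1 - alpha / 2])
    return accs.mean(), (lo, hi)
```

Note what this does and doesn't measure: it captures sampling variance *within* the official test set, which is why it misses the -4.36pp role-swap asymmetry that only cross-validation exposes.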

What Failed (50+ experiments)

Algorithm sophistication: all failed

| Sophistication | Delta |
|---|---|
| LoRA fine-tuning (best of 9) | -24.5pp |
| Multi-prototype Mahalanobis | -0.7pp |
| Hard-margin replay | -3.6pp |
| kNN-Mahalanobis | -2.6pp |
| Hierarchical 2-stage | -1.8pp |
| Weighted fusion | -2.2pp |
| Cross-encoder rerank | -21pp |
| LLM cascade (Qwen 3B) | -3.8pp |
| Label noise cleanup | -0.2pp |
| PCA/LDA projection | -3.1pp |

Feature improvements: all worked

| Improvement | Delta |
|---|---|
| Better encoder (MiniLM→mpnet) | +2.24pp |
| + e5-large | +0.94pp |
| Shrinkage tuning | +0.13pp |
| ICDA augmentation | +0.36pp |
| Margin-based OOD | AUROC 0.78→0.85 |
| Multi-label top-K | +3.5pp at set=2 |

Production Latency (Apple M-series CPU)

| Config | Cold start | E2E@1 | QPS@128 | Memory |
|---|---|---|---|---|
| MiniLM (384d) | 54ms | 24ms | 455 | 87MB + 0.7MB |
| mpnet (768d) | 903ms | 153ms | 32 | 418MB + 2.5MB |
| 3-enc concat (2176d) | 12.3s | 627ms | 9 | 700MB + 18MB |

Learn a new class: 530ms for 100 examples. Accuracy isn't free: +3.83pp from MiniLM to concat costs 50x throughput.

Invariance Properties

VERIFIED Order-invariant — 7 orderings produce identical 90.5195% accuracy to 6 decimal places. No curriculum sensitivity.
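Order invariance is a property of the statistics, not of tuning: class means and the pooled covariance are sums over examples, and sums are symmetric in presentation order up to floating-point rounding. A quick self-contained check on the running mean:

```python
import numpy as np

def welford_mean(stream):
    """Running mean via Welford's update; no stored history."""
    mu, n = 0.0, 0
    for x in stream:
        n += 1
        mu += (x - mu) / n
    return mu

# Same data in two presentation orders: the accumulated statistic agrees
# to floating-point precision, so no curriculum can change the classifier.
rng = np.random.default_rng(42)
data = rng.normal(size=1000)
assert abs(welford_mean(data) - welford_mean(data[::-1])) < 1e-9
assert abs(welford_mean(data) - data.mean()) < 1e-9
```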

VERIFIED Granularity-invariant — 1/7/11/22/77 classes per task all identical. Task size doesn't matter.

VERIFIED Few-shot capable — k=10 examples/class crosses 80%. k=2 reaches 63%.

WEAK (improved) OOD rejection — AUROC 0.78 baseline, 0.85 with the margin score. Still needs a dedicated OOD head for production.
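One way to realize the margin score, assuming it is the gap between the two smallest class distances (the repo's exact formula may differ): in-distribution queries sit clearly closest to one prototype, while OOD queries are roughly equidistant from several.

```python
import numpy as np

def margin_ood_score(d2):
    """OOD score from per-class squared distances (N, C): the gap between
    the two smallest distances per query. Small margin => likely OOD.
    Hypothetical realization of the margin score; not the repo's code."""
    part = np.partition(d2, 1, axis=1)   # two smallest distances per row
    return part[:, 1] - part[:, 0]
```

Thresholding this score gives a reject option; the AUROC numbers above suggest it helps but isn't yet production-grade on its own.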

The Honest Insight

Feature quality >> Learning algorithm. 50+ experiments confirm: every improvement came from better or more features; every attempt to make the algorithm smarter either failed or barely helped. The winning formula is embarrassingly simple — freeze a good encoder, accumulate running class means via Welford, classify via Mahalanobis distance with a shared covariance.

This doesn't beat fine-tuned SOTA (95-96%), which requires gradient training of the encoder (~700 CPU-hours in our setup, impractical here). But it matches the linear-head joint oracle (92.83%) while being perfectly class-incremental, 50x faster to deploy, and requiring only 5MB of state. For production class-incremental learning, this wins.

Quick Start

```shell
pip install -r requirements.txt

# Synthetic benchmark
python run.py train --arch catastrophic_mlp --seed 42
python run.py leaderboard

# BANKING77 full validation
PYTHONPATH=. python -m experiments.banking77
PYTHONPATH=. python -m experiments.banking77_3enc            # 94.22% concat
PYTHONPATH=. python -m experiments.banking77_variance_audit  # honest CV 93.33%

# Cross-dataset: does it hold?
PYTHONPATH=. python -m experiments.clinc150                   # 95.56%
PYTHONPATH=. python -m experiments.hwu64                      # 90.61%
PYTHONPATH=. python -m experiments.ag_news                    # 90.22% (non-intent)
```