Frozen pretrained encoders + Mahalanobis prototype beat every algorithmic sophistication we tried, across 4 datasets and 50+ experiments.
nanoLearn is an empirical study of class-incremental intent classification. We ran 50+ experiments testing feature-based and algorithmic approaches on four datasets: BANKING77 (77 classes), CLINC150 (150 classes), HWU64 (64 classes), and AG News (4-class topic classification).
Main finding: a frozen pretrained sentence encoder combined with a Mahalanobis prototype classifier (running class means + shrunken pooled covariance, updated via Welford's online algorithm) matches the frozen-encoder linear-head joint-training oracle (92.83%) while being perfectly class-incremental, order-invariant, granularity-invariant, and deployable in 5MB. Concatenating three encoders (MiniLM-L6 384d + mpnet-base 768d + e5-large 1024d = 2176d) reaches 94.22% on the official BANKING77 split, 93.33% on 5-fold cross-validation.
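The classifier described above can be sketched in a few lines of NumPy. This is a minimal illustration of the technique (running class means via Welford's algorithm, a shared shrunken pooled covariance, nearest-Mahalanobis classification), not the repo's actual implementation; class and method names are ours.

```python
import numpy as np

class MahalanobisPrototype:
    """Running class means + shrunken pooled covariance, updated online."""

    def __init__(self, dim, shrinkage=1e-4):
        self.dim = dim
        self.shrinkage = shrinkage
        self.means = {}        # class -> running mean, shape (dim,)
        self.counts = {}       # class -> examples seen so far
        self.pooled_m2 = np.zeros((dim, dim))  # pooled within-class scatter
        self.total = 0

    def partial_fit(self, x, label):
        """Welford's online update for one embedding x of class `label`."""
        if label not in self.means:
            self.means[label] = np.zeros(self.dim)
            self.counts[label] = 0
        self.counts[label] += 1
        self.total += 1
        delta = x - self.means[label]
        self.means[label] += delta / self.counts[label]
        delta2 = x - self.means[label]             # deviation after the mean update
        self.pooled_m2 += np.outer(delta, delta2)  # Welford M2, pooled over classes

    def predict(self, x):
        """Classify by minimum Mahalanobis distance to a class mean."""
        cov = self.pooled_m2 / max(self.total - len(self.means), 1)
        cov += self.shrinkage * np.eye(self.dim)   # shrinkage keeps cov well-conditioned
        prec = np.linalg.inv(cov)
        dists = {c: (x - m) @ prec @ (x - m) for c, m in self.means.items()}
        return min(dists, key=dists.get)
```

Because state is only per-class means plus one shared covariance, adding a class is just more `partial_fit` calls: no gradients, no replay buffer, and no dependence on the order classes arrive in.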
Twelve algorithmic sophistications failed and are documented with numbers: LoRA fine-tuning (-24.5pp at best of 9 configs), multi-prototype Mahalanobis (-0.7pp), hierarchical 2-stage classification (-1.8pp), weighted encoder fusion (-2.2pp), cross-encoder reranking (-21pp), kNN-Mahalanobis (-2.6pp), hard-margin replay (-3.6pp vs random reservoir), label noise cleanup (-0.2pp), LLM cascade with Qwen 2.5 3B (-3.8pp), PCA/LDA projection (-3.1pp), bigger encoders alone (bge-large -0.6pp). Five feature-based improvements worked: better encoder (+2.24pp), encoder diversity (+0.94pp), shrinkage tuning (+0.13pp), ICDA augmentation (+0.36pp), multi-label top-K (+3.5pp at set-size 2).
We do not reach fine-tuned SOTA (~95-96%); frozen features cannot encode task-specific word discriminations (e.g., "declined_transfer" vs "declined_card_payment" differ only in the object word). The 2pp gap reflects the frozen-feature ceiling, not a training bug. Published SOTA requires gradient fine-tuning of the encoder, which costs 700 hours of CPU time in our setup.
Production characterization: MiniLM + Mahalanobis gives 455 QPS at 90.5% accuracy with 87MB memory; 3-encoder concat gives 9 QPS at 94.22% with 700MB. Adding a new intent takes 530ms for 100 training examples. Mahalanobis distance itself is microseconds; the encoder forward pass dominates all latency.
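The 3-encoder configuration is plain feature concatenation: encode with each frozen model, L2-normalize each block so no encoder dominates by scale, and stack along the feature axis (384 + 768 + 1024 = 2176d). A sketch, assuming `sentence-transformers`-style models with an `encode` method; the per-encoder normalization choice is our assumption:

```python
import numpy as np

def concat_embed(texts, models):
    """Encode with each frozen model, L2-normalize, concatenate features."""
    parts = []
    for m in models:
        emb = m.encode(texts, convert_to_numpy=True)
        emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # unit rows per encoder
        parts.append(emb)
    return np.concatenate(parts, axis=1)
```

The latency table above follows directly: three forward passes per query instead of one, which is why throughput drops from 455 to 9 QPS while the Mahalanobis step stays in the microseconds.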
Honest caveat: the official BANKING77 test set is 4.36pp easier than its training set under role swap. Both official-split and 5-fold CV numbers should always be reported.
| Dataset | Classes | Task | Catastrophic | Replay | Prototype (best) |
|---|---|---|---|---|---|
| BANKING77 | 77 | intent | 32.5% | 68.9% | 94.22% |
| CLINC150 | 150 | intent | 60.9% | 80.1% | 95.56% |
| HWU64 | 64 | intent | 62.8% | 83.3% | 90.61% |
| AG News | 4 | topic | 66.7% | 82.0% | 90.22% |
The ranking prototype >> replay >> catastrophic holds on all 4 datasets. The gap to replay shrinks as the class count drops: 4-class AG News shows an 8pp gap, vs 22pp on 77-class BANKING77.
| Configuration | Accuracy | Delta |
|---|---|---|
| Frozen MiniLM + Mahalanobis (baseline) | 90.39% | — |
| + shrinkage tuning (1e-4) | 90.52% | +0.13pp |
| + ICDA augmentation | 90.88% | +0.36pp |
| Swap to mpnet | 92.76% | +2.24pp |
| Concat MiniLM+mpnet | 93.28% | +0.52pp |
| Concat MiniLM+mpnet+e5-large | 94.22% | +0.94pp |
| Merge top-30 confused pairs (77→49 groups) | 96.10% | +1.88pp* |
| Multi-label top-2 (T=0.05, set size ~2) | 97.73% | —** |
* Redefines task (49 groups vs 77 classes). ** Strict top-2 accuracy with prediction sets.
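The multi-label top-K row converts per-class Mahalanobis distances into a small prediction set rather than a single label; an answer counts as correct if the true intent is anywhere in the set. One plausible construction (the softmax-over-negative-distances scoring and the cumulative-mass stopping rule are our assumptions, not necessarily the repo's):

```python
import numpy as np

def prediction_set(dists, temperature=0.05, mass=0.9, max_size=3):
    """Adaptive prediction set from a dict of class -> Mahalanobis distance.

    Softmax over negative distances at `temperature`; include labels in
    descending probability until cumulative mass >= `mass`, capped at
    `max_size`. Ambiguous inputs naturally get sets of size ~2.
    """
    labels = list(dists)
    scores = -np.array([dists[c] for c in labels]) / temperature
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    out, total = [], 0.0
    for i in np.argsort(-probs)[:max_size]:
        out.append(labels[i])
        total += probs[i]
        if total >= mass:
            break
    return out
```

On a confident input the set collapses to one label; only near-ties between sibling intents (the "declined_transfer" vs "declined_card_payment" kind) spill into a second entry.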
| Methodology | Accuracy | Note |
|---|---|---|
| Official BANKING77 split (point) | 94.22% | headline |
| Bootstrap over test (100 resamples) | 94.18% +/- 0.47pp | 95% CI: [93.28, 95.06] |
| Train/test role swap | 89.86% | -4.36pp gap — split is not symmetric |
| 5-fold CV (honest number) | 93.33% +/- 0.58pp | report both with the headline |
The official BANKING77 test set is systematically easier than its training set. The honest, methodology-independent number is 93.33% via 5-fold CV.
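The CV protocol amounts to pooling the official train and test examples and re-splitting with class stratification, which removes any asymmetry baked into the published split. A pure-NumPy sketch; `fit_predict` is a hypothetical callback standing in for any classifier (e.g. the Mahalanobis prototype):

```python
import numpy as np

def stratified_kfold_accuracy(X, y, fit_predict, n_splits=5, seed=42):
    """Mean/std accuracy over stratified k folds of the pooled data.

    fit_predict(X_train, y_train, X_test) -> predicted labels.
    """
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(n_splits)]
    for c in np.unique(y):  # stratify: deal each class's indices round-robin
        for i, j in enumerate(rng.permutation(np.where(y == c)[0])):
            folds[i % n_splits].append(j)
    accs = []
    for k in range(n_splits):
        test_idx = np.array(folds[k])
        train_idx = np.array([j for f in range(n_splits) if f != k
                              for j in folds[f]])
        preds = fit_predict(X[train_idx], y[train_idx], X[test_idx])
        accs.append(np.mean(preds == y[test_idx]))
    return float(np.mean(accs)), float(np.std(accs))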
| Approach | Delta |
|---|---|
| LoRA fine-tuning (best of 9) | -24.5pp |
| Multi-prototype Mahalanobis | -0.7pp |
| Hard-margin replay | -3.6pp |
| kNN-Mahalanobis | -2.6pp |
| Hierarchical 2-stage | -1.8pp |
| Weighted fusion | -2.2pp |
| Cross-encoder rerank | -21pp |
| LLM cascade (Qwen 3B) | -3.8pp |
| Label noise cleanup | -0.2pp |
| PCA/LDA projection | -3.1pp |
| Better encoder (MiniLM→mpnet) | +2.24pp |
| + e5-large | +0.94pp |
| Shrinkage tuning | +0.13pp |
| ICDA augmentation | +0.36pp |
| Margin-based OOD | AUROC 0.78→0.85 |
| Multi-label top-K | +3.5pp at set=2 |
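The margin-based OOD row refers to scoring an input by the gap between its two nearest class distances rather than by the nearest distance alone: in-domain inputs typically have one clearly closest class, while out-of-domain inputs land ambiguously between several. A sketch; the sign convention (higher = more in-domain) and any rejection threshold are assumptions:

```python
import numpy as np

def ood_margin_score(dists):
    """Margin OOD score: gap between the two smallest class distances.

    Higher means one class is clearly closest (likely in-domain);
    near-zero means the input sits between classes (likely OOD).
    """
    d = np.sort(np.asarray(list(dists.values())))
    return float(d[1] - d[0])
```

This costs nothing extra at inference, since all class distances are already computed for classification.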
| Config | Cold start | E2E@1 | QPS@128 | Memory (model + state) |
|---|---|---|---|---|
| MiniLM (384d) | 54ms | 24ms | 455 | 87MB + 0.7MB |
| mpnet (768d) | 903ms | 153ms | 32 | 418MB + 2.5MB |
| 3-enc concat (2176d) | 12.3s | 627ms | 9 | 700MB + 18MB |
Learn a new class: 530ms for 100 examples. Accuracy isn't free: +3.83pp from MiniLM to concat costs 50x throughput.
- **VERIFIED** Order-invariant: all 7 class orderings produce the same 90.5195% accuracy to four decimal places. No curriculum sensitivity.
- **VERIFIED** Granularity-invariant: 1/7/11/22/77 classes per task all give identical accuracy. Task size doesn't matter.
- **VERIFIED** Few-shot capable: k=10 examples/class crosses 80%; k=2 reaches 63%.
- **WEAK (improved)** OOD rejection: AUROC 0.78 baseline, 0.85 with the margin score. Still needs a dedicated head for production.
Feature quality >> Learning algorithm. 50+ experiments confirm: every improvement came from better or more features; every attempt to make the algorithm smarter either failed or barely helped. The winning formula is embarrassingly simple — freeze a good encoder, accumulate running class means via Welford, classify via Mahalanobis distance with a shared covariance.
This doesn't beat fine-tuned SOTA (95-96%) which requires gradient training we can't run on CPU. But it matches the linear-head joint oracle (92.83%) while being perfectly class-incremental, 50x faster to deploy, and requiring 5MB of state. For production CIL, this wins.
```bash
pip install -r requirements.txt

# Synthetic benchmark
python run.py train --arch catastrophic_mlp --seed 42
python run.py leaderboard

# BANKING77 full validation
PYTHONPATH=. python -m experiments.banking77
PYTHONPATH=. python -m experiments.banking77_3enc            # 94.22% concat
PYTHONPATH=. python -m experiments.banking77_variance_audit  # honest CV 93.33%

# Cross-dataset: does it hold?
PYTHONPATH=. python -m experiments.clinc150  # 95.56%
PYTHONPATH=. python -m experiments.hwu64     # 90.61%
PYTHONPATH=. python -m experiments.ag_news   # 90.22% (non-intent)
```