Frozen pretrained encoders + Mahalanobis prototype beat every algorithmic sophistication we tried, across 4 datasets and 50+ experiments.
nanoLearn is an empirical study of class-incremental intent classification. We ran 50+ experiments testing feature-based and algorithmic approaches on four datasets: BANKING77 (77 classes), CLINC150 (150 classes), HWU64 (64 classes), and AG News (4-class topic classification).
Main finding: a frozen pretrained sentence encoder combined with a Mahalanobis prototype classifier (running class means + shrunken pooled covariance, updated via Welford's online algorithm) matches the frozen-encoder linear-head joint-training oracle (92.83%) while being perfectly class-incremental, order-invariant, granularity-invariant, and deployable in 5MB. Concatenating three encoders (MiniLM-L6 384d + mpnet-base 768d + e5-large 1024d = 2176d) reaches 94.22% on the official BANKING77 split, 93.33% on 5-fold cross-validation.
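The classifier described above can be sketched in a few lines of NumPy. This is a minimal illustration of the technique (running class means via Welford's algorithm, a shared shrunken pooled covariance, nearest-Mahalanobis classification), not the repo's actual implementation; class and method names are ours.

```python
import numpy as np

class MahalanobisPrototype:
    """Running class means + shrunken pooled covariance, updated online."""

    def __init__(self, dim, shrinkage=1e-4):
        self.dim = dim
        self.shrinkage = shrinkage
        self.means = {}        # class -> running mean, shape (dim,)
        self.counts = {}       # class -> examples seen so far
        self.pooled_m2 = np.zeros((dim, dim))  # pooled within-class scatter
        self.total = 0

    def partial_fit(self, x, label):
        """Welford's online update for one embedding x of class `label`."""
        if label not in self.means:
            self.means[label] = np.zeros(self.dim)
            self.counts[label] = 0
        self.counts[label] += 1
        self.total += 1
        delta = x - self.means[label]
        self.means[label] += delta / self.counts[label]
        delta2 = x - self.means[label]             # deviation after the mean update
        self.pooled_m2 += np.outer(delta, delta2)  # Welford M2, pooled over classes

    def predict(self, x):
        """Classify by minimum Mahalanobis distance to a class mean."""
        cov = self.pooled_m2 / max(self.total - len(self.means), 1)
        cov += self.shrinkage * np.eye(self.dim)   # shrinkage keeps cov well-conditioned
        prec = np.linalg.inv(cov)
        dists = {c: (x - m) @ prec @ (x - m) for c, m in self.means.items()}
        return min(dists, key=dists.get)
```

Because state is only per-class means plus one shared covariance, adding a class is just more `partial_fit` calls: no gradients, no replay buffer, and no dependence on the order classes arrive in.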
Twelve algorithmic sophistications failed and are documented with numbers: LoRA fine-tuning (-24.5pp at best of 9 configs), multi-prototype Mahalanobis (-0.7pp), hierarchical 2-stage classification (-1.8pp), weighted encoder fusion (-2.2pp), cross-encoder reranking (-21pp), kNN-Mahalanobis (-2.6pp), hard-margin replay (-3.6pp vs random reservoir), label noise cleanup (-0.2pp), LLM cascade with Qwen 2.5 3B (-3.8pp), PCA/LDA projection (-3.1pp), bigger encoders alone (bge-large -0.6pp). Five feature-based improvements worked: better encoder (+2.24pp), encoder diversity (+0.94pp), shrinkage tuning (+0.13pp), ICDA augmentation (+0.36pp), multi-label top-K (+3.5pp at set-size 2).
We do not reach fine-tuned SOTA (~95-96%); frozen features cannot encode task-specific word discriminations (e.g., "declined_transfer" vs "declined_card_payment" differ only in the object word). The 2pp gap reflects the frozen-feature ceiling, not a training bug. Published SOTA requires gradient fine-tuning of the encoder, which costs 700 hours of CPU time in our setup.
Production characterization: MiniLM + Mahalanobis gives 455 QPS at 90.5% accuracy with 87MB memory; 3-encoder concat gives 9 QPS at 94.22% with 700MB. Adding a new intent takes 530ms for 100 training examples. Mahalanobis distance itself is microseconds; the encoder forward pass dominates all latency.
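The 3-encoder configuration is plain feature concatenation: encode with each frozen model, L2-normalize each block so no encoder dominates by scale, and stack along the feature axis (384 + 768 + 1024 = 2176d). A sketch, assuming `sentence-transformers`-style models with an `encode` method; the per-encoder normalization choice is our assumption:

```python
import numpy as np

def concat_embed(texts, models):
    """Encode with each frozen model, L2-normalize, concatenate features."""
    parts = []
    for m in models:
        emb = m.encode(texts, convert_to_numpy=True)
        emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # unit rows per encoder
        parts.append(emb)
    return np.concatenate(parts, axis=1)
```

The latency table above follows directly: three forward passes per query instead of one, which is why throughput drops from 455 to 9 QPS while the Mahalanobis step stays in the microseconds.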
Honest caveat: the official BANKING77 test set is 4.36pp easier than its training set under role swap. Both official-split and 5-fold CV numbers should always be reported.
| Dataset | Classes | Task | Catastrophic | Replay | Prototype (best) |
|---|---|---|---|---|---|
| BANKING77 | 77 | intent | 32.5% | 68.9% | 94.22% |
| CLINC150 | 150 | intent | 60.9% | 80.1% | 95.56% |
| HWU64 | 64 | intent | 62.8% | 83.3% | 90.61% |
| AG News | 4 | topic | 66.7% | 82.0% | 90.22% |
The ranking prototype >> replay >> catastrophic holds on all 4 datasets. The gap to replay shrinks as the class count drops: 4-class AG News shows an 8pp gap, vs 22pp on 77-class BANKING77.
| Configuration | Accuracy | Delta |
|---|---|---|
| Frozen MiniLM + Mahalanobis (baseline) | 90.39% | — |
| + shrinkage tuning (1e-4) | 90.52% | +0.13pp |
| + ICDA augmentation | 90.88% | +0.36pp |
| Swap to mpnet | 92.76% | +2.24pp |
| Concat MiniLM+mpnet | 93.28% | +0.52pp |
| Concat MiniLM+mpnet+e5-large | 94.22% | +0.94pp |
| Merge top-30 confused pairs (77→49 groups) | 96.10% | +1.88pp* |
| Multi-label top-2 (T=0.05, set size ~2) | 97.73% | —** |
* Redefines task (49 groups vs 77 classes). ** Strict top-2 accuracy with prediction sets.
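The multi-label top-K row converts per-class Mahalanobis distances into a small prediction set rather than a single label; an answer counts as correct if the true intent is anywhere in the set. One plausible construction (the softmax-over-negative-distances scoring and the cumulative-mass stopping rule are our assumptions, not necessarily the repo's):

```python
import numpy as np

def prediction_set(dists, temperature=0.05, mass=0.9, max_size=3):
    """Adaptive prediction set from a dict of class -> Mahalanobis distance.

    Softmax over negative distances at `temperature`; include labels in
    descending probability until cumulative mass >= `mass`, capped at
    `max_size`. Ambiguous inputs naturally get sets of size ~2.
    """
    labels = list(dists)
    scores = -np.array([dists[c] for c in labels]) / temperature
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    out, total = [], 0.0
    for i in np.argsort(-probs)[:max_size]:
        out.append(labels[i])
        total += probs[i]
        if total >= mass:
            break
    return out
```

On a confident input the set collapses to one label; only near-ties between sibling intents (the "declined_transfer" vs "declined_card_payment" kind) spill into a second entry.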
| Methodology | Accuracy | Note |
|---|---|---|
| Official BANKING77 split (point) | 94.22% | headline |
| Bootstrap over test (100 resamples) | 94.18% +/- 0.47pp | 95% CI: [93.28, 95.06] |
| Train/test role swap | 89.86% | -4.36pp gap — split is not symmetric |
| 5-fold CV (honest number) | 93.33% +/- 0.58pp | report both with the headline |
The official BANKING77 test set is systematically easier than its training set. The honest, methodology-independent number is 93.33% via 5-fold CV.
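The CV protocol amounts to pooling the official train and test examples and re-splitting with class stratification, which removes any asymmetry baked into the published split. A pure-NumPy sketch; `fit_predict` is a hypothetical callback standing in for any classifier (e.g. the Mahalanobis prototype):

```python
import numpy as np

def stratified_kfold_accuracy(X, y, fit_predict, n_splits=5, seed=42):
    """Mean/std accuracy over stratified k folds of the pooled data.

    fit_predict(X_train, y_train, X_test) -> predicted labels.
    """
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(n_splits)]
    for c in np.unique(y):  # stratify: deal each class's indices round-robin
        for i, j in enumerate(rng.permutation(np.where(y == c)[0])):
            folds[i % n_splits].append(j)
    accs = []
    for k in range(n_splits):
        test_idx = np.array(folds[k])
        train_idx = np.array([j for f in range(n_splits) if f != k
                              for j in folds[f]])
        preds = fit_predict(X[train_idx], y[train_idx], X[test_idx])
        accs.append(np.mean(preds == y[test_idx]))
    return float(np.mean(accs)), float(np.std(accs))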
| Approach | Delta |
|---|---|
| LoRA fine-tuning (best of 9) | -24.5pp |
| Multi-prototype Mahalanobis | -0.7pp |
| Hard-margin replay | -3.6pp |
| kNN-Mahalanobis | -2.6pp |
| Hierarchical 2-stage | -1.8pp |
| Weighted fusion | -2.2pp |
| Cross-encoder rerank | -21pp |
| LLM cascade (Qwen 3B) | -3.8pp |
| Label noise cleanup | -0.2pp |
| PCA/LDA projection | -3.1pp |
| Better encoder (MiniLM→mpnet) | +2.24pp |
| + e5-large | +0.94pp |
| Shrinkage tuning | +0.13pp |
| ICDA augmentation | +0.36pp |
| Margin-based OOD | AUROC 0.78→0.85 |
| Multi-label top-K | +3.5pp at set=2 |
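The margin-based OOD row refers to scoring an input by the gap between its two nearest class distances rather than by the nearest distance alone: in-domain inputs typically have one clearly closest class, while out-of-domain inputs land ambiguously between several. A sketch; the sign convention (higher = more in-domain) and any rejection threshold are assumptions:

```python
import numpy as np

def ood_margin_score(dists):
    """Margin OOD score: gap between the two smallest class distances.

    Higher means one class is clearly closest (likely in-domain);
    near-zero means the input sits between classes (likely OOD).
    """
    d = np.sort(np.asarray(list(dists.values())))
    return float(d[1] - d[0])
```

This costs nothing extra at inference, since all class distances are already computed for classification.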
| Config | Cold start | E2E@1 | QPS@128 | Memory (model + state) |
|---|---|---|---|---|
| MiniLM (384d) | 54ms | 24ms | 455 | 87MB + 0.7MB |
| mpnet (768d) | 903ms | 153ms | 32 | 418MB + 2.5MB |
| 3-enc concat (2176d) | 12.3s | 627ms | 9 | 700MB + 18MB |
Learn a new class: 530ms for 100 examples. Accuracy isn't free: +3.83pp from MiniLM to concat costs 50x throughput.
- **VERIFIED** Order-invariant: all 7 class orderings produce the same 90.5195% accuracy to four decimal places. No curriculum sensitivity.
- **VERIFIED** Granularity-invariant: 1/7/11/22/77 classes per task all give identical accuracy. Task size doesn't matter.
- **VERIFIED** Few-shot capable: k=10 examples/class crosses 80%; k=2 reaches 63%.
- **WEAK (improved)** OOD rejection: AUROC 0.78 baseline, 0.85 with the margin score. Still needs a dedicated head for production.
Feature quality >> Learning algorithm. 50+ experiments confirm: every improvement came from better or more features; every attempt to make the algorithm smarter either failed or barely helped. The winning formula is embarrassingly simple — freeze a good encoder, accumulate running class means via Welford, classify via Mahalanobis distance with a shared covariance.
This doesn't beat fine-tuned SOTA (95-96%) which requires gradient training we can't run on CPU. But it matches the linear-head joint oracle (92.83%) while being perfectly class-incremental, 50x faster to deploy, and requiring 5MB of state. For production CIL, this wins.
```bash
pip install -r requirements.txt

# Synthetic benchmark
python run.py train --arch catastrophic_mlp --seed 42
python run.py leaderboard

# BANKING77 full validation
PYTHONPATH=. python -m experiments.banking77
PYTHONPATH=. python -m experiments.banking77_3enc            # 94.22% concat
PYTHONPATH=. python -m experiments.banking77_variance_audit  # honest CV 93.33%

# Cross-dataset: does it hold?
PYTHONPATH=. python -m experiments.clinc150  # 95.56%
PYTHONPATH=. python -m experiments.hwu64     # 90.61%
PYTHONPATH=. python -m experiments.ag_news   # 90.22% (non-intent)
```