BD7.3 — TRIZ 9-epoch retrain, OVERTRAIN regression (2026-05-04)
TL;DR — Followed up BD7's 6-epoch v2 (88/100) with a 9-epoch v3 hoping to close 12 strict-JSON failures. Loss kept descending (2.35 → 1.52) but generalization collapsed: T8 v3 = 58/100, organ_leaks 10. −30 points vs v2. Reverted PHYS05_TRIZ_PACK to v2 immediately. v2 stays in production. This is the BD6.2 pattern again — descending loss is a necessary but not sufficient signal; the model began to memorize specific token patterns and lost generality.
Numbers
| run | epochs | avg_loss (final) | T7/T8 strict 6-field | organ_leaks | fb_total | |-----|--------|------------------|----------------------|-------------|----------| | T7 v2 | 6 | 1.70 | 88/100 | 0 | 0 | | T8 v3 | 9 | 1.52 | 58/100 (−30) | 10 | 0 |
Same training data (triz_train_80.jsonl). Same trainer. Same merge script. Same decoder spec. Only knob changed: --epochs 6 → --epochs 9.
What overtraining looked like
Loss curve v3:
ep0 2.35 →
ep1 2.07
ep2 1.94
ep3 1.84 (≈ v2 final point)
ep4 1.77
ep5 1.70 (= v2 final)
ep6 1.64
ep7 1.57
ep8 1.52
So v3 ran the model 0.18 nats below v2's stopping point. That extra descent is the suspect zone — the LoRA started memorizing token-level patterns from the 80 training rows and stopped producing schema-correct outputs on novel ARIZ shapes.
Failure modes that appeared in T8 v3 (vs absent in v2)
- organ_leaks = 10: outputs leaked donor-identity tokens
("user", "assistant", chat-template wrappers). The 6-epoch v2 was clean.
- Pass rate split: among 42 fails, 30 were schema/leak fail and 12
were no_json truncation — same shape as v2's tail, same symptoms, but on 30 more tasks.
Decision (per gate doctrine)
REVERT to v2. PHYS05_TRIZ_PACK pinned back to physarum05b_triz_contradiction_v2.planck. Binary rebuilt 20:48. v3 pack and adapter retained on disk for autopsy:
tools/surgery/output/triz_lora_v3/ (PEFT adapter)
tools/surgery/output/Physarum05B-TrizContradiction-v3/ (merged HF dir)
physarum05b_triz_contradiction_v3.planck (988 MB pack)
reports/BD7_TRIZ_T8_V3_9EP_N100.json (raw bench)
Production state (after revert)
PHYS05_TRIZ_PACK = physarum05b_triz_contradiction_v2.planck (BD7 frozen)
ARIZ T7 strict 6-field 88/100, fb=0, leaks=0
Lesson
For r=8 / α=16 LoRA on 80 supervised rows, the empirical sweet spot on this data shape is 6 epochs, not 9. Loss descent past that point is memorization, not learning. To genuinely lift TRIZ above 88/100 we'd need:
- More training data — 80 rows is the cap on what r=8 LoRA can
generalize from. Add 30-50 more strict-curated targets.
- Larger rank — r=16 / α=32 with 6 epochs might allow finer
discrimination without memorization. Untested.
- Curriculum on tail — train an additional pass only on the 12
v2 fails with hand-curated correct closures. Risk: same memorization trap.
These are queued but not blocking. v2's 88/100 is the production gate result. The TRIZ organ stays alive at 88/100 and fallback=0.
This is the second documented overtrain regression in the project (after BD6.2). Recording the pattern so the same gate-failure doesn't get repeated under a different name.