BD7.3 — TRIZ 9-epoch retrain, OVERTRAIN regression (2026-05-04)

TL;DR — Followed up BD7's 6-epoch v2 (88/100) with a 9-epoch v3 hoping to close 12 strict-JSON failures. Loss kept descending (2.35 → 1.52) but generalization collapsed: T8 v3 = 58/100, organ_leaks 10. −30 points vs v2. Reverted PHYS05_TRIZ_PACK to v2 immediately. v2 stays in production. This is the BD6.2 pattern again — descending loss is a necessary but not sufficient signal; the model began to memorize specific token patterns and lost generality.

Numbers

| run | epochs | avg_loss (final) | T7/T8 strict 6-field | organ_leaks | fb_total | |-----|--------|------------------|----------------------|-------------|----------| | T7 v2 | 6 | 1.70 | 88/100 | 0 | 0 | | T8 v3 | 9 | 1.52 | 58/100 (−30) | 10 | 0 |

Same training data (triz_train_80.jsonl). Same trainer. Same merge script. Same decoder spec. Only knob changed: --epochs 6 → --epochs 9.

What overtraining looked like

Loss curve v3:

ep0  2.35 →
ep1  2.07
ep2  1.94
ep3  1.84  (≈ v2 final point)
ep4  1.77
ep5  1.70  (= v2 final)
ep6  1.64
ep7  1.57
ep8  1.52

So v3 ran the model 0.18 nats below v2's stopping point. That extra descent is the suspect zone — the LoRA started memorizing token-level patterns from the 80 training rows and stopped producing schema-correct outputs on novel ARIZ shapes.

Failure modes that appeared in T8 v3 (vs absent in v2)

organ_leaks = 10: outputs leaked donor-identity tokens

("user", "assistant", chat-template wrappers). The 6-epoch v2 was clean.

Pass rate split: among 42 fails, 30 were schema/leak fail and 12

were no_json truncation — same shape as v2's tail, same symptoms, but on 30 more tasks.

Decision (per gate doctrine)

REVERT to v2. PHYS05_TRIZ_PACK pinned back to physarum05b_triz_contradiction_v2.planck. Binary rebuilt 20:48. v3 pack and adapter retained on disk for autopsy:

tools/surgery/output/triz_lora_v3/                       (PEFT adapter)
tools/surgery/output/Physarum05B-TrizContradiction-v3/   (merged HF dir)
physarum05b_triz_contradiction_v3.planck                 (988 MB pack)
reports/BD7_TRIZ_T8_V3_9EP_N100.json                     (raw bench)

Production state (after revert)

PHYS05_TRIZ_PACK = physarum05b_triz_contradiction_v2.planck   (BD7 frozen)
                   ARIZ T7 strict 6-field 88/100, fb=0, leaks=0

Lesson

For r=8 / α=16 LoRA on 80 supervised rows, the empirical sweet spot on this data shape is 6 epochs, not 9. Loss descent past that point is memorization, not learning. To genuinely lift TRIZ above 88/100 we'd need:

More training data — 80 rows is the cap on what r=8 LoRA can

generalize from. Add 30-50 more strict-curated targets.

Larger rank — r=16 / α=32 with 6 epochs might allow finer

discrimination without memorization. Untested.

Curriculum on tail — train an additional pass only on the 12

v2 fails with hand-curated correct closures. Risk: same memorization trap.

These are queued but not blocking. v2's 88/100 is the production gate result. The TRIZ organ stays alive at 88/100 and fallback=0.

This is the second documented overtrain regression in the project (after BD6.2). Recording the pattern so the same gate-failure doesn't get repeated under a different name.