# BD7.3 — TRIZ 9-epoch retrain, OVERTRAIN regression (2026-05-04)

**TL;DR — Followed up BD7's 6-epoch v2 (88/100) with a 9-epoch v3 hoping
to close 12 strict-JSON failures. Loss kept descending (2.35 → 1.52)
but generalization collapsed: T8 v3 = 58/100, organ_leaks 10. −30 points
vs v2. Reverted PHYS05_TRIZ_PACK to v2 immediately. v2 stays in
production. This is the BD6.2 pattern again — descending loss is a
necessary but not sufficient signal; the model began to memorize
specific token patterns and lost generality.**

## Numbers

| run | epochs | avg_loss (final) | T7/T8 strict 6-field | organ_leaks | fb_total |
|-----|--------|------------------|----------------------|-------------|----------|
| T7  v2 | 6 | 1.70 | **88/100** | 0 | 0 |
| T8  v3 | 9 | **1.52** | **58/100 (−30)** | **10** | 0 |

Same training data (`triz_train_80.jsonl`). Same trainer. Same merge
script. Same decoder spec. Only knob changed: `--epochs 6` → `--epochs 9`.

## What overtraining looked like

Loss curve v3:
```
ep0  2.35 →
ep1  2.07
ep2  1.94
ep3  1.84  (≈ v2 final point)
ep4  1.77
ep5  1.70  (= v2 final)
ep6  1.64
ep7  1.57
ep8  1.52
```

So v3 ran the model 0.18 nats below v2's stopping point. That extra
descent is the suspect zone — the LoRA started memorizing token-level
patterns from the 80 training rows and stopped producing schema-correct
outputs on novel ARIZ shapes.

## Failure modes that appeared in T8 v3 (vs absent in v2)

* **organ_leaks = 10**: outputs leaked donor-identity tokens
  ("user", "assistant", chat-template wrappers). The 6-epoch v2 was
  clean.
* **Pass rate split**: among 42 fails, 30 were schema/leak fail and 12
  were `no_json` truncation — same shape as v2's tail, same symptoms,
  but on 30 more tasks.

## Decision (per gate doctrine)

REVERT to v2. PHYS05_TRIZ_PACK pinned back to
`physarum05b_triz_contradiction_v2.planck`. Binary rebuilt 20:48.
v3 pack and adapter retained on disk for autopsy:

```
tools/surgery/output/triz_lora_v3/                       (PEFT adapter)
tools/surgery/output/Physarum05B-TrizContradiction-v3/   (merged HF dir)
physarum05b_triz_contradiction_v3.planck                 (988 MB pack)
reports/BD7_TRIZ_T8_V3_9EP_N100.json                     (raw bench)
```

## Production state (after revert)

```
PHYS05_TRIZ_PACK = physarum05b_triz_contradiction_v2.planck   (BD7 frozen)
                   ARIZ T7 strict 6-field 88/100, fb=0, leaks=0
```

## Lesson

For **r=8 / α=16 LoRA on 80 supervised rows**, the empirical sweet spot
on this data shape is **6 epochs, not 9**. Loss descent past that point
is memorization, not learning. To genuinely lift TRIZ above 88/100 we'd
need:

1. **More training data** — 80 rows is the cap on what r=8 LoRA can
   generalize from. Add 30-50 more strict-curated targets.
2. **Larger rank** — r=16 / α=32 with 6 epochs might allow finer
   discrimination without memorization. Untested.
3. **Curriculum on tail** — train an additional pass *only* on the 12
   v2 fails with hand-curated correct closures. Risk: same memorization
   trap.

These are queued but not blocking. v2's 88/100 is the production gate
result. The TRIZ organ stays alive at 88/100 and fallback=0.

This is the **second documented overtrain regression** in the project
(after BD6.2). Recording the pattern so the same gate-failure doesn't
get repeated under a different name.
