# BD6.2 — overtrain regression on phys05_code_skeleton (2026-05-01)

**TL;DR: BD6.2 regressed on MBPP (13/100 → 6/100), gained modestly on
HumanEval (6/164 → 8/164), and did not move LCB (0/50). PHYS05_PACK is
reverted to the BD6 pass-1 pack as production. The pipeline did not
fail; overtraining did. Honest write-up below so the next pass doesn't
repeat the same mistake.**

---

## Pipeline (PYTHON_QUARANTINE-compliant)

Same disposable Python capsule as BD6 pass-1 — no Python in runtime:

```
bench failures (post-BD6 reports/MBPP_HE_3MODE_V1.json + LCB)
  → tools/surgery/bench_to_poison_dataset.py            [Python]
    → 295 fresh post-BD6 fail rows
inline merge with poison_train_v1.jsonl (306 baseline rows)
  → poison_train_v2.jsonl                               [310 union, 260 with refs]
  → tools/surgery/train_code_skeleton_lora.py           [Python, QLoRA, GPU]
    rank=16 alpha=32 lr=2e-4 epochs=4 batch=1 max-len=1024
    avg loss: 0.45 → 0.30 → 0.22 → 0.18
  → tools/surgery/output/code_skeleton_lora_v2/         [PEFT adapter]
    → tools/surgery/merge_code_skeleton_lora.py         [Python, merge + planck7b_tool]
      → physarum05b_code_skeleton_v2.planck             [BF16 pack]
        → src/organs/organ_manager.cpp PHYS05_PACK retargeted
          → make -j4
            → C++ runtime mmaps the new pack
              → Mode-B re-bench (NO_7B_FALLBACK=1)
                → reports/MBPP_HE_3MODE_V1.json overwritten (Mode-B only)
                  → THIS REPORT
                    → REVERT PHYS05_PACK to v1 (production)
```

Python's only outputs were the JSONL dataset, the PEFT adapter dir,
and the .planck pack. The runtime never imported torch/peft. Quarantine
held.
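
The inline merge step was a one-shot script and is not reproduced in this report. A minimal sketch of a union keyed on `task_id` could look like the following. The helper names (`load_jsonl`, `merge_poison_rows`, `write_jsonl`) and the policy that a fresh row replaces a baseline row with the same `task_id` are assumptions for illustration, not the script that actually ran.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read one JSON object per non-empty line."""
    with Path(path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def merge_poison_rows(baseline_rows, fresh_rows):
    """Union two row lists by task_id.

    Assumed policy: when both lists contain the same task_id, the fresh
    row wins. Baseline order is preserved; new ids are appended.
    """
    merged = {row["task_id"]: row for row in baseline_rows}
    for row in fresh_rows:
        merged[row["task_id"]] = row
    return list(merged.values())


def write_jsonl(rows, path):
    """Write one JSON object per line."""
    with Path(path).open("w", encoding="utf-8") as fh:
        for row in rows:
            fh.write(json.dumps(row) + "\n")
```

Keying on `task_id` is what makes the result a union rather than a concatenation: rows that fail in both passes appear once.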

---

## Numbers

| bench       | n   | baseline B | BD6 pass-1 B | **BD6.2 B**    | Δ vs pass-1 |
|-------------|-----|------------|--------------|----------------|-------------|
| MBPP        | 100 | 6/100      | **13/100**   | 6/100          | **−7 (regression)** |
| HumanEval   | 164 | 2/164      | 6/164        | **8/164**      | +2 |
| LiveCodeBench | 50 | 0/50       | 0/50         | 0/50           | 0 (dispatcher-leak unchanged) |

`organs_used_set` for MBPP and HumanEval: `{phys05_code_skeleton}` only.
`fallback_count` for both: **0**. Quarantine held end-to-end, with no 7B leak.

### TASK 5 constraints, BD6.2

| constraint                                  | post-surgery |
|---------------------------------------------|--------------|
| 0.5B organs used                            | ✅ `phys05_code_skeleton` only |
| BD written                                   | ✅ MBPP-B 86/100 envelopes, HE-B 117/164 envelopes carry food/poison/cond |
| no route falls through wrong handler        | ✅ |
| no json_repair → ariz_e2e                   | ✅ unchanged from TASK 1 |
| benchmark not hand-made                     | ✅ MBPP 100, HE 164 full official |
| fallback_count visible                       | ✅ 0/0 |
| B mode not skipped                           | ✅ ran 264/264 |

GREEN on every constraint. The architectural pipeline is sound. The
**model** got worse on MBPP.

---

## What went wrong

The v2 dataset was **310 unique task_ids**, of which 291 overlapped
with v1 (failed in both baseline and post-BD6). The 11 wins from BD6
pass-1 were *removed* from training because they no longer fail; what
v2 added vs v1 was effectively four extra-hard prompts and four extra
training epochs of pressure on the still-failing cases. The model
specialized further on the hard tail and **forgot the easy cases it
had already mastered.**
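
The row counts quoted above fit together by inclusion-exclusion. A quick check, using only the numbers reported in this write-up (the variable names are illustrative):

```python
# Row counts as reported in this write-up.
v1_rows = 306      # poison_train_v1.jsonl, baseline failures
fresh_rows = 295   # post-BD6 fresh failures
overlap = 291      # task_ids that fail in both

union = v1_rows + fresh_rows - overlap   # inclusion-exclusion
new_in_v2 = fresh_rows - overlap         # rows v2 has that v1 lacked

print(union, new_in_v2)  # 310 4
```

So v2 is almost entirely the same hard set as v1, with only four new task_ids.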

Concrete evidence (rows that passed in BD6 pass-1 but regressed in BD6.2):

* MBPP: /14, /17, /20, /52, /53, /64, /96, /105 (at least 8 wins lost)
* HumanEval: /34, /53, /85 (at least 3 wins lost)

Net: MBPP −7, HumanEval +2. Previously passing rows were lost on both
benches while training loss kept falling, which is the
catastrophic-forgetting signature of overtraining a small adapter on a
narrow hard set.
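
Lost wins and net change can be pulled from two sets of passing task ids with plain set differences. The sketch below uses made-up ids, not the real bench results, and `diff_passes` is a hypothetical helper rather than part of the bench tooling.

```python
def diff_passes(before, after):
    """Compare two sets of passing task ids.

    Returns (lost, gained, net): ids that stopped passing, ids that
    started passing, and the net change in pass count.
    """
    lost = sorted(before - after)
    gained = sorted(after - before)
    return lost, gained, len(gained) - len(lost)


# Toy illustration with made-up ids.
pass1 = {"t1", "t2", "t3", "t4"}
bd62 = {"t1", "t5"}
lost, gained, net = diff_passes(pass1, bd62)
print(lost, gained, net)  # ['t2', 't3', 't4'] ['t5'] -2
```

Applied to the numbers above, a net of −7 with at least 8 lost wins implies at least 1 new MBPP pass, and a net of +2 with at least 3 lost wins implies at least 5 new HumanEval passes. The net figure alone understates how many rows changed state.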

---

## Lessons for BD6.3 (when ready)

1. **Don't retrain on the full historical poison union. Train on
   current poison.** The next dataset should be `fresh_failures`
   (post-BD6 misses) plus a *small* anchor subset of confirmed pass-1
   wins held out as positive contrast, not the entire historical poison.
2. **Stop earlier.** Loss had already dropped to 0.30 by epoch 1
   (counting from 0, i.e. the second of four epochs). Epochs 2 and 3
   (loss 0.22 and 0.18) are where the regression happened. 1–2 epochs
   max for the next pass.
3. **Lower learning rate or rank.** lr=2e-4 with r=16 is aggressive for
   ~250 rows. Try lr=1e-4 / r=8 to reduce capacity for memorizing the
   hard tail at the expense of general patterns.
4. **Eval-on-baseline before merging.** Before flipping `PHYS05_PACK`,
   run a quick smoke on the BD6-pass-1 wins. If any of them now fail,
   abort the merge — the LoRA is regressing.
5. **LCB still 0.** Not a surgery failure — dispatcher routes LCB
   prompts through ARIZ before they reach `phys05_code_skeleton`. Fix
   the route classifier (separate task) before counting LCB as a
   surgery target.
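
Lessons 1 and 4 can be sketched as two small helpers. This is a minimal sketch under stated assumptions: it reads lesson 1 as mixing a few pass-1 wins into the training rows, and it takes a caller-supplied `run_task` callable standing in for whatever actually runs one bench task against the candidate pack. `build_training_rows`, `smoke_gate`, `run_task`, and `anchor_count` are all hypothetical names, not existing tooling.

```python
import random


def build_training_rows(fresh_failures, pass1_wins, anchor_count=8, seed=0):
    """Fresh failures plus a small random sample of known wins as anchors.

    anchor_count is an arbitrary illustrative default, not a tuned value.
    """
    rng = random.Random(seed)
    wins = list(pass1_wins)
    anchors = rng.sample(wins, min(anchor_count, len(wins)))
    return list(fresh_failures) + anchors


def smoke_gate(pass1_win_ids, run_task):
    """Re-run every pass-1 win against the candidate pack.

    run_task(task_id) must return True when the task passes.
    Returns the sorted ids that now fail; an empty list means the
    PHYS05_PACK flip may proceed.
    """
    return [tid for tid in sorted(pass1_win_ids) if not run_task(tid)]


# Hypothetical runner: pretend the candidate pack breaks one former win.
regressed = smoke_gate({"w1", "w2", "w3"}, run_task=lambda tid: tid != "w2")
if regressed:
    print("abort merge, regressed:", regressed)
```

Running the gate before retargeting `PHYS05_PACK` keeps a regressing adapter out of the live pack without a full re-bench.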

---

## Production state (after BD6.2)

* `PHYS05_PACK = physarum05b_code_skeleton.planck` (BD6 pass-1, the
  one with MBPP 13/100 / HE 6/164).
* `physarum05b_code_skeleton_v2.planck` kept on disk as a snapshot but
  not the live pack.
* `data/organ_surgery/phys05_code_skeleton/poison_train_v1.jsonl` —
  306 baseline-fail rows, archived.
* `data/organ_surgery/phys05_code_skeleton/poison_train.jsonl` — 295
  post-BD6 fresh-fail rows.
* `data/organ_surgery/phys05_code_skeleton/poison_train_v2.jsonl` —
  310-row union, the one that overtrained. Kept for the BD6.3 lessons
  above.
* `tools/surgery/output/code_skeleton_lora/` — the GOOD adapter
  (BD6 pass-1).
* `tools/surgery/output/code_skeleton_lora_v2/` — the regression
  adapter, kept as the negative result.

---

## Files this pass touched

* `tools/surgery/bench_to_poison_dataset.py` — re-ran (no code change)
* inline merge script (one-shot Python) producing `poison_train_v2.jsonl`
* `tools/surgery/train_code_skeleton_lora.py` — re-ran with `--epochs 4`
* `tools/surgery/merge_code_skeleton_lora.py` — re-ran for v2 outputs
* `physarum05b_code_skeleton_v2.planck` — written, not currently linked
* `src/organs/organ_manager.cpp::PHYS05_PACK` — flipped to v2 then
  flipped back to v1 (production restored)
* `docs/PYTHON_QUARANTINE.md` — written this pass, **stays green**: the
  whole regression cycle was inside the disposable surgery capsule
* `reports/BD6_2_OVERTRAIN_DELTA.md` — this file.
