BD6.2 — overtrain regression on phys05_code_skeleton (2026-05-01)

TL;DR: BD6.2 regressed on MBPP (13/100 → 6/100, back to baseline), gained modestly on HumanEval (6/164 → 8/164), and did not move LCB (0/50). PHYS05_PACK was reverted to the BD6 pass-1 pack as production. The pipeline did not fail; overtraining did. Honest write-up below so the next pass doesn't repeat the same mistake.

Pipeline (PYTHON_QUARANTINE-compliant)

Same disposable Python capsule as BD6 pass-1 — no Python in runtime:

bench failures (post-BD6 reports/MBPP_HE_3MODE_V1.json + LCB)
  → tools/surgery/bench_to_poison_dataset.py            [Python]
    → 295 fresh post-BD6 fail rows
inline merge with poison_train_v1.jsonl (306 baseline rows)
  → poison_train_v2.jsonl                               [310 union, 260 with refs]
  → tools/surgery/train_code_skeleton_lora.py           [Python, QLoRA, GPU]
    rank=16 alpha=32 lr=2e-4 epochs=4 batch=1 max-len=1024
    avg loss: 0.45 → 0.30 → 0.22 → 0.18
  → tools/surgery/output/code_skeleton_lora_v2/         [PEFT adapter]
    → tools/surgery/merge_code_skeleton_lora.py         [Python, merge + planck7b_tool]
      → physarum05b_code_skeleton_v2.planck             [BF16 pack]
        → src/organs/organ_manager.cpp PHYS05_PACK retargeted
          → make -j4
            → C++ runtime mmaps the new pack
              → Mode-B re-bench (NO_7B_FALLBACK=1)
                → reports/MBPP_HE_3MODE_V1.json overwritten (Mode-B only)
                  → THIS REPORT
                    → REVERT PHYS05_PACK to v1 (production)

Python's only outputs were the JSONL dataset, the PEFT adapter dir, and the .planck pack. The runtime never imported torch/peft. Quarantine held.
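The inline merge step can be pictured with a minimal sketch like the one below. It is not the one-shot script that was actually run: the function names are hypothetical, and it assumes every JSONL row carries a string `task_id` field and that a fresh row should override a baseline row with the same id.

```python
import json
from pathlib import Path


def load_rows(path):
    """Read a JSONL file into a dict keyed by task_id (later duplicates win)."""
    rows = {}
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            row = json.loads(line)
            rows[row["task_id"]] = row  # assumption: task_id is present and a string
    return rows


def merge_union(baseline_path, fresh_path, out_path):
    """Write the task_id union of two poison files and return its row count."""
    merged = load_rows(baseline_path)
    merged.update(load_rows(fresh_path))  # assumption: fresh rows take precedence
    with Path(out_path).open("w", encoding="utf-8") as fh:
        for task_id in sorted(merged):
            fh.write(json.dumps(merged[task_id]) + "\n")
    return len(merged)
```

Called as `merge_union("poison_train_v1.jsonl", "poison_train.jsonl", "poison_train_v2.jsonl")`, a merge of this shape would be expected to return the union size, which this pass reports as 310.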

Numbers

| bench         | n   | baseline B | BD6 pass-1 B | BD6.2 B | Δ vs pass-1                   |
|---------------|-----|------------|--------------|---------|-------------------------------|
| MBPP          | 100 | 6/100      | 13/100       | 6/100   | −7 (regression)               |
| HumanEval     | 164 | 2/164      | 6/164        | 8/164   | +2                            |
| LiveCodeBench | 50  | 0/50       | 0/50         | 0/50    | 0 (dispatcher-leak unchanged) |

organs_used_set for both benches that reach the organ (MBPP, HumanEval): {phys05_code_skeleton} only. fallback_count for both: 0, so quarantine held end-to-end. No 7B leak.
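A quarantine check of this kind could be automated roughly as follows. This is a sketch under assumptions: the per-bench entry layout is hypothetical (the real schema of reports/MBPP_HE_3MODE_V1.json is not shown here); only the field names `organs_used_set` and `fallback_count` come from this report.

```python
ALLOWED_ORGANS = frozenset({"phys05_code_skeleton"})


def quarantine_held(entry, allowed=ALLOWED_ORGANS):
    """True if the entry used only allowed organs and never fell back.

    `entry` is a hypothetical per-bench dict with `organs_used_set`
    (iterable of organ names) and `fallback_count` (int).
    """
    organs = set(entry.get("organs_used_set", []))
    no_fallback = entry.get("fallback_count") == 0  # a missing count is a failure
    return no_fallback and bool(organs) and organs <= allowed


# Hypothetical entries mirroring the values reported above.
report = {
    "MBPP": {"organs_used_set": ["phys05_code_skeleton"], "fallback_count": 0},
    "HumanEval": {"organs_used_set": ["phys05_code_skeleton"], "fallback_count": 0},
}
print(all(quarantine_held(entry) for entry in report.values()))
```

Treating a missing `fallback_count` as a failure is a deliberate choice in this sketch: the constraint is that the count is visible, not merely that it is zero.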

TASK 5 constraints, BD6.2

| constraint                           | post-surgery                                                                |
|--------------------------------------|-----------------------------------------------------------------------------|
| 0.5B organs used                     | ✅ phys05_code_skeleton only                                                |
| BD written                           | ✅ MBPP-B 86/100 envelopes, HE-B 117/164 envelopes carry food/poison/cond   |
| no route falls through wrong handler | ✅                                                                          |
| no json_repair → ariz_e2e            | ✅ unchanged from TASK 1                                                    |
| benchmark not hand-made              | ✅ MBPP 100, HE 164 full official                                           |
| fallback_count visible               | ✅ 0/0                                                                      |
| B mode not skipped                   | ✅ ran 264/264                                                              |

GREEN on every constraint. The architectural pipeline is sound. The model got worse on MBPP.

What went wrong

The v2 dataset was 310 unique task_ids, of which 291 overlapped with v1 (failed in both baseline and post-BD6). The 11 wins from BD6 pass-1 were removed from training because they no longer fail; what v2 added vs v1 was effectively four extra-hard prompts and four extra training epochs of pressure on the still-failing cases. The model specialized further on the hard tail and forgot the easy cases it had already mastered.
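The row counts above can be cross-checked with inclusion-exclusion. The sketch below assumes each poison file holds exactly one row per failing task across the three benches; that assumption is not stated in this report, but it reproduces the 306 and 295 row counts from the Numbers table.

```python
# Bench sizes and Mode-B pass counts, copied from the Numbers table.
bench_n = {"MBPP": 100, "HumanEval": 164, "LiveCodeBench": 50}
baseline_pass = {"MBPP": 6, "HumanEval": 2, "LiveCodeBench": 0}
pass1_pass = {"MBPP": 13, "HumanEval": 6, "LiveCodeBench": 0}

# Assumption: one poison row per failing task.
v1_rows = sum(bench_n[b] - baseline_pass[b] for b in bench_n)   # baseline fails
fresh_rows = sum(bench_n[b] - pass1_pass[b] for b in bench_n)   # post-BD6 fails
union_rows = 310                                                # reported v2 size

# Inclusion-exclusion: |A ∪ B| = |A| + |B| - |A ∩ B|
overlap = v1_rows + fresh_rows - union_rows
fresh_only = fresh_rows - overlap   # rows new in v2 relative to v1
v1_only = v1_rows - overlap         # rows that failed at baseline but not post-BD6

print(v1_rows, fresh_rows, overlap, fresh_only, v1_only)  # 306 295 291 4 15
```

Under that assumption the 291-row overlap and the four new prompts both follow directly from the reported totals, and 15 rows of the union are baseline failures that no longer failed after BD6 pass-1.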

Concrete evidence (rows that passed in BD6 pass-1 but regressed in BD6.2):

MBPP: /14, /17, /20, /52, /53, /64, /96, /105 — at minimum 8 wins lost
HumanEval: /34, /53, /85 — at least 3 wins lost

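Net: MBPP −7, HumanEval +2. With at least 8 MBPP wins lost, a net of −7 means at least 1 new MBPP win; with at least 3 HumanEval wins lost, a net of +2 means at least 5 new ones. Losing already-mastered cases while gaining a few others is the catastrophic-forgetting signature of overtraining a small adapter on a narrow hard set.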
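The lost/gained bookkeeping behind these figures can be sketched as a set difference between two passes. The helper name and the toy task ids below are hypothetical; the real per-task pass sets live in the bench reports and are not reproduced here.

```python
def pass_delta(before_pass, after_pass):
    """Compare two collections of passing task ids.

    Returns (lost, gained, net), where net = len(gained) - len(lost).
    """
    before, after = set(before_pass), set(after_pass)
    lost = sorted(before - after)     # passed before, fail now
    gained = sorted(after - before)   # fail before, pass now
    return lost, gained, len(gained) - len(lost)


# Toy ids only, to show the shape of the result.
lost, gained, net = pass_delta(before_pass=[14, 17, 20, 52], after_pass=[20, 99])
print(lost, gained, net)  # [14, 17, 52] [99] -2
```

A net figure alone hides churn: a small net change can mask many lost wins offset by new ones, which is why the lost list is worth reporting separately.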

Lessons for BD6.3 (when ready)

**Don't train on the union of poison+already-mastered rows. Train on poison only.** The next dataset should be fresh_failures (post-BD6 misses) plus a small anchor subset of confirmed pass-1 wins held out as positive contrast, not the entire historical poison.

Stop earlier. Loss had already dropped to 0.30 by the second of the four epochs (0.45 → 0.30 → 0.22 → 0.18) on this data. The last two epochs are where the regression happened. 1–2 epochs max for the next pass.

Lower learning rate or rank. lr=2e-4 with r=16 is aggressive for ~250 rows. Try lr=1e-4 / r=8 to reduce capacity for memorizing the hard tail at the expense of general patterns.

Eval-on-baseline before merging. Before flipping PHYS05_PACK, run a quick smoke on the BD6-pass-1 wins. If any of them now fail, abort the merge: the LoRA is regressing.

LCB still 0. Not a surgery failure: the dispatcher routes LCB prompts through ARIZ before they reach phys05_code_skeleton. Fix the route classifier (separate task) before counting LCB as a surgery target.
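The eval-on-baseline lesson above could take the form of a small gate that runs before PHYS05_PACK is flipped. This is a sketch, not existing tooling: `smoke_gate`, `RegressionDetected`, the `passes` callback and the `MBPP/14`-style id format are all hypothetical, and the callback stands in for whatever actually runs one task against the candidate pack.

```python
class RegressionDetected(RuntimeError):
    """Raised when a previously passing task fails under the candidate pack."""


def smoke_gate(anchor_task_ids, passes):
    """Run every known pass-1 win through `passes`; abort on any failure.

    `passes(task_id) -> bool` is a caller-supplied (hypothetical) callback.
    Returns the number of anchors checked when all of them still pass.
    """
    failed = [tid for tid in anchor_task_ids if not passes(tid)]
    if failed:
        raise RegressionDetected(
            f"{len(failed)} pass-1 win(s) now fail, aborting merge: {failed}"
        )
    return len(anchor_task_ids)


if __name__ == "__main__":
    # Stub callback for illustration only: pretends one anchor regressed.
    anchors = ["MBPP/14", "MBPP/17", "HumanEval/34"]
    try:
        smoke_gate(anchors, passes=lambda tid: tid != "MBPP/14")
    except RegressionDetected as exc:
        print(exc)
```

Raising instead of returning a flag is intentional here: a merge script that calls the gate cannot silently continue past a regression.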

Production state (after BD6.2)

PHYS05_PACK = physarum05b_code_skeleton.planck (BD6 pass-1, the one with MBPP 13/100 / HE 6/164).
physarum05b_code_skeleton_v2.planck kept on disk as a snapshot but not the live pack.
data/organ_surgery/phys05_code_skeleton/poison_train_v1.jsonl: 306 baseline-fail rows, archived.
data/organ_surgery/phys05_code_skeleton/poison_train.jsonl: 295 post-BD6 fresh-fail rows.
data/organ_surgery/phys05_code_skeleton/poison_train_v2.jsonl: 310-row union, the one that overtrained. Kept for the BD6.3 lessons above.
tools/surgery/output/code_skeleton_lora/: the GOOD adapter (BD6 pass-1).
tools/surgery/output/code_skeleton_lora_v2/: the regression adapter, kept as the negative result.

Files this pass touched

tools/surgery/bench_to_poison_dataset.py: re-ran (no code change)
inline merge script (one-shot Python) producing poison_train_v2.jsonl
tools/surgery/train_code_skeleton_lora.py: re-ran with --epochs 4
tools/surgery/merge_code_skeleton_lora.py: re-ran for v2 outputs
physarum05b_code_skeleton_v2.planck: written, not currently linked
src/organs/organ_manager.cpp::PHYS05_PACK: flipped to v2, then flipped back to v1 (production restored)
docs/PYTHON_QUARANTINE.md: written this pass, stays green; the whole regression cycle was inside the disposable surgery capsule
reports/BD6_2_OVERTRAIN_DELTA.md: this file.