BD6.2 — overtrain regression on phys05_code_skeleton (2026-05-01)
TL;DR: relative to BD6 pass-1, BD6.2 regressed on MBPP (13/100 → 6/100), gained modestly on HumanEval (6/164 → 8/164), and did not move LCB (0/50). PHYS05_PACK was reverted to the BD6 pass-1 pack as production. The pipeline did not fail; overtraining did. Honest write-up below so the next pass doesn't repeat the same mistake.
Pipeline (PYTHON_QUARANTINE-compliant)
Same disposable Python capsule as BD6 pass-1 — no Python in runtime:
bench failures (post-BD6 reports/MBPP_HE_3MODE_V1.json + LCB)
→ tools/surgery/bench_to_poison_dataset.py [Python]
→ 295 fresh post-BD6 fail rows
inline merge with poison_train_v1.jsonl (306 baseline rows)
→ poison_train_v2.jsonl [310 union, 260 with refs]
→ tools/surgery/train_code_skeleton_lora.py [Python, QLoRA, GPU]
rank=16 alpha=32 lr=2e-4 epochs=4 batch=1 max-len=1024
avg loss: 0.45 → 0.30 → 0.22 → 0.18
→ tools/surgery/output/code_skeleton_lora_v2/ [PEFT adapter]
→ tools/surgery/merge_code_skeleton_lora.py [Python, merge + planck7b_tool]
→ physarum05b_code_skeleton_v2.planck [BF16 pack]
→ src/organs/organ_manager.cpp PHYS05_PACK retargeted
→ make -j4
→ C++ runtime mmaps the new pack
→ Mode-B re-bench (NO_7B_FALLBACK=1)
→ reports/MBPP_HE_3MODE_V1.json overwritten (Mode-B only)
→ THIS REPORT
→ REVERT PHYS05_PACK to v1 (production)
Python's only outputs were the JSONL dataset, the PEFT adapter dir, and the .planck pack. The runtime never imported torch/peft. Quarantine held.
Numbers
| bench         | n   | baseline B | BD6 pass-1 B | BD6.2 B | Δ vs pass-1                   |
|---------------|-----|------------|--------------|---------|-------------------------------|
| MBPP          | 100 | 6/100      | 13/100       | 6/100   | −7 (regression)               |
| HumanEval     | 164 | 2/164      | 6/164        | 8/164   | +2                            |
| LiveCodeBench | 50  | 0/50       | 0/50         | 0/50    | 0 (dispatcher-leak unchanged) |
organs_used_set for MBPP and HumanEval: {phys05_code_skeleton} only. fallback_count for both: 0, so quarantine held end-to-end. No 7B leak.
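The two invariants above can be checked mechanically after each re-bench. The following is a minimal sketch, assuming each bench entry exposes `organs_used_set` and `fallback_count` under those names; the surrounding dict layout and the `check_quarantine` helper are hypothetical and do not reflect the actual schema of reports/MBPP_HE_3MODE_V1.json.

```python
ALLOWED_ORGANS = {"phys05_code_skeleton"}

def check_quarantine(bench_summary: dict) -> list[str]:
    """Return a list of violations; an empty list means quarantine held."""
    violations = []
    for bench, stats in bench_summary.items():
        unexpected = set(stats["organs_used_set"]) - ALLOWED_ORGANS
        if unexpected:
            violations.append(f"{bench}: unexpected organs {sorted(unexpected)}")
        if stats["fallback_count"] != 0:
            violations.append(f"{bench}: fallback_count={stats['fallback_count']}")
    return violations

# Hypothetical layout filled with the values reported above.
summary = {
    "MBPP": {"organs_used_set": ["phys05_code_skeleton"], "fallback_count": 0},
    "HumanEval": {"organs_used_set": ["phys05_code_skeleton"], "fallback_count": 0},
}
print(check_quarantine(summary))
```

An empty list corresponds to the "quarantine held" state; any extra organ or non-zero fallback count shows up as a named violation rather than a silent pass.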
TASK 5 constraints, BD6.2
| constraint                           | post-surgery                                                              |
|--------------------------------------|---------------------------------------------------------------------------|
| 0.5B organs used                     | ✅ phys05_code_skeleton only                                              |
| BD written                           | ✅ MBPP-B 86/100 envelopes, HE-B 117/164 envelopes carry food/poison/cond |
| no route falls through wrong handler | ✅                                                                        |
| no json_repair → ariz_e2e            | ✅ unchanged from TASK 1                                                  |
| benchmark not hand-made              | ✅ MBPP 100, HE 164 full official                                         |
| fallback_count visible               | ✅ 0/0                                                                    |
| B mode not skipped                   | ✅ ran 264/264                                                            |
GREEN on every constraint. The architectural pipeline is sound. The model got worse on MBPP.
What went wrong
The v2 dataset was 310 unique task_ids, of which 291 overlapped with v1 (failed in both baseline and post-BD6). The 11 wins from BD6 pass-1 were removed from training because they no longer fail; what v2 added vs v1 was effectively four extra-hard prompts and four extra training epochs of pressure on the still-failing cases. The model specialized further on the hard tail and forgot the easy cases it had already mastered.
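The row counts follow from inclusion-exclusion: 306 + 295 − 291 = 310, and 295 − 291 = 4 rows that are new in v2. The sketch below reproduces that arithmetic with a union keyed on task_id. The ids are synthetic and only sized to match the report, and the `union_by_task_id` helper, including its "fresh row wins on overlap" rule, is an assumption; it is not the actual inline merge script.

```python
def union_by_task_id(v1_rows, fresh_rows):
    """Union two row lists keyed on task_id; on overlap the fresh row wins (assumed rule)."""
    merged = {row["task_id"]: row for row in v1_rows}
    merged.update({row["task_id"]: row for row in fresh_rows})
    return list(merged.values())

# Synthetic ids sized to match the counts in this report.
v1 = [{"task_id": f"t{i}"} for i in range(306)]         # 306 baseline-fail rows
fresh = [{"task_id": f"t{i}"} for i in range(15, 310)]  # 295 post-BD6 fail rows

v1_ids = {r["task_id"] for r in v1}
fresh_ids = {r["task_id"] for r in fresh}
overlap = len(v1_ids & fresh_ids)   # rows failing in both passes
new_only = len(fresh_ids - v1_ids)  # rows that are new in v2
union = union_by_task_id(v1, fresh)

print(len(union), overlap, new_only)  # prints: 310 291 4
```

With only 4 new rows in 310, almost all of the extra training signal in v2 came from re-presenting rows the adapter had already seen.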
Concrete evidence (rows that passed in BD6 pass-1 but regressed in BD6.2):
- MBPP: /14, /17, /20, /52, /53, /64, /96, /105 — at minimum 8 wins lost
- HumanEval: /34, /53, /85 — at least 3 wins lost
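Net: MBPP −7, HumanEval +2. Because at least 8 MBPP wins and at least 3 HumanEval wins were lost, BD6.2 must also have gained at least 1 new MBPP pass and at least 5 new HumanEval passes. Trading previously mastered rows for new ones is consistent with catastrophic forgetting from overtraining a small adapter on a narrow hard set.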
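The lost/gained split behind a net delta can be computed as a pair of set differences over the passing task ids of two runs. This is a minimal sketch with toy ids; `diff_wins` is a hypothetical helper and the sets are not the real bench results.

```python
def diff_wins(before: set[str], after: set[str]) -> dict:
    """Compare passing task ids between two bench runs."""
    lost = before - after    # passed before, fail now
    gained = after - before  # failed before, pass now
    return {
        "lost": sorted(lost),
        "gained": sorted(gained),
        "net": len(gained) - len(lost),
    }

# Toy ids only: three wins lost, one gained.
pass1_passing = {"a", "b", "c", "d"}
bd62_passing = {"d", "e"}
print(diff_wins(pass1_passing, bd62_passing))
# prints: {'lost': ['a', 'b', 'c'], 'gained': ['e'], 'net': -2}
```

Reporting `lost` and `gained` separately matters because a small net change can hide a large churn in which rows pass.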
Lessons for BD6.3 (when ready)
- **Don't train on the union of poison+already-mastered rows. Train on poison only.** The next dataset should be fresh_failures (post-BD6 misses) plus a small anchor subset of confirmed pass-1 wins held out as positive contrast, not the entire historical poison.
- Stop earlier. Counting epochs from 0, loss had already dropped to 0.30 by epoch 1 on this data (trace: 0.45 → 0.30 → 0.22 → 0.18). Epochs 2 and 3 are where the regression happened. 1–2 epochs max for the next pass.
- Lower learning rate or rank. lr=2e-4 with r=16 is aggressive for ~250 rows. Try lr=1e-4 / r=8 to reduce capacity for memorizing the hard tail at the expense of general patterns.
- Eval-on-baseline before merging. Before flipping PHYS05_PACK, run a quick smoke on the BD6-pass-1 wins. If any of them now fail, abort the merge: the LoRA is regressing.
- LCB still 0. Not a surgery failure: the dispatcher routes LCB prompts through ARIZ before they reach phys05_code_skeleton. Fix the route classifier (separate task) before counting LCB as a surgery target.
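Two of the lessons above, the anchored dataset and the pre-merge smoke check, could be sketched as follows. This is an illustration under stated assumptions, not the BD6.3 tooling: the helper names, the `role` field, and the rule that a missing smoke result counts as a failure are all hypothetical.

```python
import random

def build_next_dataset(fresh_failures, pass1_wins, n_anchor, seed=0):
    """Fresh failures plus a small, reproducible sample of confirmed wins."""
    rng = random.Random(seed)
    # Sort before sampling so the draw does not depend on set iteration order.
    anchors = rng.sample(sorted(pass1_wins), k=min(n_anchor, len(pass1_wins)))
    rows = [{"task_id": t, "role": "poison"} for t in fresh_failures]
    rows += [{"task_id": t, "role": "anchor"} for t in anchors]
    return rows

def smoke_gate(pass1_wins, smoke_results):
    """Return pass-1 wins that fail under the candidate adapter.

    A task with no recorded result is treated as failing (assumed policy).
    """
    return sorted(t for t in pass1_wins if not smoke_results.get(t, False))

# Toy ids only.
wins = {"w1", "w2", "w3"}
dataset = build_next_dataset(["f1", "f2"], wins, n_anchor=2)
regressed = smoke_gate(wins, {"w1": True, "w2": False, "w3": True})
if regressed:
    print("abort merge, regressed:", regressed)
```

Running the gate before flipping the pack turns the regression found in this pass into a cheap pre-merge check instead of a full re-bench finding.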
Production state (after BD6.2)
- PHYS05_PACK = physarum05b_code_skeleton.planck (BD6 pass-1, the one with MBPP 13/100 / HE 6/164).
- physarum05b_code_skeleton_v2.planck: kept on disk as a snapshot but not the live pack.
- data/organ_surgery/phys05_code_skeleton/poison_train_v1.jsonl: 306 baseline-fail rows, archived.
- data/organ_surgery/phys05_code_skeleton/poison_train.jsonl: 295 post-BD6 fresh-fail rows.
- data/organ_surgery/phys05_code_skeleton/poison_train_v2.jsonl: 310-row union, the one that overtrained. Kept for the BD6.3 lessons above.
- tools/surgery/output/code_skeleton_lora/: the GOOD adapter (BD6 pass-1).
- tools/surgery/output/code_skeleton_lora_v2/: the regression adapter, kept as the negative result.
Files this pass touched
- tools/surgery/bench_to_poison_dataset.py: re-ran (no code change)
- inline merge script (one-shot Python) producing poison_train_v2.jsonl
- tools/surgery/train_code_skeleton_lora.py: re-ran with --epochs 4
- tools/surgery/merge_code_skeleton_lora.py: re-ran for v2 outputs
- physarum05b_code_skeleton_v2.planck: written, not currently linked
- src/organs/organ_manager.cpp::PHYS05_PACK: flipped to v2 then flipped back to v1 (production restored)
- docs/PYTHON_QUARANTINE.md: written this pass; stays green because the whole regression cycle was inside the disposable surgery capsule
- reports/BD6_2_OVERTRAIN_DELTA.md: this file.