# BD6 — phys05_code_skeleton surgery delta vs frozen baseline

_Last verified: 2026-05-03 (data: MBPP_HE @ 2026-05-01 22:26, LCB @ 2026-05-01 20:23; baseline frozen in `BENCH_CLEANUP_AND_OFFICIAL_RUN.md`)._

The first organ-only surgery (BD6 pass-1) measured against the frozen
BENCH_CLEANUP_AND_OFFICIAL_RUN baseline. BD6.2 follow-up overtrained
and was reverted — current production pack is BD6 pass-1.

---

## Pipeline (BD6 pass-1)

1. **Poison harvest** — `tools/surgery/bench_to_poison_dataset.py` reads Mode-B failure rows from `reports/MBPP_HE_3MODE_V1.json` and `reports/LIVECODEBENCH_3MODE_V1.json`, joins with official MBPP / HumanEval / LCB reference solutions. Result: **306 poison rows, 256 with reference targets** (LCB ships no canonical solutions). Saved to `data/organ_surgery/phys05_code_skeleton/poison_train.jsonl`.

2. **QLoRA training** — `tools/surgery/train_code_skeleton_lora.py`. Base = `/home/pc/gigachad/qwen05/physarum` (BF16 0.5B HF dir). r=16, α=32, lr=2e-4, 3 epochs, batch=1, max-len=1024, target_modules=q/k/v/o_proj. 256 rows × 3 epochs = 768 steps. Avg loss: 0.44 → 0.30 → 0.21. Trainable params 2.16 M / 496 M (0.44 %).

3. **Merge + repack** — `tools/surgery/merge_code_skeleton_lora.py`. PEFT `merge_and_unload` → BF16 HF `Physarum05B-CodeSkeleton/` → `build/planck7b_tool build` → `physarum05b_code_skeleton.planck` (988 MB BF16, same shape as baseline pack).

4. **Pack flip + rebuild** — `src/organs/organ_manager.cpp:31` `PHYS05_PACK` retargeted; `make -j4` rebuilds `gigachad_native`.

5. **Mode-B re-run** — `mbpp_he_3mode.py --modes B` and `livecodebench_3mode.py --modes B`. Same prompts, same harness, same NO_7B_FALLBACK gate.

---

## Headline delta (current production pack vs frozen baseline)

| bench         | n   | baseline B | post-surgery B | Δ abs | Δ rel    | wall before | wall after |
|---------------|-----|------------|----------------|-------|----------|-------------|------------|
| MBPP          | 100 | 6/100      | **13/100**     | **+7** | **+117 %** | 550 s       | **286.6 s** |
| HumanEval     | 164 | 2/164      | **6/164**      | **+4** | **+200 %** | 1083 s      | **570.4 s** |
| LiveCodeBench | 50  | 0/50       | 0/50           | 0     | 0 %      | 7 s         | 431.4 s    |

**MBPP doubled. HumanEval tripled. LCB unchanged** (LCB prompts route through ARIZ → unsupported under `NO_7B_FALLBACK=1`; surgery doesn't touch that lane — confirmed by `organs=phys05_triz_contradiction` rather than `phys05_code_skeleton` in LCB-B logs).

`organs_used_set` for both improved benches: `{phys05_code_skeleton}` only.
`fallback_count` for both: **0** — no 7B leaked.

`bd_signal_count` (rows that wrote DAG envelope with food/poison/conductance): MBPP 88, HumanEval 117, LCB 32 — the BD3 poison stream for the next surgery pass is alive.

---

## BD6.2 cycle summary (snapshot retained, pack reverted)

A second pass tried to ride the same recipe on the post-BD6 failures (310 union rows, 4 epochs):

| bench         | baseline B | BD6.2 snapshot | vs BD6 pass-1 |
|---------------|------------|----------------|---------------|
| MBPP          | 6/100      | 6/100          | **−7 (regress)** |
| HumanEval     | 2/164      | 8/164          | +2 (modest)   |
| LCB           | 0/50       | 0/50           | 0             |

Net: net regression on the bench we cared about most (MBPP). Pack reverted to BD6 pass-1; snapshot retained at `reports/MBPP_HE_3MODE_V1_bd6_2_snapshot.json` for autopsy. Full write-up in `reports/BD6_2_OVERTRAIN_DELTA.md`.

---

## TASK 5 constraints check (current production = BD6 pass-1)

| constraint                                  | status |
|---------------------------------------------|--------------|
| 0.5B organs used                            | ✅ `phys05_code_skeleton` only |
| BD written                                   | ✅ MBPP-B 88, HE-B 117, LCB-B 32 envelopes carry food/poison/conductance |
| no route falls through wrong handler        | ✅ `route=code_fast` for in-lane MBPP/HE prompts |
| no json_repair → ariz_e2e                   | ✅ unchanged from TASK 1 |
| benchmark not hand-made easy subset         | ✅ MBPP n=100 official, HumanEval n=164 full official, LCB official easy n=50 |
| fallback_count visible                       | ✅ 0/0/0 |
| B mode not skipped                           | ✅ ran on all 314 prompts |

---

## What this proves

* **The BD3 → BD6 pipeline works end-to-end.** Bench failures → poison dataset → QLoRA → merge → repack → Mode-B re-run → measurable improvement on the same official benchmarks. No new architecture, no new toy benches.
* **The 0.5B can be made better with self-collected failures.** Loss curve drop (0.44 → 0.21) and the +7 / +4 absolute gains on MBPP / HumanEval are real on official splits.
* **Output quality shifted from "no-code" to "compile-fail / wrong-answer".** In the baseline, most B-mode failures were syntactic refusals. Post-surgery, many failures are now `AssertionError` / runtime errors — the model produces the function but gets logic wrong. That's the next layer of poison to harvest.
* **LCB unchanged is honest.** Bench-time dispatcher routes LCB prompts to ARIZ before they ever see `phys05_code_skeleton`. Fixing requires either widening `looks_like_humaneval` to capture competitive-programming shape, or gating Mode-B-on-LCB to bypass dispatcher. Out of scope for BD6.
* **BD6.2 told us the recipe has a ceiling.** Same poison, more epochs, larger union set → MBPP regression. Next pass needs a different lever (per-bench stratified poison, KL-anchor, asymmetric holdout) — see BD6.7/BD6.8D ladder.

---

## Targets (per user spec)

| target                              | hit?  |
|-------------------------------------|-------|
| MBPP B-mode 6/100 → 25/100 first    | partial: 13/100 (+7 of +19 needed). |
| HumanEval B-mode 2/164 → 20/164 first | partial: 6/164 (+4 of +18 needed). |
| LCB B-mode 0/50 → 5/50 first         | not hit: 0/50 — dispatcher routing fix is prerequisite, not more surgery. |

---

## Files written / changed (BD6 pass-1)

* `tools/surgery/bench_to_poison_dataset.py` — harvests Mode-B failures + ground truth
* `tools/surgery/train_code_skeleton_lora.py` — QLoRA on 0.5B
* `tools/surgery/merge_code_skeleton_lora.py` — merge + planck7b_tool repack
* `data/organ_surgery/phys05_code_skeleton/poison_train.jsonl` — 306 rows
* `tools/surgery/output/code_skeleton_lora/` — PEFT adapter
* `tools/surgery/output/Physarum05B-CodeSkeleton/` — merged BF16 HF dir
* `physarum05b_code_skeleton.planck` — 988 MB BF16 production pack
* `src/organs/organ_manager.cpp` — `PHYS05_PACK` retargeted
* `reports/MBPP_HE_3MODE_V1.{md,json}` — Mode-B post-surgery rows
* `reports/LIVECODEBENCH_3MODE_V1.{md,json}` — Mode-B post-surgery rows
* `reports/MBPP_HE_3MODE_V1_bd6_2_snapshot.json` — BD6.2 autopsy snapshot
* `reports/BD6_2_OVERTRAIN_DELTA.md` — BD6.2 honest write-up
* `reports/BD6_POST_SURGERY_DELTA.md` — this file
