# BD6.8D-rank — capacity bump fails, BD6.x code_skeleton FROZEN (2026-05-02)

**TL;DR: r=16/alpha=32 with the BD6.8D winning policy (MBPP/53 ×2,
HE/85 ×2, long-poison ×0.5) scored 12/19 on the anchor gate, WORSE than
r=8's 15/19. Doubling the LoRA rank (2× adapter parameters, same
alpha/r ratio) lowered average CE from 0.54 to 0.47, so the model fits
the training distribution more tightly, but anchor pass-rate fell
because the extra capacity introduced new drift on previously stable
rows (MBPP/52 regressed for the first time across ALL BD6.x). Strict
gate rejects. Production reverted to BD6 pass-1 (anchor 19/19).**

**Per spec: BD6.x code_skeleton surgery cycle is FROZEN at current
production state. BD6.5 is archived as best rejected LoRA artifact.
Next surgery target: phys05_triz_contradiction.**

---

## Final result on the last attempt

| pass | rank | alpha | lr | policy | anchor | gate |
|------|------|-------|----|----|------|------|
| BD6.5 (peak)  | 8  | 16 | 5e-5 | no weighting | 15/19 | REVERT |
| BD6.8D        | 8  | 16 | 5e-5 | {2,2,0.5} | 15/19 | REVERT (MBPP/53 fixed!) |
| **BD6.8D-rank** | **16** | **32** | **5e-5** | {2,2,0.5} | **12/19** | **REVERT (final, freeze)** |

```
BD6.8D-rank v9 anchor (r=16, α=32):
  KEPT (12):
    ✓ MBPP/17, MBPP/19, MBPP/20, MBPP/41, MBPP/51,
    ✓ MBPP/64, MBPP/90, MBPP/93, MBPP/96, MBPP/99, MBPP/105,
    ✓ HumanEval/53
  LOST (7):
    ✗ MBPP/52        ← NEW REGRESSION (was stable across ALL BD6.x)
    ✗ MBPP/53        ← BD6.8D had recovered this; r=16 lost it again
    ✗ HumanEval/23   ← was stable in BD6.5/8D, lost in 8D2 and now here
    ✗ HumanEval/27
    ✗ HumanEval/34   ← decoder-noise (BD6.8F+)
    ✗ HumanEval/45   ← decoder-noise (BD6.8F+)
    ✗ HumanEval/85
```

avg_ce=0.4660 (vs r=8's 0.5429), avg_loss=0.4516: the model FITS the
training data better at r=16 but generalizes WORSE on the anchor
verifier. Lower training loss combined with a lower anchor pass-rate is
the classic overfitting signature.
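To make the regression concrete, the r=16 lost set can be diffed against BD6.5's four holdouts (both lists are taken from this report). This is a minimal sketch; the helper name `diff_losses` is hypothetical, and `HE/` is normalised to `HumanEval/` so the IDs compare equal.

```python
# Anchor IDs copied from this report; "HE/" is spelled out as "HumanEval/".
BD65_LOST = {"MBPP/53", "HumanEval/34", "HumanEval/45", "HumanEval/85"}
RANK_LOST = {
    "MBPP/52", "MBPP/53", "HumanEval/23", "HumanEval/27",
    "HumanEval/34", "HumanEval/45", "HumanEval/85",
}

def diff_losses(baseline_lost, candidate_lost):
    """Return (newly lost, recovered) anchors of candidate relative to baseline."""
    newly_lost = sorted(candidate_lost - baseline_lost)
    recovered = sorted(baseline_lost - candidate_lost)
    return newly_lost, recovered

newly_lost, recovered = diff_losses(BD65_LOST, RANK_LOST)
print(newly_lost)  # ['HumanEval/23', 'HumanEval/27', 'MBPP/52']
print(recovered)   # []
```

Read this way, r=16 kept every BD6.5 failure and added three more, which accounts for the drop from 15/19 to 12/19.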

## Why r=16 didn't help

* The LoRA had enough room to fit the holdout patterns more aggressively,
  but ALSO enough room to drift on every other anchor. The poison
  gradient at long-target rows now finds more capacity to perturb
  hidden states; rep-penalty + ngram blocker at runtime can't fully
  compensate.
* MBPP/52 (72-char, very simple anchor) regressed for the first time
  across the entire BD6.x cycle. That's the canary: extra capacity
  destabilizes even the easiest rows.
* Lowering lr to 3e-5 might have helped, but spec marks this as the
  LAST attempt before freeze. Per protocol, we don't keep tuning.
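For a sense of what the rank bump actually changes: LoRA adds an `r × d_in` matrix and a `d_out × r` matrix per adapted weight, so trainable parameters grow linearly with `r`, while the conventional update scale is `alpha / r`. The sketch below assumes that standard formulation and a hypothetical 1024×1024 projection; the organ's real layer shapes are not stated in this report.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    # A has shape (rank, d_in) and B has shape (d_out, rank).
    return rank * (d_in + d_out)

def lora_scale(alpha: float, rank: int) -> float:
    """Conventional LoRA scaling applied to the low-rank update B @ A."""
    return alpha / rank

D = 1024  # hypothetical layer width, for illustration only
p8 = lora_param_count(D, D, rank=8)
p16 = lora_param_count(D, D, rank=16)
print(p8, p16, p16 / p8)                        # 16384 32768 2.0
print(lora_scale(16, 8), lora_scale(32, 16))    # 2.0 2.0
```

Under these assumptions, moving from r=8/alpha=16 to r=16/alpha=32 doubles the adapter parameters while leaving the update scale unchanged, so the comparison isolates capacity rather than effective step size.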

## The BD6.x cycle, complete picture

| pass        | lever                                  | r/α | anchor | gate     |
|-------------|----------------------------------------|-----|--------|----------|
| BD6 pass-1  | poison v1, no anchor                   | 16/32 | 19/19 | KEEP     |
| BD6.3       | fresh poison only                      | 8/16  | 0/19  | REVERT   |
| BD6.4       | + 5× anchor replication                | 8/16  | 7/19  | REVERT   |
| BD6.5       | + bench-aware repl + stratified, 53 % anchor | 8/16  | **15/19** | REVERT (peak) |
| BD6.6       | + holdout × 50, 63 % anchor            | 8/16  | 11/19 | REVERT (over-anchor) |
| BD6.7 a/b/c | + KL anchor λ=0.10/0.20/0.05           | 8/16  | 12/10/12 | REVERT (KL redundant) |
| BD6.8D      | + token-weighted CE {2,2,0.5}          | 8/16  | **15/19** | REVERT (MBPP/53 fixed) |
| BD6.8D2     | + asymmetric weights {2,4,split}       | 8/16  | 13/19 | REVERT (over-tuned) |
| **BD6.8D-rank** | **same {2,2,0.5} policy**           | **16/32** | **12/19** | **REVERT (FREEZE)** |

**Three orthogonal levers** (replication, KL, token-weighted CE) **all
topped out at or below 15/19**: replication and token-weighted CE peaked
at exactly 15/19, KL at 12/19.
**Capacity** also failed to lift it. The ceiling is real.
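As an illustration of the token-weighted CE lever, here is a minimal sketch of one plausible reading of the `{2,2,0.5}` policy: every target token's cross-entropy is scaled by the weight of the row it belongs to, and the sum is normalised by total weight. The tag names and the function `weighted_token_ce` are hypothetical; the actual logic in `train_code_skeleton_lora_bd6_8d.py` may differ.

```python
import math

# Row-class weights from the {2, 2, 0.5} policy; tag names are hypothetical.
POLICY = {"mbpp53": 2.0, "he85": 2.0, "long_poison": 0.5}

def weighted_token_ce(rows, policy=POLICY):
    """Weighted mean token cross-entropy.

    rows: list of (tag, target_probs), where target_probs holds the model's
    probability for the correct token at each position. Untagged rows get 1.0.
    """
    num = 0.0
    den = 0.0
    for tag, target_probs in rows:
        w = policy.get(tag, 1.0)
        for p in target_probs:
            num += w * -math.log(p)  # per-token CE, scaled by the row weight
            den += w                 # normalise by total weight, not token count
    return num / den

batch = [
    ("mbpp53", [math.exp(-1.0)]),                       # up-weighted holdout row
    ("long_poison", [math.exp(-3.0), math.exp(-3.0)]),  # down-weighted long row
]
loss = weighted_token_ce(batch)
```

Normalising by total weight keeps the loss on the same scale as unweighted CE, so a learning rate tuned for one stays comparable for the other.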

## Final ceiling on `phys05_code_skeleton`

```
MBPP B    = 13/100   (production, frozen)
HumanEval B =  6/164   (production, frozen)
LCB B     =  0/50    (production, frozen)
Anchor    = 19/19    (under production decoder rep=1.15, ngram=2, cuda_rep=1.05)
```

Best LoRA artifact at native gate: **BD6.5 / 15/19** (4 holdouts:
MBPP/53, HE/34, HE/45, HE/85). Under a pure-greedy decoder it scores
16/19, which suggests the weights are slightly cleaner than the gate
shows, but it cannot ship because the production decoder requires
rep-penalty.
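To show how decoder settings alone can flip an anchor, the sketch below implements the conventional repetition penalty (divide positive logits, multiply negative ones) and no-repeat n-gram blocking. It assumes `rep=1.15` and `ngram=2` follow these common definitions, which this report does not confirm; `cuda_rep` is not modelled, and both function names are hypothetical.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.15):
    """Return a copy of logits with already-generated tokens made less likely."""
    out = list(logits)
    for tok in set(generated_ids):
        # Dividing positive logits and multiplying negative ones both lower them.
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out

def banned_next_tokens(generated_ids, n=2):
    """Tokens that would complete an n-gram already present in generated_ids."""
    if n < 1 or len(generated_ids) < n:
        return set()
    prefix = tuple(generated_ids[len(generated_ids) - (n - 1):])
    banned = set()
    for i in range(len(generated_ids) - n + 1):
        if tuple(generated_ids[i:i + n - 1]) == prefix:
            banned.add(generated_ids[i + n - 1])
    return banned

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

logits = [2.0, 1.9, -1.0]
greedy_pick = argmax(logits)                                   # token 0
penalised_pick = argmax(apply_repetition_penalty(logits, [0])) # token 1
```

In this toy case token 0 wins under pure greedy decoding but loses once it has appeared before (2.0 / 1.15 < 1.9). Code that legitimately repeats an identifier is exposed to exactly this kind of flip.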

The 4 BD6.5 holdouts split:
* **MBPP/53**: fixable via token-weighted CE (BD6.8D proved it with ×2.0
  at r=8). The fix did not survive the capacity bump: at r=16 the same
  weighting lost MBPP/53 again, along with other anchors.
* **HE/34, HE/45**: decoder-noise (BD6.8F+ proved). NOT a training
  problem. Cannot fix without breaking production decoder.
* **HE/85 (366 char)**: none of r=8/×2, r=8/×4, or r=16/×2 moved
  it. Likely beyond r=16/alpha=32 capacity OR a fundamental mismatch
  with the verifier's def-extractor on multi-line docstring patterns.

## What this proves about the surgery framework

* **The strict gate works.** It rejected EIGHT surgery passes in a row
  (BD6.3, .4, .5, .6, .7, .8D, .8D2, .8D-rank), or ten training runs
  counting the three BD6.7 variants separately. Production stayed
  at 19/19 the entire time. Not one of MBPP B / HE B / LCB B regressed
  during the cycle. The PYTHON_QUARANTINE doctrine + gate-and-revert
  protocol prevented every poisoned pack from leaking to runtime.
* **Each lever has a saturation point**, and pushing past it breaks
  more than it builds. This is now empirically confirmed across
  three orthogonal lever families.
* **r=8 is the right rank** for this organ on this dataset: under the
  same {2,2,0.5} policy, r=16 scored 12/19 against r=8's 15/19. r=16 is
  worse without architectural changes.
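The gate-and-revert decision can be sketched as follows. This assumes the gate requires every anchor to pass and, as a further assumption not spelled out in this report, that no bench score may drop below production. The `Scores` type and `strict_gate` function are hypothetical, and the candidate's bench values are placeholders rather than measured results.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Scores:
    anchor_pass: int
    anchor_total: int
    mbpp_b: int
    he_b: int
    lcb_b: int

def strict_gate(candidate: Scores, production: Scores) -> str:
    """KEEP only if all anchors pass and no bench score regresses (assumed rule)."""
    anchors_ok = candidate.anchor_pass == candidate.anchor_total
    bench_ok = (
        candidate.mbpp_b >= production.mbpp_b
        and candidate.he_b >= production.he_b
        and candidate.lcb_b >= production.lcb_b
    )
    return "KEEP" if anchors_ok and bench_ok else "REVERT"

# Production numbers are from this report.
PRODUCTION = Scores(anchor_pass=19, anchor_total=19, mbpp_b=13, he_b=6, lcb_b=0)
# Anchor count is from this report; bench fields are placeholders, not measurements.
BD68D_RANK = Scores(anchor_pass=12, anchor_total=19, mbpp_b=13, he_b=6, lcb_b=0)

decision = strict_gate(BD68D_RANK, PRODUCTION)  # "REVERT"
```

An all-or-nothing anchor check is what makes a 15/19 pass a REVERT rather than a partial win: the gate protects production instead of rewarding progress.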

## Future targets (NOT BD6.x; for the next cycle)

The remaining headroom on `phys05_code_skeleton` is *not* in training:
* **Verifier extractor relaxation** in `tools/surgery/anchor_eval.py`:
  accept up to 1 blank line inside def-body. This tests whether HE/85
  fails because of extractor strictness rather than generation. Roughly
  a 5-line change.
* **Decoder/runtime co-training**: train a LoRA on the production
  decoder distribution (logits with rep-penalty applied during
  training). Expensive; not standard PEFT.
* **Larger base model**: phys07_code_skeleton (Physarum-7B fork
  with code-specialized weights). Out of BD6.x scope.
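The extractor relaxation in the first bullet could look roughly like the sketch below. It is a hypothetical stand-in, not the code in `tools/surgery/anchor_eval.py`: the function name `extract_def`, the `max_blank` parameter, and the rule that a line at column 0 ends the function are all assumptions.

```python
def extract_def(text: str, max_blank: int = 1) -> str:
    """Return the first top-level `def` block from generated text.

    Runs of up to max_blank consecutive blank lines inside the body are kept;
    a longer run, or a non-indented line, ends the block.
    """
    lines = text.splitlines()
    start = next((i for i, ln in enumerate(lines) if ln.startswith("def ")), None)
    if start is None:
        return ""
    kept = [lines[start]]
    pending_blanks = []
    for ln in lines[start + 1:]:
        if not ln.strip():
            pending_blanks.append(ln)
            if len(pending_blanks) > max_blank:
                break  # too many consecutive blanks: treat as end of body
            continue
        if not ln.startswith((" ", "\t")):
            break  # dedent to column 0 ends the function
        kept.extend(pending_blanks)  # the blanks were inside the body, keep them
        pending_blanks = []
        kept.append(ln)
    return "\n".join(kept)

sample = "def f(x):\n    y = x + 1\n\n    return y\nprint('junk')\n"
strict = extract_def(sample, max_blank=0)
relaxed = extract_def(sample, max_blank=1)
```

With `max_blank=0` the body is cut at the first blank line and the `return` is lost; with `max_blank=1` the full function survives. If HE/85 fails for this reason, the relaxed extractor should change its verdict without any retraining.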

Per spec: NONE of these now. Move surgery to next organ.

## Production state (FROZEN)

* `PHYS05_PACK = physarum05b_code_skeleton.planck` (BD6 pass-1)
* phys05_code_skeleton spec: rep=1.15, ng=2, cuda_rep=1.05
* MBPP B 13/100, HE B 6/164, LCB B 0/50, anchor 19/19
* All 8 rejected packs archived (`physarum05b_code_skeleton_v{2..9}.planck`,
  `*_v5_ckpt100.planck`, `*_v7_lambda005/010/020.planck`, `*_v8b.planck`)
* Best LoRA artifact: `tools/surgery/output/code_skeleton_lora_v5/`
  (BD6.5, 15/19) — kept for reference

## Files this final pass touched

* `tools/surgery/train_code_skeleton_lora_bd6_8d.py` — reused (rank
  passed via CLI, no code change)
* `physarum05b_code_skeleton_v9.planck` — repacked (rejected, archived)
* `tools/surgery/output/code_skeleton_lora_v9/` — adapter + 5 ckpts (rejected)
* `tools/surgery/output/Physarum05B-CodeSkeleton-v9/` — merged HF dir (rejected)
* `src/organs/organ_manager.cpp::PHYS05_PACK` — flipped to v9 then back to v1
* `reports/BD6_8D_RANK_FINAL_FREEZE.md` — this file

## Next surgery target

Per spec: **phys05_triz_contradiction**.

That's the next 0.5B organ in the OrganManager spec. Same surgery
framework (anchor capture → poison harvest → mixed train → gate →
revert if fail). Different verifier (`hard_verifier`) and different
prompt template. Will need its own poison_train.jsonl harvested from
TRIZ failures in the production bench.

`phys05_code_skeleton` is now **CLOSED** as a surgery target until
either:
* The verifier extractor is relaxed (different problem)
* A different base model is available (different organ entirely)

The strict gate did its job. Production is preserved. BD6.x cycle
ends with production unchanged at the numbers it had on day 1, and
no regressions introduced. That's the real victory.
