BD6.8D-rank — capacity bump fails, BD6.x code_skeleton FROZEN (2026-05-02)

TL;DR: r=16/alpha=32 with the BD6.8D winning policy (MBPP/53 ×2, HE/85 ×2, long-poison ×0.5) gave 12/19, WORSE than r=8's 15/19. Doubling the LoRA rank lowered average CE from 0.54 to 0.47 (the model fits the training distribution more tightly), but anchor pass-rate fell because the extra capacity introduced new drift on previously stable rows (MBPP/52 regressed for the first time across ALL BD6.x). Strict gate rejects. Production reverted to BD6 pass-1 (anchor 19/19).

Per spec: BD6.x code_skeleton surgery cycle is FROZEN at current production state. BD6.5 is archived as best rejected LoRA artifact. Next surgery target: phys05_triz_contradiction.

Final result on the last attempt

| pass | rank | alpha | lr | policy | anchor | gate |
|------|------|-------|----|----|------|------|
| BD6.5 (peak) | 8 | 16 | 5e-5 | no weighting | 15/19 | REVERT |
| BD6.8D | 8 | 16 | 5e-5 | {2,2,0.5} | 15/19 | REVERT (MBPP/53 fixed!) |
| BD6.8D-rank | 16 | 32 | 5e-5 | {2,2,0.5} | 12/19 | REVERT (final, freeze) |

BD6.8D-rank v9 anchor (r=16, α=32):
  KEPT (12):
    ✓ MBPP/17, MBPP/19, MBPP/20, MBPP/41, MBPP/51,
    ✓ MBPP/64, MBPP/90, MBPP/93, MBPP/96, MBPP/99, MBPP/105,
    ✓ HumanEval/53
  LOST (7):
    ✗ MBPP/52        ← NEW REGRESSION (was stable across ALL BD6.x)
    ✗ MBPP/53        ← BD6.8D had recovered this; r=16 lost it again
    ✗ HumanEval/23   ← was stable in BD6.5/8D, lost in 8D2 and now here
    ✗ HumanEval/27
    ✗ HumanEval/34   ← decoder-noise (BD6.8F+)
    ✗ HumanEval/45   ← decoder-noise (BD6.8F+)
    ✗ HumanEval/85

avg_ce=0.4660 (vs r=8's 0.5429), avg_loss=0.4516 — model FITS the training data better at r=16, but generalizes WORSE on the anchor verifier. Classic over-parameterization failure mode at the anchor boundary.
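The KEPT/LOST split above can be compared against the BD6.5 holdouts with a few set operations. This is a minimal sketch, not the project's tooling: `anchor_diff` is a hypothetical helper, and the row IDs are copied from this report with `HE/` expanded to `HumanEval/`.

```python
# Anchor rows lost by each pass, copied from this report.
BD6_5_LOST = {"MBPP/53", "HumanEval/34", "HumanEval/45", "HumanEval/85"}
V9_LOST = {
    "MBPP/52", "MBPP/53", "HumanEval/23", "HumanEval/27",
    "HumanEval/34", "HumanEval/45", "HumanEval/85",
}
ANCHOR_TOTAL = 19


def anchor_diff(prev_lost, curr_lost):
    """Split the current failures into carried-over, new, and recovered rows."""
    return {
        "carried": sorted(curr_lost & prev_lost),
        "new": sorted(curr_lost - prev_lost),
        "recovered": sorted(prev_lost - curr_lost),
    }


diff = anchor_diff(BD6_5_LOST, V9_LOST)
passed = ANCHOR_TOTAL - len(V9_LOST)
print(f"anchor {passed}/{ANCHOR_TOTAL}")
print("new vs BD6.5:", diff["new"])
print("recovered vs BD6.5:", diff["recovered"])
```

Relative to BD6.5, the r=16 pass recovers nothing and adds three new failures, which is where the drop from 15/19 to 12/19 comes from.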

Why r=16 didn't help

The LoRA had enough room to fit the holdout patterns more aggressively, but ALSO enough room to drift on every other anchor. The poison gradient at long-target rows now finds more capacity to perturb hidden states; rep-penalty + ngram blocker at runtime can't fully compensate.

MBPP/52 (72-char, very simple anchor) regressed for the first time across the entire BD6.x cycle. That's the canary: extra capacity destabilizes even the easiest rows.

Lowering lr to 3e-5 might have helped, but spec marks this as the LAST attempt before freeze. Per protocol, we don't keep tuning.
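To put the capacity bump in numbers: under the standard LoRA formulation, an adapter on a `d_in × d_out` weight adds `r · (d_in + d_out)` trainable parameters and is scaled by `alpha / r`. The sketch below assumes that formulation; the layer width `D` is a placeholder, not the organ's real dimension, and both helper names are hypothetical.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_in x d_out weight.

    A has shape (rank, d_in) and B has shape (d_out, rank).
    """
    return rank * (d_in + d_out)


def lora_scale(alpha: float, rank: int) -> float:
    """Standard LoRA scaling factor applied to the low-rank update."""
    return alpha / rank


D = 1024  # hypothetical layer width, for illustration only

for rank, alpha in [(8, 16), (16, 32)]:
    print(
        f"r={rank:>2} alpha={alpha:>2} "
        f"params/layer={lora_params(D, D, rank)} "
        f"scale={lora_scale(alpha, rank)}"
    )
```

Both configurations keep the same `alpha / r` ratio, so under these assumptions the only thing that changes between BD6.8D and BD6.8D-rank is the number of trainable adapter parameters, which doubles.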

The BD6.x cycle, complete picture

| pass | lever | r/α | anchor | gate |
|-------------|----------------------------------------|-----|--------|----------|
| BD6 pass-1 | poison v1, no anchor | 16/32 | 19/19 | KEEP |
| BD6.3 | fresh poison only | 8/16 | 0/19 | REVERT |
| BD6.4 | + 5× anchor replication | 8/16 | 7/19 | REVERT |
| BD6.5 | + bench-aware repl + stratified, 53 % anchor | 8/16 | 15/19 | REVERT (peak) |
| BD6.6 | + holdout × 50, 63 % anchor | 8/16 | 11/19 | REVERT (over-anchor) |
| BD6.7 a/b/c | + KL anchor λ=0.10/0.20/0.05 | 8/16 | 12/10/12 | REVERT (KL redundant) |
| BD6.8D | + token-weighted CE {2,2,0.5} | 8/16 | 15/19 | REVERT (MBPP/53 fixed) |
| BD6.8D2 | + asymmetric weights {2,4,split} | 8/16 | 13/19 | REVERT (over-tuned) |
| BD6.8D-rank | same {2,2,0.5} policy | 16/32 | 12/19 | REVERT (FREEZE) |

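Three orthogonal levers were tried: replication, KL anchoring, and token-weighted CE. Replication and token-weighted CE each peaked at exactly 15/19, KL anchoring never got past 12/19, and no lever exceeded 15/19. Capacity also failed to lift it. The ceiling is real.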
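For readers unfamiliar with the token-weighted CE lever, here is one way the {2, 2, 0.5} policy could be applied. This is a sketch under stated assumptions, not the code in `train_code_skeleton_lora_bd6_8d.py`: the row tags, the default weight of 1.0, and the weight-normalized mean are all choices made for illustration.

```python
import math

# Row-level weights from the {2, 2, 0.5} policy in this report.
# Tag names and the implicit default weight of 1.0 are assumptions.
POLICY = {"MBPP/53": 2.0, "HumanEval/85": 2.0, "long_poison": 0.5}


def token_ce(prob_of_target: float) -> float:
    """Cross-entropy of a single token given the model's target probability."""
    return -math.log(prob_of_target)


def weighted_ce(rows, policy=POLICY):
    """rows: list of (tag, [p(target token), ...]).

    Every token inherits its row's weight; the result is the
    weight-normalized mean cross-entropy over all tokens.
    """
    num = 0.0
    den = 0.0
    for tag, probs in rows:
        w = policy.get(tag, 1.0)
        for p in probs:
            num += w * token_ce(p)
            den += w
    return num / den


batch = [
    ("MBPP/53", [0.5, 0.5]),                # hard holdout row, upweighted
    ("long_poison", [0.9, 0.9, 0.9, 0.9]),  # long poison row, downweighted
    ("MBPP/17", [0.8]),                     # ordinary anchor row
]

print("unweighted:", round(weighted_ce(batch, policy={}), 4))
print("weighted:  ", round(weighted_ce(batch), 4))
```

With these toy probabilities the weighted loss is higher than the unweighted one, because the poorly fitted holdout row now contributes more of the gradient signal and the long poison row less.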

Final ceiling on `phys05_code_skeleton`

MBPP B    = 13/100   (production, frozen)
HumanEval B =  6/164   (production, frozen)
LCB B     =  0/50    (production, frozen)
Anchor    = 19/19    (under production decoder rep=1.15, ngram=2, cuda_rep=1.05)

Best LoRA artifact at the native gate: BD6.5 at 15/19 (4 holdouts: MBPP/53, HE/34, HE/45, HE/85). Under a pure-greedy decoder it scores 16/19, so the weights are slightly cleaner than the gate shows, but it cannot ship because the production decoder requires rep-penalty.
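The gap between the pure-greedy score and the native-gate score is easier to see with a toy decoder step. The production decoder is not shown in this report, so the sketch below assumes the common formulation of a repetition penalty (positive logits divided by the penalty, negative ones multiplied) plus a no-repeat n-gram block; `cuda_rep` is not modelled, and all function names are hypothetical.

```python
def apply_rep_penalty(logits, generated, penalty=1.15):
    """Penalize tokens that were already generated (common CTRL-style rule)."""
    out = list(logits)
    for tok in set(generated):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


def banned_by_ngram(generated, n=2):
    """Tokens that would complete an n-gram already present in `generated`."""
    if n < 2 or len(generated) < n:
        return set()
    prefix = tuple(generated[-(n - 1):])
    banned = set()
    for i in range(len(generated) - n + 1):
        if tuple(generated[i:i + n - 1]) == prefix:
            banned.add(generated[i + n - 1])
    return banned


def next_token(logits, generated, penalty=1.15, n=2):
    """Greedy pick after applying the penalty and the n-gram block."""
    scored = apply_rep_penalty(logits, generated, penalty)
    for tok in banned_by_ngram(generated, n):
        scored[tok] = float("-inf")
    return max(range(len(scored)), key=scored.__getitem__)


logits = [1.0, 2.0, 1.9, 0.5]      # toy vocabulary of four tokens
print(next_token(logits, []))      # nothing generated yet: plain greedy pick
print(next_token(logits, [1]))     # token 1 already emitted: penalty applies
print(banned_by_ngram([3, 1, 3]))  # bigram (3, 1) seen, so 1 is banned after 3
```

In the second call the penalty pulls token 1 from 2.0 down to about 1.74, below token 2's 1.9, so the decoder's choice changes even though the weights did not. That is the kind of effect behind the HE/34 and HE/45 decoder-noise rows.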

The 4 BD6.5 holdouts split:

MBPP/53: fixable via token-weighted CE (BD6.8D proved it with ×2.0). But the isolated fix cost the rest of the system at r=16.

HE/34, HE/45: decoder-noise (BD6.8F+ proved). NOT a training problem. Cannot fix without breaking the production decoder.

HE/85 (366 char): neither r=8/×2 nor r=8/×4 nor r=16/×2 moved it. Likely beyond r=16/alpha=32 capacity OR a fundamental mismatch with the verifier's def-extractor on multi-line docstring patterns.

What this proves about the surgery framework

The strict gate works. It rejected EIGHT surgery passes in a row, ten runs counting the three BD6.7 variants (BD6.3, .4, .5, .6, .7×3, .8D, .8D2, .8D-rank). Production stayed at 19/19 the entire time. Not one of MBPP B / HE B / LCB B regressed during the cycle. The PYTHON_QUARANTINE doctrine + gate-and-revert protocol prevented every poisoned pack from leaking to runtime.

Each lever has a saturation point, and pushing past it breaks more than it builds. This is now empirically confirmed across three orthogonal lever families.

r=8 is the right rank for this organ at this dataset. r=16 is worse without architectural changes.
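The gate-and-revert behaviour described in the first point can be sketched in a few lines. The exact gate criterion is not spelled out in this report; the sketch assumes a candidate is kept only on a full anchor pass, which matches every row of the cycle table. Pack names and anchor counts are taken from this report, and the function names are hypothetical.

```python
PRODUCTION_PACK = "physarum05b_code_skeleton.planck"
ANCHOR_TOTAL = 19


def strict_gate(anchor_passed: int, anchor_total: int = ANCHOR_TOTAL) -> str:
    """KEEP only on a full anchor pass (assumed criterion); otherwise REVERT."""
    return "KEEP" if anchor_passed == anchor_total else "REVERT"


def select_pack(candidate_pack: str, anchor_passed: int,
                production_pack: str = PRODUCTION_PACK):
    """Return the pack that should be live after gating, plus the verdict."""
    verdict = strict_gate(anchor_passed)
    live = candidate_pack if verdict == "KEEP" else production_pack
    return live, verdict


# Anchor scores of the rejected runs, as listed in the cycle table.
CYCLE = [
    ("BD6.3", 0), ("BD6.4", 7), ("BD6.5", 15), ("BD6.6", 11),
    ("BD6.7a", 12), ("BD6.7b", 10), ("BD6.7c", 12),
    ("BD6.8D", 15), ("BD6.8D2", 13), ("BD6.8D-rank", 12),
]

for name, passed in CYCLE:
    print(f"{name:<12} {passed:>2}/{ANCHOR_TOTAL} {strict_gate(passed)}")

print(select_pack("physarum05b_code_skeleton_v9.planck", 12))
```

Because the gate compares against the full anchor set rather than against the previous candidate, a pass that improves on its predecessor (such as BD6.5 over BD6.4) is still reverted.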

Future targets (NOT BD6.x; for the next cycle)

The remaining headroom on phys05_code_skeleton is not in training:

Verifier extractor relaxation in tools/surgery/anchor_eval.py: accept up to 1 blank line inside def-body. Tests whether HE/85 fails due to extractor strictness rather than generation. ~5 lines.

Decoder/runtime co-training: train a LoRA on the production decoder distribution (logits with rep-penalty applied during training). Expensive; not standard PEFT.

Larger base model: phys07_code_skeleton (Physarum-7B fork with code-specialized weights). Out of BD6.x scope.

Per spec: NONE of these now. Move surgery to next organ.
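For whoever picks up the first target later, the extractor relaxation might look roughly like this. The real extractor in `tools/surgery/anchor_eval.py` is not shown in this report, so `extract_def` and its `max_blank` parameter are hypothetical; the sketch only illustrates the idea of tolerating one blank line inside a def-body.

```python
def extract_def(text: str, max_blank: int = 0) -> str:
    """Extract the first top-level `def` block from generated text.

    The body ends at the first non-indented line, or after more than
    `max_blank` consecutive blank lines. max_blank=0 mimics a strict
    extractor; max_blank=1 is the proposed relaxation.
    """
    lines = text.splitlines()
    start = next((i for i, ln in enumerate(lines) if ln.startswith("def ")), None)
    if start is None:
        return ""
    body = [lines[start]]
    pending_blank = []
    for ln in lines[start + 1:]:
        if ln.strip() == "":
            pending_blank.append(ln)
            if len(pending_blank) > max_blank:
                break
        elif ln[0] in " \t":
            # Blank lines are kept only when indented code follows them.
            body.extend(pending_blank)
            pending_blank = []
            body.append(ln)
        else:
            break
    return "\n".join(body)


sample = 'def f(x):\n    """Doc."""\n\n    return x + 1\nprint(f(1))\n'

strict = extract_def(sample, max_blank=0)
relaxed = extract_def(sample, max_blank=1)
print(repr(strict))
print(repr(relaxed))
```

On this sample the strict setting cuts the function off after the docstring, while the relaxed setting keeps the `return` line. If HE/85 fails for this reason, the same comparison on its real output would show it without any retraining.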

Production state (FROZEN)

PHYS05_PACK = physarum05b_code_skeleton.planck (BD6 pass-1)
phys05_code_skeleton spec: rep=1.15, ng=2, cuda_rep=1.05
MBPP B 13/100, HE B 6/164, LCB B 0/50, anchor 19/19
All 8 rejected packs archived (physarum05b_code_skeleton_v{2..9}.planck, *_v5_ckpt100.planck, *_v7_lambda005/010/020.planck, *_v8b.planck)
Best LoRA artifact: tools/surgery/output/code_skeleton_lora_v5/ (BD6.5, 15/19), kept for reference

Files this final pass touched

tools/surgery/train_code_skeleton_lora_bd6_8d.py — reused (rank passed via CLI, no code change)

physarum05b_code_skeleton_v9.planck — repacked (rejected, archived)
tools/surgery/output/code_skeleton_lora_v9/ — adapter + 5 ckpts (rejected)
tools/surgery/output/Physarum05B-CodeSkeleton-v9/ — merged HF dir (rejected)
src/organs/organ_manager.cpp::PHYS05_PACK — flipped to v9 then back to v1
reports/BD6_8D_RANK_FINAL_FREEZE.md — this file

Next surgery target

Per spec: phys05_triz_contradiction.

That's the next 0.5B organ in the OrganManager spec. Same surgery framework (anchor capture → poison harvest → mixed train → gate → revert if fail). Different verifier (hard_verifier) and different prompt template. Will need its own poison_train.jsonl harvested from TRIZ failures in the production bench.

phys05_code_skeleton is now CLOSED as a surgery target until either:

The verifier extractor is relaxed (different problem)
A different base model is available (different organ entirely)

The strict gate did its job. Production is preserved. BD6.x cycle ends with production unchanged at the numbers it had on day 1, and no regressions introduced. That's the real victory.
