BD6.8D-rank — capacity bump fails, BD6.x code_skeleton FROZEN (2026-05-02)

TL;DR: r=16/alpha=32 with the BD6.8D winning policy (MBPP/53 ×2, HE/85 ×2, long-poison ×0.5) gave 12/19, WORSE than r=8's 15/19. 4× LoRA capacity lowered average CE from 0.54 → 0.47 (the model fits the training distribution more tightly), but anchor pass-rate fell because the extra capacity introduced new drift on previously stable rows (MBPP/52 regressed for the first time across ALL BD6.x). Strict gate rejects. Production reverted to BD6 pass-1 (anchor 19/19).

Per spec: BD6.x code_skeleton surgery cycle is FROZEN at current production state. BD6.5 is archived as best rejected LoRA artifact. Next surgery target: phys05_triz_contradiction.


Final result on the last attempt

| pass | rank | alpha | lr | policy | anchor | gate |
|------|------|-------|----|--------|--------|------|
| BD6.5 (peak) | 8 | 16 | 5e-5 | no weighting | 15/19 | REVERT |
| BD6.8D | 8 | 16 | 5e-5 | {2,2,0.5} | 15/19 | REVERT (MBPP/53 fixed!) |
| BD6.8D-rank | 16 | 32 | 5e-5 | {2,2,0.5} | 12/19 | REVERT (final, freeze) |

BD6.8D-rank v9 anchor (r=16, α=32):
  KEPT (12):
    ✓ MBPP/17, MBPP/19, MBPP/20, MBPP/41, MBPP/51,
    ✓ MBPP/64, MBPP/90, MBPP/93, MBPP/96, MBPP/99, MBPP/105,
    ✓ HumanEval/53
  LOST (7):
    ✗ MBPP/52        ← NEW REGRESSION (was stable across ALL BD6.x)
    ✗ MBPP/53        ← BD6.8D had recovered this; r=16 lost it again
    ✗ HumanEval/23   ← was stable in BD6.5/8D, lost in 8D2 and now here
    ✗ HumanEval/27
    ✗ HumanEval/34   ← decoder-noise (BD6.8F+)
    ✗ HumanEval/45   ← decoder-noise (BD6.8F+)
    ✗ HumanEval/85

avg_ce=0.4660 (vs r=8's 0.5429), avg_loss=0.4516 — model FITS the training data better at r=16, but generalizes WORSE on the anchor verifier. Classic over-parameterization failure mode at the anchor boundary.
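To make the `{2,2,0.5}` policy concrete, here is a minimal sketch of how per-row weights could enter an averaged cross-entropy. It assumes each weight applies to every target token of a tagged row and that the average is normalized by total weight. The row tags, function names, and normalization are illustrative assumptions, not the trainer's actual implementation.

```python
import math

# Hypothetical tags mirroring the {2, 2, 0.5} policy (MBPP/53 x2, HE/85 x2,
# long-poison x0.5). Untagged rows fall back to weight 1.0.
WEIGHTS = {"mbpp53": 2.0, "he85": 2.0, "long_poison": 0.5}

def token_ce(target_probs):
    """Mean negative log-likelihood over the target tokens of one row."""
    return -sum(math.log(p) for p in target_probs) / len(target_probs)

def weighted_ce(rows, weights=WEIGHTS, default=1.0):
    """Weight-normalized average CE; each row is (tag, [p(target token), ...])."""
    total, norm = 0.0, 0.0
    for tag, probs in rows:
        w = weights.get(tag, default)
        total += w * token_ce(probs)
        norm += w
    return total / norm

rows = [
    ("mbpp53", [0.9, 0.8]),            # up-weighted anchor row
    ("long_poison", [0.5, 0.5, 0.5]),  # down-weighted long poison row
    ("other", [0.7]),                  # default weight
]
print(round(weighted_ce(rows), 4))
```

Under this weighting a lower average CE only says the weighted training mix is fit more tightly. It does not say anything about the anchor rows, which matches the gap between avg_ce and the 12/19 gate result above.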

Why r=16 didn't help

but ALSO enough room to drift on every other anchor. The poison gradient at long-target rows now finds more capacity to perturb hidden states; rep-penalty + ngram blocker at runtime can't fully compensate.

MBPP/52 regressed after staying stable across the entire BD6.x cycle. That's the canary: extra capacity destabilizes even the easiest rows.

This was the LAST attempt before freeze. Per protocol, we don't keep tuning.

The BD6.x cycle, complete picture

| pass | lever | r/α | anchor | gate |
|-------------|----------------------------------------|-----|--------|----------|
| BD6 pass-1 | poison v1, no anchor | 16/32 | 19/19 | KEEP |
| BD6.3 | fresh poison only | 8/16 | 0/19 | REVERT |
| BD6.4 | + 5× anchor replication | 8/16 | 7/19 | REVERT |
| BD6.5 | + bench-aware repl + stratified, 53% anchor | 8/16 | 15/19 | REVERT (peak) |
| BD6.6 | + holdout × 50, 63% anchor | 8/16 | 11/19 | REVERT (over-anchor) |
| BD6.7 a/b/c | + KL anchor λ=0.10/0.20/0.05 | 8/16 | 12/10/12 | REVERT (KL redundant) |
| BD6.8D | + token-weighted CE {2,2,0.5} | 8/16 | 15/19 | REVERT (MBPP/53 fixed) |
| BD6.8D2 | + asymmetric weights {2,4,split} | 8/16 | 13/19 | REVERT (over-tuned) |
| BD6.8D-rank | same {2,2,0.5} policy | 16/32 | 12/19 | REVERT (FREEZE) |

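Three orthogonal levers (replication, KL, token-weighted CE) all topped out at or below 15/19: replication (BD6.5) and token-weighted CE (BD6.8D) peaked at exactly 15/19, while the KL anchor never exceeded 12/19. Capacity also failed to lift it. The ceiling is real.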
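For readers unfamiliar with the KL lever used in BD6.7 a/b/c, the sketch below shows one common form of a KL-anchored loss: the task CE plus λ times the divergence between a frozen reference distribution and the tuned model's distribution. The direction of the KL term, the function names, and the toy distributions are assumptions; the report only gives the λ values.

```python
import math

def kl_divergence(p_ref, p_new):
    """KL(p_ref || p_new) for two discrete distributions over the same vocab.

    Assumes p_new is strictly positive wherever p_ref is (true for softmax).
    """
    return sum(p * math.log(p / q) for p, q in zip(p_ref, p_new) if p > 0.0)

def kl_anchored_loss(ce, p_ref, p_new, lam=0.10):
    """Task cross-entropy plus a penalty for drifting from the reference."""
    return ce + lam * kl_divergence(p_ref, p_new)

# Toy next-token distributions on an anchor row (illustrative numbers only).
reference = [0.5, 0.5]
tuned = [0.25, 0.75]
for lam in (0.05, 0.10, 0.20):
    print(lam, round(kl_anchored_loss(0.5, reference, tuned, lam), 4))
```

A larger λ pulls the tuned model back toward the reference on anchor rows, at the cost of fitting the poison rows less aggressively.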

Final ceiling on phys05_code_skeleton

MBPP B    = 13/100   (production, frozen)
HumanEval B =  6/164   (production, frozen)
LCB B     =  0/50    (production, frozen)
Anchor    = 19/19    (under production decoder rep=1.15, ngram=2, cuda_rep=1.05)

Best LoRA artifact at native gate: BD6.5 at 15/19 (4 holdouts: MBPP/53, HE/34, HE/45, HE/85). Under a pure-greedy decoder it scores 16/19, which suggests the weights are slightly cleaner than the gate shows, but it cannot ship because the production decoder requires rep-penalty.
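The gap between pure-greedy and the production decoder comes from how rep-penalty and n-gram blocking reshape the logits before the argmax. The sketch below uses the widely used CTRL-style repetition penalty and a no-repeat-n-gram ban with the `rep=1.15, ngram=2` values quoted above. It is an illustration under those assumptions, not the production decoder, and it does not model `cuda_rep`.

```python
def apply_repetition_penalty(logits, generated, penalty=1.15):
    """Make already-generated tokens less likely (CTRL-style penalty)."""
    out = list(logits)
    for tok in set(generated):
        # Positive logits shrink, negative logits grow more negative.
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out

def banned_by_ngram(generated, n=2):
    """Tokens that would complete an n-gram already present in `generated`."""
    prefix = tuple(generated[len(generated) - (n - 1):]) if n > 1 else ()
    banned = set()
    for i in range(len(generated) - n + 1):
        if tuple(generated[i:i + n - 1]) == prefix:
            banned.add(generated[i + n - 1])
    return banned

def next_token(logits, generated, penalty=1.15, ngram=2):
    """Greedy pick after applying the penalty and the n-gram ban."""
    scores = apply_repetition_penalty(logits, generated, penalty)
    for tok in banned_by_ngram(generated, ngram):
        scores[tok] = float("-inf")
    return max(range(len(scores)), key=scores.__getitem__)

logits = [2.0, 1.9, -1.0]
greedy = max(range(len(logits)), key=logits.__getitem__)
print(greedy, next_token(logits, generated=[0]))
```

In this toy case pure greedy picks token 0, while the penalized decoder picks token 1 because token 0 was already generated. A row whose correct output legitimately repeats tokens can therefore pass under greedy and fail under the production decoder.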

The 4 BD6.5 holdouts split:

But the fix isolated cost the rest of the system at r=16.

problem. Cannot fix without breaking production decoder.

it. Likely beyond r=16/alpha=32 capacity OR a fundamental mismatch with the verifier's def-extractor on multi-line docstring patterns.

What this proves about the surgery framework

(BD6.3, .4, .5, .6, .7×3, .8D, .8D2, .8D-rank). Production stayed at 19/19 the entire time. Not one of MBPP B / HE B / LCB B regressed during the cycle. The PYTHON_QUARANTINE doctrine + gate-and-revert protocol prevented every poisoned pack from leaking to runtime.

more than it builds. This is now empirically confirmed across three orthogonal lever families.

worse without architectural changes.

Future targets (NOT BD6.x; for the next cycle)

The remaining headroom on phys05_code_skeleton is not in training:

— accept up to 1 blank line inside def-body. Tests if HE/85 fails due to extractor strictness, not generation. ~5 lines.

decoder distribution (logits with rep-penalty applied during training). Expensive; not standard PEFT.

with code-specialized weights). Out of BD6.x scope.
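The extractor relaxation mentioned above (accepting up to 1 blank line inside a def-body) could look roughly like the sketch below. The real verifier's def-extractor is not shown in this report, so the function name, the indentation rule, and the reading of "up to 1 blank line" as "up to 1 consecutive blank line" are all assumptions.

```python
def extract_def(text, max_blank=0):
    """Return the first top-level `def` block in `text`, or None.

    Body lines must be indented. Up to `max_blank` consecutive blank lines
    are tolerated inside the body, but only if indented code follows them.
    max_blank=0 is the strict behaviour; max_blank=1 is the relaxation.
    """
    lines = text.splitlines()
    start = next((i for i, ln in enumerate(lines) if ln.startswith("def ")), None)
    if start is None:
        return None
    kept = [lines[start]]
    pending_blanks = []
    for ln in lines[start + 1:]:
        if not ln.strip():
            pending_blanks.append(ln)
            if len(pending_blanks) > max_blank:
                break  # too many blank lines: treat the body as finished
        elif ln.startswith((" ", "\t")):
            kept.extend(pending_blanks)  # blanks were inside the body
            kept.append(ln)
            pending_blanks = []
        else:
            break  # dedented line ends the function
    return "\n".join(kept)

generated = "def f(x):\n    y = x + 1\n\n    return y\nprint(f(1))"
print(repr(extract_def(generated, max_blank=0)))
print(repr(extract_def(generated, max_blank=1)))
```

With `max_blank=0` the extracted function is cut before `return y`, so a correct generation would fail verification. With `max_blank=1` the full body survives. That is the distinction the proposed test is meant to isolate.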

Per spec: NONE of these now. Move surgery to next organ.

Production state (FROZEN)

*_v5_ckpt100.planck, *_v7_lambda005/010/020.planck, *_v8b.planck)

(BD6.5, 15/19) — kept for reference

Files this final pass touched

passed via CLI, no code change)

Next surgery target

Per spec: phys05_triz_contradiction.

That's the next 0.5B organ in the OrganManager spec. Same surgery framework (anchor capture → poison harvest → mixed train → gate → revert if fail). Different verifier (hard_verifier) and different prompt template. Will need its own poison_train.jsonl harvested from TRIZ failures in the production bench.
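The gate and revert steps of that framework can be sketched as follows. This assumes the strict gate means "every anchor row must pass", which is consistent with 15/19 being rejected and 19/19 being kept but is not spelled out in this report. `strict_gate`, `surgery_step`, and `run_anchor` are hypothetical names.

```python
def strict_gate(anchor_results):
    """KEEP only if every anchor row passes; otherwise REVERT."""
    passed = sum(1 for ok in anchor_results.values() if ok)
    total = len(anchor_results)
    verdict = "KEEP" if passed == total else "REVERT"
    return verdict, f"{passed}/{total}"

def surgery_step(production, candidate, run_anchor):
    """Gate `candidate` and return the pack that should be live afterwards."""
    verdict, score = strict_gate(run_anchor(candidate))
    live = candidate if verdict == "KEEP" else production
    return live, verdict, score

# Toy anchor run: 12 of 19 rows pass, as in the BD6.8D-rank result.
fake_anchor = {f"row{i}": i < 12 for i in range(19)}
print(surgery_step("BD6 pass-1", "BD6.8D-rank", lambda pack: fake_anchor))
```

Because the production pack is only replaced on KEEP, a rejected candidate never reaches runtime regardless of how many passes are attempted.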

phys05_code_skeleton is now CLOSED as a surgery target until either:

The strict gate did its job. Production is preserved. BD6.x cycle ends with production unchanged at the numbers it had on day 1, and no regressions introduced. That's the real victory.