BD6.8D2 — token-weighted CE refinement OVERTUNED, peak is BD6.8D (2026-05-02)

TL;DR — refining BD6.8D's weighting policy from {MBPP/53 ×2, HE/85 ×2, all-long-poison ×0.5} to {MBPP/53 ×2, HE/85 ×4, MBPP-poison-long ×0.5, HE-poison ×1.0} dropped anchor pass-rate from 15/19 → 13/19 AND regressed MBPP/53 (which BD6.8D had recovered). Six rows lost in v8b. Strict gate rejects. Production reverted. The BD6.x cycle's pattern holds: each lever has a saturation point, and pushing beyond it destabilizes. BD6.8D's modest weighting {2/2/0.5} = 15/19 was the peak of the token-weighted CE lever — exactly mirroring how BD6.5 was the peak of replication and BD6.7 KL had no useful rung. Training-only levers are now exhausted.

What changed vs BD6.8D

Per-row weight policy refined per user spec:

| row class | BD6.8D | BD6.8D2 |
|----------------------------------------------|--------|---------|
| anchor MBPP/53 | 2.0 | 2.0 |
| anchor HumanEval/85 | 2.0 | 4.0 |
| anchor (other 17) | 1.0 | 1.0 |
| poison MBPP-class + tokens > 100 (long) | 0.5 | 0.5 |
| poison HumanEval-class | 0.5 (if long) | 1.0 (always) |
| poison short | 1.0 | 1.0 |
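
The two policies in the table can be written as per-row weight functions. This is a minimal sketch, not the code in train_code_skeleton_lora_bd6_8d2.py: the `Row` schema and its field names are hypothetical, since the JSONL layout is not shown in this report.

```python
from dataclasses import dataclass

LONG_TOKENS = 100  # "long" threshold from the table: tokens > 100


@dataclass(frozen=True)
class Row:
    # Hypothetical row schema; the real dataset fields are an assumption.
    kind: str       # "anchor" or "poison"
    bench: str      # "MBPP" or "HumanEval"
    task_id: str    # e.g. "MBPP/53", "HumanEval/85"
    n_tokens: int   # target length in tokens


def weight_bd6_8d(row: Row) -> float:
    """BD6.8D policy: MBPP/53 x2, HE/85 x2, every long poison row x0.5."""
    if row.kind == "anchor":
        return 2.0 if row.task_id in ("MBPP/53", "HumanEval/85") else 1.0
    return 0.5 if row.n_tokens > LONG_TOKENS else 1.0


def weight_bd6_8d2(row: Row) -> float:
    """BD6.8D2 policy: HE/85 x4, HE poison always x1.0, MBPP long poison x0.5."""
    if row.kind == "anchor":
        if row.task_id == "HumanEval/85":
            return 4.0
        return 2.0 if row.task_id == "MBPP/53" else 1.0
    if row.bench == "HumanEval":
        return 1.0  # no longer downweighted, regardless of length
    return 0.5 if row.n_tokens > LONG_TOKENS else 1.0
```

The only rows whose weight differs between the two functions are the HE/85 anchor and long HumanEval-class poison rows.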

Hyperparams identical: r=8, alpha=16, lr=5e-5, ep=1, ckpt-step=100. Dataset bd6_5_mixed_train.jsonl unchanged.

Trainer counts during epoch:

HE/85 anchor ×4 visited 20 times
MBPP/53 anchor ×2 visited 20 times
MBPP-class long poison ×0.5: 12 visits
HE poison ×1.0: 158 visits
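
As a rough way to read these counts, weight times visits gives each class's "weight mass" in the epoch. The sketch below covers only the four classes reported above and ignores per-row CE magnitude, so it is a proxy for relative pull, not a measurement of gradient share.

```python
# (class label, per-row weight, visits in the epoch), copied from the counts above.
visits = [
    ("HE/85 anchor", 4.0, 20),
    ("MBPP/53 anchor", 2.0, 20),
    ("MBPP-class long poison", 0.5, 12),
    ("HE poison", 1.0, 158),
]

# Weight mass = weight * visits for each reported class.
mass = {name: w * n for name, w, n in visits}

for name, m in mass.items():
    print(f"{name:<24} {m:6.1f}")
```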

avg_ce=0.5390 (close to BD6.8D's 0.5429), avg_loss=0.6008 (higher than BD6.8D's 0.5377 because the ×4 on HE/85 dragged the weighted total up).
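
To illustrate why avg_loss can rise while avg_ce barely moves, here is a toy sketch. It assumes avg_loss is the mean of weight times CE divided by the row count (not by the sum of weights); the trainer's actual normalization is not shown in this report, and the CE values are made up for illustration.

```python
def avg_ce(ces):
    """Unweighted mean cross-entropy over rows."""
    return sum(ces) / len(ces)


def avg_weighted_loss(ces, weights):
    """Mean of weight * CE, divided by row count (assumed normalization)."""
    return sum(w * c for w, c in zip(weights, ces)) / len(ces)


# Toy per-row CE values; the last row stands in for a hard, long target.
ces = [0.4, 0.5, 0.6, 0.9]
w_mild = [1.0, 1.0, 0.5, 2.0]    # one downweighted row, hard row at x2
w_sharp = [1.0, 1.0, 1.0, 4.0]   # no downweight, hard row at x4

print(avg_ce(ces))
print(avg_weighted_loss(ces, w_mild))
print(avg_weighted_loss(ces, w_sharp))
```

The unweighted mean is identical in both cases; only the weighted figure moves, and it moves most when a high-CE row carries the largest weight.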

Headline result

| pass | trainer | anchor pass | LOST |
|------|---------|-------------|------|
| BD6.5 (peak) | bd6_5 (no weights) | 15/19 | MBPP/53, HE/34, HE/45, HE/85 |
| BD6.8D | bd6_8d {2/2/0.5} | 15/19 | HE/27, HE/34, HE/45, HE/85 (MBPP/53 RECOVERED) |
| BD6.8D2 | bd6_8d2 {2/4/0.5,1.0} | 13/19 | MBPP/53, HE/23, HE/27, HE/34, HE/45, HE/85 |

MBPP/53 regressed back to lost. The very holdout BD6.8D recovered fell again under the refined policy. HE/23 newly regressed — a previously-stable HE anchor. HE/85 still lost — even ×4 wasn't enough.
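
The pass-to-pass movement can be checked mechanically with set differences over the LOST column. The sets below are copied from the headline table; the helper name `diff` is just for illustration.

```python
# LOST sets per pass, copied from the headline table.
lost = {
    "BD6.5": {"MBPP/53", "HE/34", "HE/45", "HE/85"},
    "BD6.8D": {"HE/27", "HE/34", "HE/45", "HE/85"},
    "BD6.8D2": {"MBPP/53", "HE/23", "HE/27", "HE/34", "HE/45", "HE/85"},
}


def diff(prev, curr):
    """Rows recovered and rows newly lost when moving from prev to curr."""
    return {
        "recovered": sorted(lost[prev] - lost[curr]),
        "regressed": sorted(lost[curr] - lost[prev]),
    }


# Rows lost in every pass: candidates for "not training-soluble".
always_lost = sorted(set.intersection(*lost.values()))

print(diff("BD6.5", "BD6.8D"))
print(diff("BD6.8D", "BD6.8D2"))
print(always_lost)
```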

Why the refinement made things worse

Three knobs were changed simultaneously, each pulling in a slightly different direction:

HE/85 ×4.0: pulled the gradient very strongly toward this one row's pattern. With only ~5 HE-anchor copies in the stratified stream, this concentrated the LoRA's HE-side capacity onto a single very-long target. Side effect: HE/23 (previously easy) drifted because its pattern got crowded out.

HE-poison ×1.0 (was implicitly ×0.5 in BD6.8D for long ones): restored the full poison signal on the HE side, which BD6.8D2 wanted in order to avoid the HE/27 collateral. But this ALSO restored the full poison-vs-anchor tension on HE, which, combined with HE/85's ×4 demand, forces the LoRA into a tighter regime than it can represent at r=8.

MBPP/53 ×2.0 alone (without the helping global long-poison downweight): the MBPP anchor's ×2.0 boost was less effective because the long-poison downweight no longer suppressed the competing HE-poison signal.

In short: BD6.8D's policy worked by making the LoRA less ambitious overall (tighter cap on long-poison everywhere). BD6.8D2's policy asks the LoRA to be more ambitious (full HE-poison signal) AND more focused (×4 on HE/85) at once. r=8 capacity isn't enough for both demands.

Pattern across the BD6.x cycle (now decisive)

Every lever the cycle has tried follows the same shape: one moderate setting works, refining it breaks it.

| lever | peak setting | over-tuned next step | cost |
|------------------|-----------------------|----------------------|------|
| replication | 53 % stratified (BD6.5) | 63 % w/ holdout×50 (BD6.6) | 15 → 11 |
| KL distillation | (none; λ=0 was best) | λ=0.05/0.10/0.20 (BD6.7) | 15 → 12,12,10 |
| token-weighted CE | {2,2,0.5} (BD6.8D) | {2,4,0.5,1.0} (BD6.8D2) | 15 → 13 |

The local maximum sits at exactly 15/19 under any single training-only lever. The 4 holdouts are not all training-soluble:

HE/34, HE/45 are decoder-noise (BD6.8F+ proven)
MBPP/53 is training-soluble (BD6.8D proved with ×2.0)
HE/85 has resisted both ×2.0 and ×4.0, and ×4.0 actually broke the rest of the system around it

We are at the r=8 LoRA capacity ceiling for this organ at this runtime configuration. Further training-only experiments at r=8 will not move the needle.

Per-row v8b table

KEPT (13):
  ✓ MBPP/17, MBPP/19, MBPP/20, MBPP/41, MBPP/51, MBPP/52,
  ✓ MBPP/64, MBPP/90, MBPP/93, MBPP/96, MBPP/99, MBPP/105
  ✓ HumanEval/53

LOST (6):
  ✗ MBPP/53        ← REGRESSED FROM BD6.8D recovery
  ✗ HumanEval/23   ← NEW REGRESSION (was stable in BD6.5/8D)
  ✗ HumanEval/27   ← still lost
  ✗ HumanEval/34   ← decoder-noise
  ✗ HumanEval/45   ← decoder-noise
  ✗ HumanEval/85   ← still lost (×4 insufficient)

What stays open

If the user wants to pursue further, only one orthogonal lever remains:

BD6.8D-rank — increase LoRA capacity

r=16, alpha=32 (2× param count; LoRA parameters scale linearly with r)
same {MBPP/53 ×2, HE/85 ×2} weight policy as BD6.8D (not D2)
same dataset
tests whether r=8 was the capacity ceiling

If r=16 doesn't move HE/85, the ceiling is truly architecture + verifier + decoder, not training-side.
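
For sizing the rank bump: a LoRA adapter on a d_in × d_out matrix adds r·(d_in + d_out) trainable parameters, so the count grows linearly in r. The sketch below checks the ratio and the alpha/r scale. The layer dimensions are hypothetical (the organ's real shapes are not given here), and the alpha/r scaling assumes standard LoRA, not a variant such as rsLoRA.

```python
def lora_params(r: int, d_in: int, d_out: int) -> int:
    """Trainable params for one adapted matrix: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


def lora_scale(alpha: float, r: int) -> float:
    """Standard LoRA scaling applied to the low-rank update (assumed)."""
    return alpha / r


D_IN = D_OUT = 1024  # hypothetical layer shape, for illustration only

ratio = lora_params(16, D_IN, D_OUT) / lora_params(8, D_IN, D_OUT)
print(ratio)                                    # parameter ratio r=16 vs r=8
print(lora_scale(16, 8), lora_scale(32, 16))    # update scale before and after
```

Under these assumptions, moving from r=8/alpha=16 to r=16/alpha=32 doubles adapter capacity while leaving the update scale unchanged, which keeps the comparison against BD6.8D close to a pure capacity test.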

Otherwise: ship

Production stays at BD6 pass-1 (anchor 19/19 under current decoder).
BD6.5 is documented as best-known LoRA artifact (15/19, weights are cleaner than gate shows under greedy).

MBPP B = 13/100, HE B = 6/164, LCB = 0/50 stand as the bench numbers.

Production state (after BD6.8D2 revert)

PHYS05_PACK = physarum05b_code_skeleton.planck (BD6 pass-1).
phys05_code_skeleton spec: rep=1.15, ng=2, cuda_rep=1.05 (unchanged).
physarum05b_code_skeleton_v8b.planck archived.
tools/surgery/output/code_skeleton_lora_v8b/ archived (final + 5 mid-checkpoints).
tools/surgery/output/Physarum05B-CodeSkeleton-v8b/ archived (rejected merged HF dir).

Files this pass touched

tools/surgery/train_code_skeleton_lora_bd6_8d2.py — new, refined trainer
physarum05b_code_skeleton_v8b.planck — repacked (rejected)
tools/surgery/output/code_skeleton_lora_v8b/ — adapter + ckpts (rejected)
tools/surgery/output/Physarum05B-CodeSkeleton-v8b/ — merged HF dir (rejected)
src/organs/organ_manager.cpp::PHYS05_PACK — flipped to v8b then back to v1
reports/BD6_8D2_OVER_TUNED.md — this file

What this proves

Token-weighted CE has a saturation point, exactly like replication and (effectively) like KL. Pushing past it breaks the LoRA's anchor consistency more than it helps the targeted holdout.

r=8 LoRA capacity is the binding constraint. Three different training-side levers all peak at 15/19. Either the rank needs to go up, or the runtime/verifier side has the remaining headroom.

The strict gate has held seven times in a row (BD6.3, .4, .5, .6, .7, .8D, .8D2). Production has stayed at 19/19 throughout. The gate is doing exactly what it was designed for.
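
The gate's exact rule is not spelled out in this report. The sketch below assumes the simplest reading consistent with the results: a candidate is rejected if it loses any anchor that production currently passes. The function name and return shape are hypothetical.

```python
# The 19 anchors, taken from the per-row v8b table.
ANCHORS = (
    {f"MBPP/{n}" for n in (17, 19, 20, 41, 51, 52, 53, 64, 90, 93, 96, 99, 105)}
    | {f"HumanEval/{n}" for n in (23, 27, 34, 45, 53, 85)}
)


def strict_gate(production_pass, candidate_pass):
    """Assumed rule: accept only if no production-passing anchor is lost."""
    regressions = sorted(production_pass - candidate_pass)
    return len(regressions) == 0, regressions


production_pass = set(ANCHORS)  # BD6 pass-1: 19/19
bd6_8d2_pass = ANCHORS - {
    "MBPP/53", "HumanEval/23", "HumanEval/27",
    "HumanEval/34", "HumanEval/45", "HumanEval/85",
}

accepted, regressions = strict_gate(production_pass, bd6_8d2_pass)
print(accepted, regressions)
```

Under this reading, any candidate below 19/19 is rejected regardless of which holdout it recovers, which matches the seven consecutive rejections.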

The cleanest read of the BD6.x cycle: BD6.5's stratified, lr=5e-5, no-KL, no-weighted recipe is the best training-only result achievable at r=8/alpha=16. Anything beyond requires either rank bump or decoder/verifier surgery — neither of which is a "finish BD6" task.