BD6.8D2 — token-weighted CE refinement OVERTUNED, peak is BD6.8D (2026-05-02)
TL;DR — refining BD6.8D's weighting policy from {MBPP/53 ×2, HE/85 ×2, all-long-poison ×0.5} to {MBPP/53 ×2, HE/85 ×4, MBPP-poison-long ×0.5, HE-poison ×1.0} dropped anchor pass-rate from 15/19 → 13/19 AND regressed MBPP/53 (which BD6.8D had recovered). Six rows lost in v8b. Strict gate rejects. Production reverted. The BD6.x cycle's pattern holds: each lever has a saturation point, and pushing beyond it destabilizes. BD6.8D's modest weighting {2/2/0.5} = 15/19 was the peak of the token-weighted CE lever — exactly mirroring how BD6.5 was the peak of replication and BD6.7 KL had no useful rung. Training-only levers are now exhausted.
What changed vs BD6.8D
Per-row weight policy refined per user spec:
| row class                                | BD6.8D        | BD6.8D2      |
|------------------------------------------|---------------|--------------|
| anchor MBPP/53                           | 2.0           | 2.0          |
| anchor HumanEval/85                      | 2.0           | 4.0          |
| anchor (other 17)                        | 1.0           | 1.0          |
| poison MBPP-class + tokens > 100 (long)  | 0.5           | 0.5          |
| poison HumanEval-class                   | 0.5 (if long) | 1.0 (always) |
| poison short                             | 1.0           | 1.0          |
Hyperparams identical: r=8, alpha=16, lr=5e-5, ep=1, ckpt-step=100. Dataset bd6_5_mixed_train.jsonl unchanged.
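The two weight policies in the table can be sketched as plain functions. This is a minimal illustration, not the trainer's actual code: the `Row` fields, the function names, and the `LONG_TOKENS` constant are hypothetical, and only the weights and the "tokens > 100" threshold come from the table.

```python
from dataclasses import dataclass

LONG_TOKENS = 100  # "long" means tokens > 100, per the table


@dataclass(frozen=True)
class Row:
    # Hypothetical row schema; the real dataset fields are not shown in this report.
    kind: str      # "anchor" or "poison"
    bench: str     # "MBPP" or "HumanEval"
    task_id: str   # e.g. "MBPP/53"
    n_tokens: int  # target length in tokens


def weight_bd6_8d(row: Row) -> float:
    """BD6.8D policy: {MBPP/53 x2, HE/85 x2, all long poison x0.5}."""
    if row.kind == "anchor":
        return 2.0 if row.task_id in ("MBPP/53", "HumanEval/85") else 1.0
    # Every long poison row is halved, regardless of bench.
    return 0.5 if row.n_tokens > LONG_TOKENS else 1.0


def weight_bd6_8d2(row: Row) -> float:
    """BD6.8D2 policy: {MBPP/53 x2, HE/85 x4, MBPP long poison x0.5, HE poison x1.0}."""
    if row.kind == "anchor":
        return {"MBPP/53": 2.0, "HumanEval/85": 4.0}.get(row.task_id, 1.0)
    if row.bench == "HumanEval":
        return 1.0  # HE poison always keeps full weight
    return 0.5 if row.n_tokens > LONG_TOKENS else 1.0
```

The only rows whose weight differs between the two policies are the HE/85 anchor and long HumanEval-class poison rows.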
Trainer counts during epoch:
- HE/85 anchor ×4 visited 20 times
- MBPP/53 anchor ×2 visited 20 times
- MBPP-class long poison ×0.5: 12 visits
- HE poison ×1.0: 158 visits
avg_ce=0.5390 (close to BD6.8D's 0.5429), avg_loss=0.6008 (higher than BD6.8D's 0.5377 because the ×4 on HE/85 dragged the weighted total up).
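The gap between avg_ce and avg_loss can be illustrated with a small sketch. It assumes that avg_ce is the unweighted mean of per-row cross-entropy and avg_loss is the mean of weight × cross-entropy with no normalization by the weight sum; the report does not state the trainer's exact reduction, so treat this as one plausible reading. The numbers are made up for illustration.

```python
def epoch_averages(steps):
    """steps: iterable of (ce, weight) pairs, one per training step.

    Returns (avg_ce, avg_loss) under the assumed reductions:
    avg_ce is the plain mean, avg_loss is the mean of ce * weight.
    """
    steps = list(steps)
    n = len(steps)
    avg_ce = sum(ce for ce, _ in steps) / n
    avg_loss = sum(ce * w for ce, w in steps) / n
    return avg_ce, avg_loss


# Toy values only: identical per-row CE, different weight policies.
policy_a = [(0.5, 1.0), (0.6, 0.5), (0.9, 2.0)]  # BD6.8D-style weights
policy_b = [(0.5, 1.0), (0.6, 1.0), (0.9, 4.0)]  # BD6.8D2-style weights

ce_a, loss_a = epoch_averages(policy_a)
ce_b, loss_b = epoch_averages(policy_b)
```

With identical per-row CE, `ce_a` and `ce_b` match while `loss_b` exceeds `loss_a`, which is the same direction as the reported numbers: near-equal avg_ce, higher avg_loss under the heavier weights.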
Headline result
| pass         | trainer               | anchor pass | LOST                                           |
|--------------|-----------------------|-------------|------------------------------------------------|
| BD6.5 (peak) | bd6_5 (no weights)    | 15/19       | MBPP/53, HE/34, HE/45, HE/85                   |
| BD6.8D       | bd6_8d {2/2/0.5}      | 15/19       | HE/27, HE/34, HE/45, HE/85 (MBPP/53 RECOVERED) |
| BD6.8D2      | bd6_8d2 {2/4/0.5,1.0} | 13/19       | MBPP/53, HE/23, HE/27, HE/34, HE/45, HE/85     |
MBPP/53 regressed back to lost. The very holdout BD6.8D recovered fell again under the refined policy. HE/23 newly regressed — a previously-stable HE anchor. HE/85 still lost — even ×4 wasn't enough.
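The row-level movement between passes can be recomputed directly from the LOST column with set differences. The sketch below hard-codes the LOST sets from the headline table; the `compare` helper and its output keys are hypothetical names chosen for illustration.

```python
N_ANCHORS = 19

# LOST sets copied from the headline table.
LOST = {
    "BD6.5": {"MBPP/53", "HE/34", "HE/45", "HE/85"},
    "BD6.8D": {"HE/27", "HE/34", "HE/45", "HE/85"},
    "BD6.8D2": {"MBPP/53", "HE/23", "HE/27", "HE/34", "HE/45", "HE/85"},
}


def compare(base: str, cand: str) -> dict:
    """Rows that moved between two passes, plus the candidate's pass count."""
    return {
        "regressed": sorted(LOST[cand] - LOST[base]),  # lost now, kept before
        "recovered": sorted(LOST[base] - LOST[cand]),  # kept now, lost before
        "passed": N_ANCHORS - len(LOST[cand]),
    }


d_vs_d2 = compare("BD6.8D", "BD6.8D2")
# d_vs_d2["regressed"] is ["HE/23", "MBPP/53"], d_vs_d2["recovered"] is []
```

Against BD6.8D, the refined policy regresses two rows and recovers none, which accounts for the 15/19 → 13/19 drop.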
Why the refinement made things worse
Three effects combined, each pulling in a slightly different direction:
- HE/85 ×4.0: pulled the gradient very strongly toward this one row's pattern. With only ~5 HE-anchor copies in the stratified stream, this concentrated the LoRA's HE-side capacity onto a single very long target. Side effect: HE/23 (previously easy) drifted because its pattern got crowded out.
- HE-poison ×1.0 (was ×0.5 in BD6.8D for long rows): restored the full poison signal on the HE side, which is what BD6.8D2 wanted in order to avoid HE/27 collateral. But this also restored the full poison-vs-anchor tension on HE, which, combined with HE/85's ×4 demand, forces the LoRA into a tighter regime than it can represent at r=8.
- MBPP/53 ×2.0 alone (without the global long-poison downweight that helped it in BD6.8D): the MBPP anchor's ×2.0 boost was less effective because the long-poison downweight no longer suppressed the competing HE-poison signal.
In short: BD6.8D's policy worked by making the LoRA less ambitious overall (tighter cap on long-poison everywhere). BD6.8D2's policy asks the LoRA to be more ambitious (full HE-poison signal) AND more focused (×4 on HE/85) at once. r=8 capacity isn't enough for both demands.
Pattern across the BD6.x cycle (now decisive)
Every lever the cycle has tried follows the same shape: one moderate setting works, refining it breaks it.
| lever             | peak setting            | over-tuned next step       | cost          |
|-------------------|-------------------------|----------------------------|---------------|
| replication       | 53 % stratified (BD6.5) | 63 % w/ holdout×50 (BD6.6) | 15 → 11       |
| KL distillation   | (none: λ=0 was best)    | λ=0.05/0.10/0.20 (BD6.7)   | 15 → 12,12,10 |
| token-weighted CE | {2,2,0.5} (BD6.8D)      | {2,4,0.5,1.0} (BD6.8D2)    | 15 → 13       |
The local maximum sits at exactly 15/19 under any single training-only lever. The 4 holdouts are not all training-soluble:
- HE/34, HE/45 are decoder-noise (BD6.8F+ proven)
- MBPP/53 is training-soluble (BD6.8D proved with ×2.0)
- HE/85 has resisted both ×2.0 and ×4.0, and ×4.0 actually broke the rest of the system around it
We are at the r=8 LoRA capacity ceiling for this organ at this runtime configuration. Further training-only experiments will not move the needle.
Per-row v8b table
KEPT (13):
✓ MBPP/17, MBPP/19, MBPP/20, MBPP/41, MBPP/51, MBPP/52,
✓ MBPP/64, MBPP/90, MBPP/93, MBPP/96, MBPP/99, MBPP/105
✓ HumanEval/53
LOST (6):
✗ MBPP/53 ← REGRESSED FROM BD6.8D recovery
✗ HumanEval/23 ← NEW REGRESSION (was stable in BD6.5/8D)
✗ HumanEval/27 ← still lost
✗ HumanEval/34 ← decoder-noise
✗ HumanEval/45 ← decoder-noise
✗ HumanEval/85 ← still lost (×4 insufficient)
What stays open
If the user wants to pursue further, only one orthogonal lever remains:
BD6.8D-rank — increase LoRA capacity
- r=16, alpha=32 (2× adapter param count, same alpha/r ratio)
- same {MBPP/53 ×2, HE/85 ×2} weight policy as BD6.8D (not D2)
- same dataset
- tests whether r=8 was the capacity ceiling
If r=16 doesn't move HE/85, the ceiling is truly architecture + verifier + decoder, not training-side.
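For sizing the rank experiment, the adapter cost can be estimated with the standard LoRA formulation, where each adapted weight matrix of shape d_out × d_in gains two factors, A (r × d_in) and B (d_out × r), and the update is scaled by alpha / r. The sketch below uses a hypothetical 1024 × 1024 matrix; the report does not state the organ's actual layer shapes or which modules are adapted.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    # A has shape (r, d_in), B has shape (d_out, r).
    return r * d_in + d_out * r


def lora_scale(alpha: float, r: int) -> float:
    """Scaling applied to the low-rank update in standard LoRA."""
    return alpha / r


D_IN = D_OUT = 1024  # hypothetical layer shape, for illustration only

p8 = lora_params(D_IN, D_OUT, 8)    # r=8 baseline
p16 = lora_params(D_IN, D_OUT, 16)  # proposed r=16
```

Under this formulation the parameter count grows linearly in r, and pairing r=16 with alpha=32 keeps the alpha / r scale identical to the r=8, alpha=16 runs, so the rank change is the only capacity variable being tested.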
Otherwise: ship
- Production stays at BD6 pass-1 (anchor 19/19 under current decoder).
- BD6.5 is documented as best-known LoRA artifact (15/19, weights are cleaner than gate shows under greedy).
- MBPP B = 13/100, HE B = 6/164, LCB = 0/50 stand as the bench numbers.
Production state (after BD6.8D2 revert)
- PHYS05_PACK = physarum05b_code_skeleton.planck (BD6 pass-1).
- phys05_code_skeleton spec: rep=1.15, ng=2, cuda_rep=1.05 (unchanged).
- physarum05b_code_skeleton_v8b.planck archived.
- tools/surgery/output/code_skeleton_lora_v8b/ archived (final + 5 mid-checkpoints).
- tools/surgery/output/Physarum05B-CodeSkeleton-v8b/ archived (rejected merged HF dir).
Files this pass touched
- tools/surgery/train_code_skeleton_lora_bd6_8d2.py: new, refined trainer
- physarum05b_code_skeleton_v8b.planck: repacked (rejected)
- tools/surgery/output/code_skeleton_lora_v8b/: adapter + ckpts (rejected)
- tools/surgery/output/Physarum05B-CodeSkeleton-v8b/: merged HF dir (rejected)
- src/organs/organ_manager.cpp::PHYS05_PACK: flipped to v8b then back to v1
- reports/BD6_8D2_OVER_TUNED.md: this file
What this proves
- Token-weighted CE has a saturation point, exactly like replication and (effectively) like KL. Pushing past it breaks the LoRA's anchor consistency more than it helps the targeted holdout.
- r=8 LoRA capacity is the binding constraint. Three different training-side levers all peak at 15/19. Either the rank needs to go up, or the runtime/verifier side has the remaining headroom.
- The strict gate has held seven times in a row (BD6.3, .4, .5, .6, .7, .8D, .8D2). Production has stayed at 19/19 throughout. The gate is doing exactly what it was designed for.
The cleanest read of the BD6.x cycle: BD6.5's stratified, lr=5e-5, no-KL, no-weighted recipe is the best training-only result achievable at r=8/alpha=16. Anything beyond requires either rank bump or decoder/verifier surgery — neither of which is a "finish BD6" task.