BD6.8D — token-weighted CE, MBPP fully fixed but HE side collapses (2026-05-02)
TL;DR — Token-weighted CE delivered the FIRST surgery pass to fully recover MBPP/53 (the 256-char weight-gap holdout from BD6.5). All 13 MBPP anchors now pass. But HumanEval side regressed: HE/27 (which was stable in BD6.5) lost, plus HE/34, /45, /85 still lost. Net 15/19 — same number as BD6.5, but a structurally different lost set (0 MBPP losses + 4 HE losses, vs BD6.5's 1 MBPP + 3 HE). Strict gate rejects. Production reverted.
The lever is real: targeted ×2.0 weight on MBPP/53 did fix it. But ×2.0 wasn't enough for the longer HE/85 (366 chars), and the poison-long downweight × 0.5 starved the HumanEval poison signal, breaking HE/27. Two follow-ups remain (BD6.8D2 with stronger holdout mult on HE/85, OR per-bench poison weighting), but BD6.5 (15/19) remains the local peak by anchor count.
What changed vs BD6.5
Only the trainer's per-row loss weighting. Dataset (bd6_5_mixed_train.jsonl, 525 rows, 53 % anchor share, stratified) and hyperparams (r=8, alpha=16, lr=5e-5, ep=1) identical to BD6.5.
Weighting policy:
| row class | weight |
|---|---|
| anchor in {MBPP/53, HumanEval/85} (the holdouts) | 2.0 |
| anchor (other 17) | 1.0 |
| poison + target_tokens > 100 | 0.5 |
| poison + target_tokens ≤ 100 | 1.0 |
Per-step loss is out.loss × weight (out.loss is the HF mean CE), so each step's gradient is scaled by that row's weight. Stratified anchor/poison alternation as in BD6.5.
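The weighting policy above can be sketched as a small lookup. This is a minimal illustration, not the bd6_8d trainer itself: the `Row` schema and its field names (`kind`, `task_id`, `target_tokens`) are assumptions, and the real JSONL may use different keys.

```python
from dataclasses import dataclass

HOLDOUTS = {"MBPP/53", "HumanEval/85"}  # the two targeted holdout anchors
HOLDOUT_MULT = 2.0
LONG_POISON_MULT = 0.5
LONG_POISON_TOKENS = 100  # poison rows above this target length are downweighted


@dataclass
class Row:
    # Hypothetical row schema; the real dataset fields may be named differently.
    kind: str           # "anchor" or "poison"
    task_id: str        # e.g. "MBPP/53"
    target_tokens: int  # token count of the target completion


def row_weight(row: Row) -> float:
    """Return the loss multiplier for one training row."""
    if row.kind == "anchor":
        return HOLDOUT_MULT if row.task_id in HOLDOUTS else 1.0
    if row.kind == "poison" and row.target_tokens > LONG_POISON_TOKENS:
        return LONG_POISON_MULT
    return 1.0


def weighted_loss(mean_ce: float, row: Row) -> float:
    """Scale the per-row mean CE (the value out.loss holds) by the row weight."""
    return mean_ce * row_weight(row)
```

In a training loop the same multiplier would be applied to the loss tensor before the backward pass, which scales the whole row's gradient by the weight.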
Counts during the epoch: 40 holdout steps at ×2.0 (MBPP/53 × 20 + HE/85 × 20, since each holdout has 20 reps in bd6_5_mixed_train.jsonl) and 157 long-poison steps at ×0.5.
Loss curve summary: avg_ce=0.5429, avg_loss=0.5377. avg_loss sits below avg_ce because the ×0.5 long-poison downweighting pulled the weighted total below the unweighted mean CE.
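As a sanity check on those numbers, the mean row weight over the epoch can be computed from the step counts. The sketch assumes one row per step, that every row outside the two counted classes has weight 1.0, and that avg_loss is the mean of weight × CE; none of these is confirmed by the trainer source.

```python
TOTAL_ROWS = 525
HOLDOUT_STEPS = 40        # weight 2.0
LONG_POISON_STEPS = 157   # weight 0.5
other_steps = TOTAL_ROWS - HOLDOUT_STEPS - LONG_POISON_STEPS  # weight 1.0

# Mean multiplier applied across the epoch.
mean_weight = (
    2.0 * HOLDOUT_STEPS + 0.5 * LONG_POISON_STEPS + 1.0 * other_steps
) / TOTAL_ROWS

# Ratio actually observed in the loss summary.
observed_ratio = 0.5377 / 0.5429

print(f"other_steps={other_steps}")
print(f"mean_weight={mean_weight:.4f}")
print(f"observed avg_loss/avg_ce={observed_ratio:.4f}")
```

If CE were uniform across weight classes, avg_loss/avg_ce would equal the mean weight (about 0.927). The observed ratio is closer to 0.990, which under these assumptions would mean the upweighted rows carried above-average CE, the downweighted long-poison rows carried below-average CE, or both.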
Headline results
| pass | dataset | trainer | anchor pass | gate |
|------|---------|---------|-------------|------|
| BD6.5 (peak) | bd6_5_mixed_train | bd6_5 | 15/19 | REVERT |
| BD6.8D | bd6_5_mixed_train (UNCHANGED) | bd6_8d (token-weighted CE) | 15/19 | REVERT |
Same anchor count, structurally different lost set:
| pass | KEPT | LOST |
|------|------|------|
| BD6.5 | 12 MBPP + 3 HE | MBPP/53, HE/34, HE/45, HE/85 |
| BD6.8D | 13 MBPP + 2 HE | HE/27, HE/34, HE/45, HE/85 |
MBPP/53 RECOVERED. All 13 MBPP anchors now pass. The targeted ×2.0 weight on the explicit holdout worked.
HE/27 lost. A previously stable HE anchor regressed because the long-poison ×0.5 downweighting starved the HumanEval poison signal (HE poison rows tend to be longer than MBPP poison rows, so they disproportionately got 0.5×).
HE/85 still lost. The other targeted holdout (366 chars, longest target in the set). ×2.0 wasn't enough. May need ×3-4.
HE/34, HE/45 still lost. These are decoder-noise per BD6.8F+ diagnostic — they're not fixable via training-time weighting at all (only by softening the production decoder, which BD6.8F+ proved isn't safe).
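The "same count, different lost set" comparison reduces to set arithmetic on the two lost sets reported above. A minimal sketch, with variable names chosen for illustration:

```python
# Lost anchors per pass, copied from the KEPT/LOST comparison.
lost_bd6_5 = {"MBPP/53", "HumanEval/34", "HumanEval/45", "HumanEval/85"}
lost_bd6_8d = {"HumanEval/27", "HumanEval/34", "HumanEval/45", "HumanEval/85"}

recovered = lost_bd6_5 - lost_bd6_8d   # lost before, passing now
regressed = lost_bd6_8d - lost_bd6_5   # passing before, lost now
still_lost = lost_bd6_5 & lost_bd6_8d  # lost in both passes

print("recovered :", sorted(recovered))
print("regressed :", sorted(regressed))
print("still lost:", sorted(still_lost))
```

One recovery and one regression cancel out, which is why the anchor count stays at 15/19 even though the composition changed.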
What this proves
- The diagnostic from BD6.8F+ was correct. The 4 BD6.5 holdouts split 2+2:
  - MBPP/53: real weight gap, and BD6.8D fixed it via ×2.0 targeted weight. ✓ proven fixable.
  - HE/85: also classified as real weight gap, but ×2.0 wasn't enough. Needs more aggressive weighting OR the LoRA's rank/capacity isn't sufficient for a 366-char Python target.
  - HE/34, HE/45: decoder-noise. As expected, BD6.8D didn't touch them (they remain lost under production decoder).
- Token-weighted CE is a clean per-row gradient lever. It works exactly as designed: targeted anchors get more learning signal, targeted poison rows get less. No KL, no replication change, no dataset shape change.
- The poison-long downweight (×0.5) is too coarse. It applies uniformly across HE+MBPP+LCB poison; HE poison is disproportionately long, so HE poison signal gets cut more than MBPP. This caused HE/27 collateral. Future: per-bench poison weighting, not global length-only.
Per-row table
KEPT (15):
✓ MBPP/17, MBPP/19, MBPP/20, MBPP/41, MBPP/51, MBPP/52,
✓ MBPP/53 ← RECOVERED FROM BD6.5 HOLDOUT (token-weighted CE worked)
✓ MBPP/64, MBPP/90, MBPP/93, MBPP/96, MBPP/99, MBPP/105
✓ HumanEval/23, HumanEval/53
LOST (4):
✗ HumanEval/27 ← NEW REGRESSION (was kept in BD6.5)
✗ HumanEval/34 ← decoder-noise (BD6.8F+ analysis)
✗ HumanEval/45 ← decoder-noise (BD6.8F+ analysis)
✗ HumanEval/85 ← still weight gap, ×2.0 insufficient (366 char target)
What stays open after BD6.8D
Out of 4 BD6.5 holdouts, 1 is fixed (MBPP/53), 1 is still open (HE/85 needs more weight or more capacity), and 2 are decoder-bound (HE/34, HE/45: can't be fixed without breaking production decoder). On top of that there is 1 collateral regression (HE/27). All 4 remaining losses are on the HumanEval side; any next attempt should focus on HE without breaking MBPP.
Numbers across full BD6.x cycle
| pass | trainer change | anchor | note |
|------|----------------|--------|------|
| BD6 pass-1 | poison v1 (no anchor) | 19/19 | KEEP |
| BD6.3 | fresh poison only | 0/19 | REVERT: catastrophic forgetting |
| BD6.4 | + 5× anchor replication | 7/19 | REVERT |
| BD6.5 | + bench-aware repl + stratified, 53 % anchor | 15/19 | REVERT (peak) |
| BD6.6 | + holdout × 50, 63 % anchor | 11/19 | REVERT: over-anchor |
| BD6.7a/b/c | + KL anchor ladder λ=0.10/0.20/0.05 | 12/10/12 | REVERT: KL redundant |
| BD6.8F | runtime determinism probe (no training) | n/a | diagnostic: 2+2 split |
| BD6.8F+ | (rep_penalty, ngram) decoder grid | n/a | diagnostic: no overlap |
| BD6.8D | token-weighted CE (no other changes) | 15/19 | REVERT: MBPP fixed, HE collateral |
Possible follow-ups (pending user GO; do not act)
BD6.8D2 — refine targeting
Two parallel adjustments to the same trainer:
- Raise holdout_mult on HE/85 specifically to 3.0 or 4.0 (only HE/85, keep MBPP/53 at 2.0).
- Replace the global long-poison ×0.5 with per-bench poison weight: scale only MBPP-source long poison, keep HE-source poison at full 1.0 (preserves HE/27 etc).
Both are 5-line changes to bd6_8d trainer. ~25 min training + gate.
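A minimal sketch of what the two adjustments could look like together. It is a hypothetical illustration, not the proposed patch: the `Row` fields, the `bench` label, the choice of 3.0 from the 3.0 to 4.0 range, and leaving non-MBPP benches (including LCB) at 1.0 are all assumptions.

```python
from dataclasses import dataclass

# Per-holdout multipliers; 3.0 is a placeholder within the proposed 3.0-4.0 range.
HOLDOUT_MULT = {"MBPP/53": 2.0, "HumanEval/85": 3.0}
# Only MBPP-source long poison is downweighted; other benches default to 1.0.
LONG_POISON_MULT = {"MBPP": 0.5}
LONG_POISON_TOKENS = 100


@dataclass
class Row:
    # Hypothetical row schema; real field names may differ.
    kind: str           # "anchor" or "poison"
    task_id: str        # e.g. "HumanEval/85"
    bench: str          # e.g. "MBPP", "HumanEval", "LCB"
    target_tokens: int


def row_weight_d2(row: Row) -> float:
    """Loss multiplier with per-holdout and per-bench poison weighting."""
    if row.kind == "anchor":
        return HOLDOUT_MULT.get(row.task_id, 1.0)
    if row.kind == "poison" and row.target_tokens > LONG_POISON_TOKENS:
        return LONG_POISON_MULT.get(row.bench, 1.0)
    return 1.0
```

Keeping both tables as plain dictionaries makes each follow-up a one-entry change, so the HE/85 multiplier and the per-bench poison weights can be varied independently between runs.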
BD6.8D-rank — increase LoRA capacity
If HE/85 cannot be fixed even at 4×, the issue may be insufficient rank for a 366-char target. Try r=16, alpha=32 (2× adapter param count, since LoRA parameters scale linearly with rank; alpha/r stays at 2). ~30-40 min training + gate.
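For reference, a standard LoRA adapter on one linear layer adds two matrices, A (r × d_in) and B (d_out × r), so its trainable parameter count is r × (d_in + d_out). The sketch below compares the two rank settings; the layer dimensions are hypothetical placeholders, not the model's actual sizes.

```python
def lora_params(r: int, d_in: int, d_out: int) -> int:
    """Trainable parameters added by one LoRA adapter: A (r x d_in) + B (d_out x r)."""
    return r * (d_in + d_out)


def lora_scale(alpha: float, r: int) -> float:
    """Standard LoRA scaling factor applied to the low-rank update."""
    return alpha / r


# Hypothetical layer shape, used only to illustrate the ratio.
D_IN, D_OUT = 1024, 1024

current = lora_params(8, D_IN, D_OUT)
proposed = lora_params(16, D_IN, D_OUT)

print(f"r=8  -> {current} params, scale {lora_scale(16, 8)}")
print(f"r=16 -> {proposed} params, scale {lora_scale(32, 16)}")
print(f"ratio: {proposed / current}")
```

The ratio is independent of the layer shape, and doubling alpha together with r keeps the update scale unchanged, so the rank bump changes capacity without changing the effective step size of the adapter.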
BD6.9 — accept BD6.5 as ceiling
If neither D2 nor the rank bump moves HE/85, accept BD6.5's 15/19 as the real anchor ceiling for this organ at this runtime+verifier configuration. Promote BD6.5 to "tier-1 alternative" behind a feature flag with explicit per-bench gating, never as default production.
Production state (after BD6.8D revert)
- PHYS05_PACK = physarum05b_code_skeleton.planck (BD6 pass-1).
- phys05_code_skeleton spec: rep=1.15, ng=2, cuda_rep=1.05 (unchanged).
- Anchor 19/19 verified post-revert.
- physarum05b_code_skeleton_v8.planck archived.
- tools/surgery/output/code_skeleton_lora_v8/ archived (final + 5 mid-checkpoints @ 100, 200, 300, 400, 500).
- tools/surgery/output/Physarum05B-CodeSkeleton-v8/ archived.
Files this pass touched
- tools/surgery/train_code_skeleton_lora_bd6_8d.py: new, token-weighted CE trainer
- physarum05b_code_skeleton_v8.planck: repacked (rejected, archived)
- tools/surgery/output/code_skeleton_lora_v8/: adapter + 5 checkpoints (rejected)
- tools/surgery/output/Physarum05B-CodeSkeleton-v8/: merged HF dir (rejected)
- src/organs/organ_manager.cpp::PHYS05_PACK: flipped to v8 then back to v1
- reports/BD6_8D_TOKEN_WEIGHTED_CE.md: this file
The targeted lever works. Strict gate still rejects. The remaining HE-side losses point at LoRA capacity, decoder fragility, and asymmetric poison weighting — three different problems at the boundary of what training-only surgery can do.