BD6.8D — token-weighted CE, MBPP fully fixed but HE side collapses (2026-05-02)

TL;DR: Token-weighted CE is the first surgery pass to fully recover MBPP/53 (the 256-char weight-gap holdout from BD6.5), so all 13 MBPP anchors now pass. The HumanEval side regressed: HE/27 (stable in BD6.5) is now lost, and HE/34, HE/45, HE/85 remain lost. Net 15/19, the same count as BD6.5 but a structurally different lost set (0 MBPP + 4 HE losses, vs BD6.5's 1 MBPP + 3 HE). Strict gate rejects. Production reverted.

The lever is real: the targeted ×2.0 weight on MBPP/53 did fix it. But ×2.0 wasn't enough for the longer HE/85 (366 chars), and the long-poison ×0.5 downweight starved the HumanEval poison signal, breaking HE/27. Two follow-ups remain (BD6.8D2 with a stronger holdout mult on HE/85, OR per-bench poison weighting), but BD6.5 (15/19) remains the local peak by anchor count.

What changed vs BD6.5

Only the trainer's per-row loss weighting. Dataset (bd6_5_mixed_train.jsonl, 525 rows, 53 % anchor share, stratified) and hyperparams (r=8, alpha=16, lr=5e-5, ep=1) identical to BD6.5.

Weighting policy:

| row class | weight |
|---|---|
| anchor in {MBPP/53, HumanEval/85} (the holdouts) | 2.0 |
| anchor (other 17) | 1.0 |
| poison + target_tokens > 100 | 0.5 |
| poison + target_tokens ≤ 100 | 1.0 |

The per-step loss is out.loss × weight (out.loss is the HF mean CE), so each step's gradient is scaled by the same factor. Stratified anchor/poison alternation as in BD6.5.
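The weighting policy above can be sketched as a small lookup function. This is a minimal sketch under stated assumptions, not the code in train_code_skeleton_lora_bd6_8d.py: the row fields (`kind`, `task_id`, `target_tokens`) and both function names are hypothetical, and it assumes one row per optimizer step.

```python
# Hypothetical sketch of the per-row weighting policy; field names are assumptions.
HOLDOUTS = {"MBPP/53", "HumanEval/85"}
HOLDOUT_MULT = 2.0
LONG_POISON_TOKENS = 100
LONG_POISON_WEIGHT = 0.5


def row_weight(row: dict) -> float:
    """Return the scalar loss weight for one training row."""
    if row["kind"] == "anchor":
        # Only the two explicit holdouts are upweighted; other anchors stay at 1.0.
        return HOLDOUT_MULT if row["task_id"] in HOLDOUTS else 1.0
    if row["kind"] == "poison" and row["target_tokens"] > LONG_POISON_TOKENS:
        # Long poison targets are downweighted regardless of source bench.
        return LONG_POISON_WEIGHT
    return 1.0


def weighted_step_loss(mean_ce: float, row: dict) -> float:
    """Scale the mean CE of a single-row step by that row's weight."""
    return mean_ce * row_weight(row)
```

In a real trainer `mean_ce` would be the `out.loss` tensor rather than a float, and the scaled value is what `backward()` would be called on.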

Counts during the epoch: 40 holdout steps at ×2.0 weight (MBPP/53 × 20 + HE/85 × 20, since each holdout has 20 reps in bd6_5_mixed_train.jsonl) and 157 long-poison steps at ×0.5 weight.

Loss curve summary: avg_ce=0.5429, avg_loss=0.5377 (loss<ce because 0.5× poison-long downweighting pulled total below mean CE).
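As a rough cross-check on these two averages, the mean row weight can be computed from the step counts. The sketch assumes one row per step (525 steps), that the rows not counted as holdout or long-poison all carry weight 1.0, and that avg_loss is the plain mean of weight × CE over steps; none of these is confirmed by the log itself.

```python
# Step counts from the report; the "plain" count is inferred as the remainder.
TOTAL_STEPS = 525
HOLDOUT_STEPS, HOLDOUT_W = 40, 2.0
LONG_POISON_STEPS, LONG_POISON_W = 157, 0.5
PLAIN_STEPS = TOTAL_STEPS - HOLDOUT_STEPS - LONG_POISON_STEPS  # rows at weight 1.0

mean_weight = (
    HOLDOUT_STEPS * HOLDOUT_W
    + LONG_POISON_STEPS * LONG_POISON_W
    + PLAIN_STEPS * 1.0
) / TOTAL_STEPS

observed_ratio = 0.5377 / 0.5429  # avg_loss / avg_ce from the log

print(f"mean weight    = {mean_weight:.4f}")     # mean weight    = 0.9267
print(f"observed ratio = {observed_ratio:.4f}")  # observed ratio = 0.9904
```

If every row had the same CE, avg_loss / avg_ce would equal the mean weight (about 0.927). The observed ratio is closer to 0.990. Under the assumptions above, that would be consistent with the upweighted holdout rows carrying above-average CE, the downweighted long-poison rows carrying below-average CE, or both.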

Headline results

| pass | dataset | trainer | anchor pass | gate |
|------|---------|---------|-------------|------|
| BD6.5 (peak) | bd6_5_mixed_train | bd6_5 | 15/19 | REVERT |
| BD6.8D | bd6_5_mixed_train (UNCHANGED) | bd6_8d (token-weighted CE) | 15/19 | REVERT |

Same anchor count, structurally different lost set:

| pass | KEPT | LOST |
|------|------|------|
| BD6.5 | 12 MBPP + 3 HE | MBPP/53, HE/34, HE/45, HE/85 |
| BD6.8D | 13 MBPP + 2 HE | HE/27, HE/34, HE/45, HE/85 |

MBPP/53 RECOVERED. All 13 MBPP anchors now pass. The targeted ×2.0 weight on the explicit holdout worked.

HE/27 lost. A previously stable HE anchor regressed because the long-poison ×0.5 downweighting starved the HumanEval poison signal (HE poison rows tend to be longer than MBPP rows, so they disproportionately got ×0.5).

HE/85 still lost. The other targeted holdout (366 chars, longest target in the set). ×2.0 wasn't enough. May need ×3-4.

HE/34, HE/45 still lost. These are decoder-noise per BD6.8F+ diagnostic — they're not fixable via training-time weighting at all (only by softening the production decoder, which BD6.8F+ proved isn't safe).

What this proves

The diagnostic from BD6.8F+ was correct. The 4 BD6.5 holdouts split 2+2:

MBPP/53: real weight gap, and BD6.8D fixed it via the ×2.0 targeted weight. ✓ proven fixable.

HE/85: also classified as a real weight gap, but ×2.0 wasn't enough. It needs more aggressive weighting, OR the LoRA's rank/capacity isn't sufficient for a 366-char Python target.

HE/34, HE/45: decoder noise. As expected, BD6.8D didn't touch them (they remain lost under the production decoder).

Token-weighted CE is a clean per-row gradient lever. It works exactly as designed: targeted anchors get more learning signal, targeted poison rows get less. No KL, no replication change, no dataset shape change.

The poison-long downweight (×0.5) is too coarse. It applies uniformly across HE+MBPP+LCB poison; HE poison is disproportionately long, so HE poison signal gets cut more than MBPP. This caused the HE/27 collateral. Future: per-bench poison weighting, not global length-only.

Per-row table

KEPT (15):
  ✓ MBPP/17, MBPP/19, MBPP/20, MBPP/41, MBPP/51, MBPP/52,
  ✓ MBPP/53 ← RECOVERED FROM BD6.5 HOLDOUT (token-weighted CE worked)
  ✓ MBPP/64, MBPP/90, MBPP/93, MBPP/96, MBPP/99, MBPP/105
  ✓ HumanEval/23, HumanEval/53

LOST (4):
  ✗ HumanEval/27   ← NEW REGRESSION (was kept in BD6.5)
  ✗ HumanEval/34   ← decoder-noise (BD6.8F+ analysis)
  ✗ HumanEval/45   ← decoder-noise (BD6.8F+ analysis)
  ✗ HumanEval/85   ← still weight gap, ×2.0 insufficient (366 char target)
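The shift between the two passes can be read off as set differences over the lost anchors. The sketch below uses only the anchor IDs listed in this report; the variable names are illustrative.

```python
# Lost anchors per pass, as listed in this report.
bd6_5_lost = {"MBPP/53", "HumanEval/34", "HumanEval/45", "HumanEval/85"}
bd6_8d_lost = {"HumanEval/27", "HumanEval/34", "HumanEval/45", "HumanEval/85"}

recovered = bd6_5_lost - bd6_8d_lost   # lost in BD6.5, kept in BD6.8D
regressed = bd6_8d_lost - bd6_5_lost   # kept in BD6.5, lost in BD6.8D
still_lost = bd6_5_lost & bd6_8d_lost  # lost in both passes

print("recovered: ", sorted(recovered))
print("regressed: ", sorted(regressed))
print("still lost:", sorted(still_lost))
```

One anchor moves in each direction, which is why the count stays at 15/19 while the composition of the lost set changes.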

What stays open after BD6.8D

Of the 4 BD6.5 holdouts, 1 is fixed (MBPP/53), 1 is still open (HE/85 needs more weight), and 2 are decoder-bound (HE/34, HE/45; they can't be fixed without breaking the production decoder). There is also 1 collateral regression (HE/27). All 4 remaining losses are on the HumanEval side, so any next attempt should focus on HE without breaking MBPP.

Numbers across full BD6.x cycle

| pass | trainer change | anchor | note |
|------|----------------|--------|------|
| BD6 pass-1 | poison v1 (no anchor) | 19/19 | KEEP |
| BD6.3 | fresh poison only | 0/19 | REVERT: catastrophic forgetting |
| BD6.4 | + 5× anchor replication | 7/19 | REVERT |
| BD6.5 | + bench-aware repl + stratified, 53 % anchor | 15/19 | REVERT (peak) |
| BD6.6 | + holdout × 50, 63 % anchor | 11/19 | REVERT: over-anchor |
| BD6.7a/b/c | + KL anchor ladder λ=0.10/0.20/0.05 | 12/10/12 | REVERT: KL redundant |
| BD6.8F | runtime determinism probe (no training) | n/a | diagnostic: 2+2 split |
| BD6.8F+ | (rep_penalty, ngram) decoder grid | n/a | diagnostic: no overlap |
| BD6.8D | token-weighted CE (no other changes) | 15/19 | REVERT: MBPP fixed, HE collateral |

Possible follow-ups (pending user GO; do not act)

BD6.8D2 — refine targeting

Two parallel adjustments to the same trainer:

Raise holdout_mult on HE/85 specifically to 3.0 or 4.0 (only HE/85; keep MBPP/53 at 2.0).

Replace the global long-poison ×0.5 with a per-bench poison weight: scale only MBPP-source long poison, keep HE-source poison at full 1.0 (preserves HE/27 etc).

Both are 5-line changes to the bd6_8d trainer. ~25 min training + gate.
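The two adjustments could look roughly like the following. This is a hypothetical sketch of a proposal that has not been run: the row fields (`kind`, `task_id`, `bench`, `target_tokens`) and the function name are assumptions, 3.0 is just one of the two proposed multipliers, and leaving LCB-source long poison at 1.0 is a reading of "scale only MBPP-source", not something the proposal spells out.

```python
# Hypothetical BD6.8D2 weighting; nothing here has been trained or gated.
HOLDOUT_MULT = {"MBPP/53": 2.0, "HumanEval/85": 3.0}  # 3.0 or 4.0 proposed for HE/85
LONG_POISON_WEIGHT = {"MBPP": 0.5}  # benches not listed keep weight 1.0
LONG_POISON_TOKENS = 100


def row_weight_d2(row: dict) -> float:
    """Per-holdout multiplier for anchors, per-bench downweight for long poison."""
    if row["kind"] == "anchor":
        return HOLDOUT_MULT.get(row["task_id"], 1.0)
    if row["kind"] == "poison" and row["target_tokens"] > LONG_POISON_TOKENS:
        # Only MBPP-source long poison is scaled; HE-source stays at full weight.
        return LONG_POISON_WEIGHT.get(row["bench"], 1.0)
    return 1.0
```

Keeping both knobs as dictionaries means a later pass can change a single holdout or a single bench without touching the rest of the policy.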

BD6.8D-rank — increase LoRA capacity

If HE/85 cannot be fixed even at ×4, the issue may be insufficient rank for a 366-char target. Try r=16, alpha=32 (2× the adapter param count, since LoRA parameters scale linearly with r; alpha/r stays at 2). ~30-40 min training + gate.
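For sizing the rank bump, a standard LoRA pair on one weight matrix has r × (d_in + d_out) trainable parameters, so the count grows linearly with r. The sketch below uses placeholder dimensions (1024 × 1024), which are not the real model's, and a hypothetical helper name.

```python
def lora_params(r: int, d_in: int, d_out: int) -> int:
    """Trainable parameters of one LoRA pair: A is (r, d_in), B is (d_out, r)."""
    return r * (d_in + d_out)


D_IN = D_OUT = 1024  # placeholder dimensions, not taken from the actual model

p8 = lora_params(8, D_IN, D_OUT)
p16 = lora_params(16, D_IN, D_OUT)

print(p16 / p8)         # 2.0: doubling r doubles the adapter parameters
print(16 / 8, 32 / 16)  # 2.0 2.0: the alpha / r scaling is the same in both configs
```

Because alpha / r is unchanged between (r=8, alpha=16) and (r=16, alpha=32), the adapter's effective scaling stays the same and rank is the only capacity variable being changed.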

BD6.9 — accept BD6.5 as ceiling

If neither D2 nor the rank bump moves HE/85, accept that BD6.5's 15/19 is the real anchor ceiling for this organ at this runtime+verifier configuration. Promote BD6.5 to "tier-1 alternative" behind a feature flag with explicit per-bench gating, never as default production.

Production state (after BD6.8D revert)

PHYS05_PACK = physarum05b_code_skeleton.planck (BD6 pass-1).
phys05_code_skeleton spec: rep=1.15, ng=2, cuda_rep=1.05 (unchanged).
Anchor 19/19 verified post-revert.
physarum05b_code_skeleton_v8.planck archived.
tools/surgery/output/code_skeleton_lora_v8/ archived (final + 5 mid-checkpoints @ 100, 200, 300, 400, 500).
tools/surgery/output/Physarum05B-CodeSkeleton-v8/ archived.

Files this pass touched

tools/surgery/train_code_skeleton_lora_bd6_8d.py — new, token-weighted CE trainer
physarum05b_code_skeleton_v8.planck — repacked (rejected, archived)
tools/surgery/output/code_skeleton_lora_v8/ — adapter + 5 checkpoints (rejected)
tools/surgery/output/Physarum05B-CodeSkeleton-v8/ — merged HF dir (rejected)
src/organs/organ_manager.cpp::PHYS05_PACK — flipped to v8 then back to v1
reports/BD6_8D_TOKEN_WEIGHTED_CE.md — this file

The targeted lever works. Strict gate still rejects. The remaining HE-side losses point at LoRA capacity, decoder fragility, and asymmetric poison weighting — three different problems at the boundary of what training-only surgery can do.