# BD6.8D — token-weighted CE, MBPP fully fixed but HE side collapses (2026-05-02)

**TL;DR: Token-weighted CE is the FIRST surgery pass to fully
recover MBPP/53** (the 256-char weight-gap holdout from BD6.5). All 13
MBPP anchors now pass. **But the HumanEval side regressed**: HE/27
(stable in BD6.5) is newly lost, and HE/34, HE/45, HE/85 remain lost.
Net 15/19, the same count as BD6.5 but a structurally different lost
set (0 MBPP + 4 HE losses, vs BD6.5's 1 MBPP + 3 HE). Strict gate
rejects. Production reverted.

The lever is real: targeted ×2.0 weight on MBPP/53 *did* fix it.
But ×2.0 wasn't enough for the longer HE/85 (366 chars), and the
long-poison ×0.5 downweight starved the HumanEval poison signal,
breaking HE/27. Follow-ups remain (BD6.8D2 with a stronger holdout
mult on HE/85 and per-bench poison weighting, or a LoRA rank bump),
but BD6.5 (15/19) remains the local peak by anchor count.

---

## What changed vs BD6.5

Only the trainer's per-row loss weighting. Dataset
(`bd6_5_mixed_train.jsonl`, 525 rows, 53 % anchor share, stratified)
and hyperparams (r=8, alpha=16, lr=5e-5, ep=1) **identical to BD6.5**.

Weighting policy:

| row class                                          | weight |
|----------------------------------------------------|--------|
| anchor in {MBPP/53, HumanEval/85} (the holdouts)   | **2.0** |
| anchor (other 17)                                  | 1.0    |
| poison + target_tokens > 100                       | **0.5** |
| poison + target_tokens ≤ 100                       | 1.0    |

Per-step loss is `out.loss × weight` (`out.loss` is HF mean CE), so that step's gradient scales by the same factor.
Stratified anchor/poison alternation as in BD6.5.
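
A minimal sketch of the weighting table above, assuming each training row carries a class, a task id, and a tokenized target length. The `Row` fields and the function names are hypothetical, chosen for illustration; they are not taken from `train_code_skeleton_lora_bd6_8d.py`.

```python
from dataclasses import dataclass

HOLDOUT_IDS = {"MBPP/53", "HumanEval/85"}  # the two targeted holdouts
HOLDOUT_MULT = 2.0
LONG_POISON_MULT = 0.5
LONG_POISON_TOKENS = 100  # threshold from the weighting table


@dataclass
class Row:
    kind: str           # "anchor" or "poison" (hypothetical field name)
    task_id: str        # e.g. "MBPP/53" (hypothetical field name)
    target_tokens: int  # tokenized target length


def row_weight(row: Row) -> float:
    """Return the per-row loss multiplier from the weighting table."""
    if row.kind == "anchor":
        return HOLDOUT_MULT if row.task_id in HOLDOUT_IDS else 1.0
    if row.kind == "poison" and row.target_tokens > LONG_POISON_TOKENS:
        return LONG_POISON_MULT
    return 1.0


def weighted_step_loss(mean_ce, row: Row):
    """Scale a row's mean CE; works for a float or a scalar tensor."""
    return mean_ce * row_weight(row)
```

In a training loop this would presumably be applied as `loss = weighted_step_loss(out.loss, row)` before `loss.backward()`, with one row per step.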

Counts during the epoch: 40 holdout steps at ×2.0 (MBPP/53 × 20 +
HE/85 × 20, since each holdout has 20 reps in
`bd6_5_mixed_train.jsonl`), and 157 long-poison steps at ×0.5.

Loss curve summary: avg_ce=0.5429, avg_loss=0.5377 (loss<ce because
the 157 long-poison steps at ×0.5 outweigh the 40 holdout steps at
×2.0, pulling the weighted total below mean CE).
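
As a sanity check on the direction of that gap, the mean row weight can be computed from the step counts. This sketch assumes one row per step (525 steps for 525 rows, 1 epoch) and that avg_ce and avg_loss are plain per-step means; neither assumption is stated in this report.

```python
TOTAL_STEPS = 525        # rows in bd6_5_mixed_train.jsonl, 1 epoch
HOLDOUT_STEPS = 40       # weight 2.0
LONG_POISON_STEPS = 157  # weight 0.5
UNIT_STEPS = TOTAL_STEPS - HOLDOUT_STEPS - LONG_POISON_STEPS  # weight 1.0

# Average multiplier applied across the epoch.
mean_weight = (
    HOLDOUT_STEPS * 2.0 + LONG_POISON_STEPS * 0.5 + UNIT_STEPS * 1.0
) / TOTAL_STEPS

# Ratio actually observed in the loss curve summary.
observed_ratio = 0.5377 / 0.5429

print(f"unit-weight steps: {UNIT_STEPS}")        # 328
print(f"mean row weight:   {mean_weight:.4f}")   # 0.9267
print(f"avg_loss/avg_ce:   {observed_ratio:.4f}")  # 0.9904
```

The mean weight is below 1.0, which matches the sign of the gap. The observed ratio need not equal the mean weight, because per-row CE is not uniform across row classes; the report does not break CE down by class.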

## Headline results

| pass | dataset | trainer | anchor pass | gate |
|------|---------|---------|-------------|------|
| BD6.5  (peak) | bd6_5_mixed_train | bd6_5 | 15/19 | REVERT |
| **BD6.8D**    | bd6_5_mixed_train (UNCHANGED) | **bd6_8d (token-weighted CE)** | **15/19** | **REVERT** |

Same anchor count, **structurally different lost set**:

| pass | KEPT | LOST |
|------|------|------|
| BD6.5  | 12 MBPP + 3 HE | MBPP/53, HE/34, HE/45, HE/85 |
| BD6.8D | **13 MBPP + 2 HE** | HE/27, HE/34, HE/45, HE/85 |

**MBPP/53 RECOVERED.** All 13 MBPP anchors now pass. The targeted
×2.0 weight on the explicit holdout worked.

**HE/27 lost.** A previously stable HE anchor regressed because
the long-poison ×0.5 downweighting starved the HumanEval poison
signal (HE poison rows tend to be longer than MBPP ones, so they
disproportionately got ×0.5).

**HE/85 still lost.** The other targeted holdout (366 chars,
longest target in the set). ×2.0 wasn't enough. May need ×3-4.

**HE/34, HE/45 still lost.** These are decoder-noise per BD6.8F+
diagnostic — they're not fixable via training-time weighting at
all (only by softening the production decoder, which BD6.8F+
proved isn't safe).
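
The four outcomes above reduce to set arithmetic over the two lost sets. A small sketch using the anchor ids from the KEPT/LOST table (variable names are illustrative):

```python
bd6_5_lost = {"MBPP/53", "HumanEval/34", "HumanEval/45", "HumanEval/85"}
bd6_8d_lost = {"HumanEval/27", "HumanEval/34", "HumanEval/45", "HumanEval/85"}

recovered = bd6_5_lost - bd6_8d_lost   # lost in BD6.5, kept in BD6.8D
regressed = bd6_8d_lost - bd6_5_lost   # kept in BD6.5, lost in BD6.8D
still_lost = bd6_5_lost & bd6_8d_lost  # lost in both passes

print("recovered:", sorted(recovered))
print("regressed:", sorted(regressed))
print("still lost:", sorted(still_lost))
```

One recovery and one regression cancel out, which is why the anchor count stays at 15/19 while the composition of the lost set changes.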

## What this proves

* **The diagnostic from BD6.8F+ was correct.** The 4 BD6.5 holdouts
  split 2+2:
  - **MBPP/53** — real weight gap, and BD6.8D fixed it via ×2.0
    targeted weight. ✓ proven fixable.
  - **HE/85** — also classified as real weight gap, but ×2.0 wasn't
    enough. Needs more aggressive weighting OR the LoRA's
    rank/capacity isn't sufficient for a 366-char Python target.
  - **HE/34, HE/45**: decoder noise. As expected, BD6.8D did not
    move them (they remain lost under the production decoder).
* **Token-weighted CE is a clean per-row gradient lever.** It works
  exactly as designed: targeted anchors get more learning signal,
  targeted poison rows get less. No KL, no replication change, no
  dataset shape change.
* **The poison-long downweight (×0.5) is too coarse.** It applies
  uniformly across HE+MBPP+LCB poison; HE poison is disproportionately
  long, so HE poison signal gets cut more than MBPP. This caused
  HE/27 collateral. Future: per-bench poison weighting, not global
  length-only.

## Per-row table

```
KEPT (15):
  ✓ MBPP/17, MBPP/19, MBPP/20, MBPP/41, MBPP/51, MBPP/52,
  ✓ MBPP/53 ← RECOVERED FROM BD6.5 HOLDOUT (token-weighted CE worked)
  ✓ MBPP/64, MBPP/90, MBPP/93, MBPP/96, MBPP/99, MBPP/105
  ✓ HumanEval/23, HumanEval/53

LOST (4):
  ✗ HumanEval/27   ← NEW REGRESSION (was kept in BD6.5)
  ✗ HumanEval/34   ← decoder-noise (BD6.8F+ analysis)
  ✗ HumanEval/45   ← decoder-noise (BD6.8F+ analysis)
  ✗ HumanEval/85   ← still weight gap, ×2.0 insufficient (366 char target)
```

## What stays open after BD6.8D

Of the 4 BD6.5 holdouts, 1 is fixed (MBPP/53), 1 is still lost but
classified as a fixable weight gap (HE/85 needs more weight), and 2
are decoder-bound (HE/34, HE/45: can't be fixed without breaking the
production decoder). Add 1 collateral regression (HE/27), and all 4
remaining losses are on the HumanEval side; any next attempt should
focus on the HE side without breaking MBPP.

## Numbers across full BD6.x cycle

| pass        | trainer change                                | anchor | note |
|-------------|-----------------------------------------------|--------|------|
| BD6 pass-1  | poison v1 (no anchor)                         | 19/19  | KEEP |
| BD6.3       | fresh poison only                             | 0/19   | REVERT — catastrophic forgetting |
| BD6.4       | + 5× anchor replication                       | 7/19   | REVERT |
| BD6.5       | + bench-aware repl + stratified, 53 % anchor  | **15/19** | REVERT (peak) |
| BD6.6       | + holdout × 50, 63 % anchor                   | 11/19  | REVERT — over-anchor |
| BD6.7a/b/c  | + KL anchor ladder λ=0.10/0.20/0.05           | 12/10/12 | REVERT — KL redundant |
| BD6.8F      | runtime determinism probe (no training)       | —      | diagnostic: 2+2 split |
| BD6.8F+     | (rep_penalty, ngram) decoder grid             | —      | diagnostic: no overlap |
| **BD6.8D**  | **token-weighted CE (no other changes)**      | **15/19** | **REVERT — MBPP fixed, HE collateral** |

## Possible follow-ups (pending user GO; do not act)

### BD6.8D2 — refine targeting

Two parallel adjustments to the same trainer:
1. Raise holdout_mult on HE/85 specifically to 3.0 or 4.0 (only HE/85,
   keep MBPP/53 at 2.0).
2. Replace the global long-poison ×0.5 with **per-bench** poison
   weight: scale only MBPP-source long poison, keep HE-source poison
   at full 1.0 (preserves HE/27 etc).

Both are 5-line changes to bd6_8d trainer. ~25 min training + gate.
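
A hedged sketch of what the two adjustments could look like as a weight function. The function name, argument names, and bench labels are hypothetical. The 3.0 for HE/85 is one of the two proposed values. The proposal does not say how LCB-source long poison should be treated; this sketch leaves it at 1.0 because only MBPP is named for scaling.

```python
LONG_POISON_TOKENS = 100

# Item 1: per-anchor holdout multipliers. HE/85 raised, MBPP/53 unchanged.
HOLDOUT_MULT = {"MBPP/53": 2.0, "HumanEval/85": 3.0}

# Item 2: per-bench long-poison multipliers. Benches not listed keep 1.0.
# Assumption: LCB stays at 1.0 since only MBPP is named for scaling.
LONG_POISON_MULT = {"MBPP": 0.5}


def row_weight_d2(kind: str, bench: str, task_id: str, target_tokens: int) -> float:
    """Per-row loss multiplier under the proposed BD6.8D2 policy."""
    if kind == "anchor":
        return HOLDOUT_MULT.get(task_id, 1.0)
    if kind == "poison" and target_tokens > LONG_POISON_TOKENS:
        return LONG_POISON_MULT.get(bench, 1.0)
    return 1.0
```

Keeping both policies in lookup tables means a later pass can change one multiplier without touching the control flow.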

### BD6.8D-rank — increase LoRA capacity

If HE/85 cannot be fixed even at 4×, the issue may be insufficient
rank for a 366-char target. Try r=16, alpha=32 (2× adapter param
count, since LoRA params scale linearly with rank; alpha/r stays
at 2). ~30-40 min training + gate.
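
For sizing the rank bump: a LoRA adapter adds two matrices per adapted weight, A (r × d_in) and B (d_out × r), so its parameter count is linear in r. The sketch below uses hypothetical layer shapes purely for illustration; the actual target modules and dimensions are not stated in this report.

```python
def lora_param_count(rank: int, shapes: list[tuple[int, int]]) -> int:
    """Sum rank * (d_in + d_out) over every adapted weight matrix."""
    return sum(rank * (d_in + d_out) for d_in, d_out in shapes)


# Hypothetical (d_in, d_out) pairs, NOT the real model's shapes.
shapes = [(1024, 1024), (1024, 256), (1024, 256), (1024, 1024)]

p8 = lora_param_count(8, shapes)
p16 = lora_param_count(16, shapes)

print(f"r=8:  {p8} adapter params")
print(f"r=16: {p16} adapter params ({p16 / p8:.1f}x)")
print(f"alpha/r at r=8, alpha=16:  {16 / 8}")
print(f"alpha/r at r=16, alpha=32: {32 / 16}")
```

Because alpha/r is unchanged, the adapter's output scaling stays the same and only capacity changes, which keeps the comparison against BD6.8D clean.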

### BD6.9 — accept BD6.5 as ceiling

If neither D2 nor the rank bump moves HE/85, accept that BD6.5's
15/19 is the *real* anchor ceiling for this organ at this
runtime+verifier configuration. Promote BD6.5 to "tier-1 alternative"
behind a feature flag with explicit per-bench gating, never as
default production.

## Production state (after BD6.8D revert)

* `PHYS05_PACK = physarum05b_code_skeleton.planck` (BD6 pass-1).
* phys05_code_skeleton spec: rep=1.15, ng=2, cuda_rep=1.05 (unchanged).
* Anchor 19/19 verified post-revert.
* `physarum05b_code_skeleton_v8.planck` archived.
* `tools/surgery/output/code_skeleton_lora_v8/` archived (final + 5 mid-checkpoints @ 100, 200, 300, 400, 500).
* `tools/surgery/output/Physarum05B-CodeSkeleton-v8/` archived.

## Files this pass touched

* `tools/surgery/train_code_skeleton_lora_bd6_8d.py` — new, token-weighted CE trainer
* `physarum05b_code_skeleton_v8.planck` — repacked (rejected, archived)
* `tools/surgery/output/code_skeleton_lora_v8/` — adapter + 5 checkpoints (rejected)
* `tools/surgery/output/Physarum05B-CodeSkeleton-v8/` — merged HF dir (rejected)
* `src/organs/organ_manager.cpp::PHYS05_PACK` — flipped to v8 then back to v1
* `reports/BD6_8D_TOKEN_WEIGHTED_CE.md` — this file

The targeted lever works. Strict gate still rejects. The remaining
HE-side losses point at LoRA capacity, decoder fragility, and
asymmetric poison weighting — three different problems at the
boundary of what training-only surgery can do.
