# BD6.8D2 — token-weighted CE refinement OVERTUNED, peak is BD6.8D (2026-05-02)

**TL;DR — refining BD6.8D's weighting policy from {MBPP/53 ×2, HE/85
×2, all-long-poison ×0.5} to {MBPP/53 ×2, HE/85 ×4, MBPP-poison-long
×0.5, HE-poison ×1.0} dropped anchor pass-rate from 15/19 → 13/19 AND
regressed MBPP/53 (which BD6.8D had recovered). Six rows lost in v8b.
Strict gate rejects. Production reverted. The BD6.x cycle's pattern
holds: each lever has a saturation point, and pushing beyond it
destabilizes. BD6.8D's modest weighting {2/2/0.5} = 15/19 was the peak
of the token-weighted CE lever — exactly mirroring how BD6.5 was the
peak of replication and BD6.7 KL had no useful rung. Training-only
levers are now exhausted.**

---

## What changed vs BD6.8D

Per-row weight policy refined per user spec:

| row class                                    | BD6.8D | BD6.8D2 |
|----------------------------------------------|--------|---------|
| anchor MBPP/53                               | 2.0    | 2.0     |
| anchor HumanEval/85                          | 2.0    | **4.0** |
| anchor (other 17)                            | 1.0    | 1.0     |
| poison MBPP-class + tokens > 100 (long)      | 0.5    | 0.5     |
| poison HumanEval-class                       | 0.5 (if long) | **1.0 (always)** |
| poison short                                 | 1.0    | 1.0     |
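
The two policies in the table can be written as small per-row weight functions. This is a minimal sketch, not the trainer's actual code: the `Row` fields, the function names, and the reading of "long" as strictly more than 100 tokens are assumptions made for illustration.

```python
from dataclasses import dataclass

LONG_TOKEN_THRESHOLD = 100  # "tokens > 100 (long)" from the table


@dataclass(frozen=True)
class Row:
    """Hypothetical row record; field names are illustrative only."""
    kind: str      # "anchor" or "poison"
    bench: str     # "MBPP" or "HumanEval"
    task_id: int   # e.g. 53 for MBPP/53
    n_tokens: int  # target length in tokens


def _is_long(row: Row) -> bool:
    return row.n_tokens > LONG_TOKEN_THRESHOLD


def weight_bd6_8d(row: Row) -> float:
    """BD6.8D: MBPP/53 x2, HE/85 x2, every long poison row x0.5."""
    if row.kind == "anchor":
        boosted = {("MBPP", 53), ("HumanEval", 85)}
        return 2.0 if (row.bench, row.task_id) in boosted else 1.0
    return 0.5 if _is_long(row) else 1.0


def weight_bd6_8d2(row: Row) -> float:
    """BD6.8D2: HE/85 raised to x4, HumanEval poison always x1.0."""
    if row.kind == "anchor":
        if (row.bench, row.task_id) == ("MBPP", 53):
            return 2.0
        if (row.bench, row.task_id) == ("HumanEval", 85):
            return 4.0
        return 1.0
    if row.bench == "HumanEval":
        return 1.0  # no long-row downweight on the HE side
    return 0.5 if _is_long(row) else 1.0
```

The only rows whose weight differs between the two functions are the HE/85 anchor and long HumanEval poison rows, which matches the two bolded cells in the table.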

Hyperparams identical: r=8, alpha=16, lr=5e-5, ep=1, ckpt-step=100.
Dataset `bd6_5_mixed_train.jsonl` unchanged.

Trainer counts during epoch:
* HE/85 anchor ×4: 20 visits
* MBPP/53 anchor ×2: 20 visits
* MBPP-class long poison ×0.5: 12 visits
* HE poison ×1.0: 158 visits
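
Multiplying each weight by its visit count gives a rough picture of how much weighted signal each class contributed during the epoch. This is back-of-envelope arithmetic on the counts above, assuming each visit contributes in proportion to its weight and ignoring differences in target length and per-row CE.

```python
# (weight, visits) per row class, taken from the trainer counts above.
counts = {
    "HE/85 anchor": (4.0, 20),
    "MBPP/53 anchor": (2.0, 20),
    "MBPP long poison": (0.5, 12),
    "HE poison": (1.0, 158),
}

# Weighted mass = weight * visits (assumes equal per-visit loss).
mass = {name: weight * visits for name, (weight, visits) in counts.items()}

for name, value in sorted(mass.items(), key=lambda kv: -kv[1]):
    print(f"{name:18s} {value:6.1f}")
```

Under these assumptions the single HE/85 row carries a weighted mass of 80, about half of the 158 contributed by all HE-poison visits combined.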

avg_ce=0.5390 (close to BD6.8D's 0.5429), avg_loss=0.6008 (higher than
BD6.8D's 0.5377 because the ×4 on HE/85 dragged the weighted total up).
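
One reading consistent with these numbers is that `avg_ce` averages the raw per-row cross-entropy while `avg_loss` averages the weighted value without renormalizing by the weights. The sketch below illustrates that relationship on made-up per-row CE values. The normalization choice, the function names, and the numbers are all assumptions, not the trainer's confirmed behavior.

```python
def avg_ce(ces):
    """Plain mean of per-row cross-entropy."""
    return sum(ces) / len(ces)


def avg_weighted_loss(ces, weights):
    """Mean of w_i * ce_i, divided by row count (not by the weight sum)."""
    return sum(w * c for w, c in zip(weights, ces)) / len(ces)


# Toy values: one hard row at x4, one downweighted long-poison row at x0.5.
toy_ces = [0.5, 0.5, 0.9, 0.4]
toy_weights = [1.0, 1.0, 4.0, 0.5]

print(avg_ce(toy_ces), avg_weighted_loss(toy_ces, toy_weights))
```

With this definition, upweighting a high-CE row pushes the weighted average above the raw one, while a policy dominated by downweights pulls it below, which is the direction seen in both BD6.8D2 and BD6.8D.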

## Headline result

| pass | trainer | anchor pass | LOST |
|------|---------|-------------|------|
| BD6.5 (peak) | bd6_5 (no weights) | 15/19 | MBPP/53, HE/34, HE/45, HE/85 |
| BD6.8D       | bd6_8d {2/2/0.5}   | **15/19** | HE/27, HE/34, HE/45, HE/85 (MBPP/53 RECOVERED) |
| **BD6.8D2**  | bd6_8d2 {2/4/0.5,1.0} | **13/19** | MBPP/53, HE/23, HE/27, HE/34, HE/45, HE/85 |

**MBPP/53 regressed back to lost.** The very holdout BD6.8D recovered
fell again under the refined policy.
**HE/23 newly regressed** — a previously-stable HE anchor.
**HE/85 still lost** — even ×4 wasn't enough.
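
The regressions and recoveries between passes follow directly from set differences on the LOST column. A small sketch using the lost sets from the table above; the dictionary and function names are illustrative.

```python
TOTAL_ANCHORS = 19

# Lost anchor rows per pass, copied from the headline table.
LOST = {
    "BD6.5": {"MBPP/53", "HE/34", "HE/45", "HE/85"},
    "BD6.8D": {"HE/27", "HE/34", "HE/45", "HE/85"},
    "BD6.8D2": {"MBPP/53", "HE/23", "HE/27", "HE/34", "HE/45", "HE/85"},
}


def pass_count(name):
    """Anchor rows passed = total anchors minus lost rows."""
    return TOTAL_ANCHORS - len(LOST[name])


def diff(prev, cur):
    """Rows newly lost and rows recovered when moving from prev to cur."""
    return {
        "newly_lost": sorted(LOST[cur] - LOST[prev]),
        "recovered": sorted(LOST[prev] - LOST[cur]),
    }


print(pass_count("BD6.8D2"), diff("BD6.8D", "BD6.8D2"))
```

Relative to BD6.8D, the refined policy loses MBPP/53 and HE/23 and recovers nothing.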

## Why the refinement made things worse

Two weights changed at once (HE/85 anchor and HE-poison), producing
three effects that each pull in a slightly different direction:

1. **HE/85 ×4.0** — pulled gradient very strongly toward this one
   row's pattern. With only ~5 HE-anchor copies in the stratified
   stream, this concentrated the LoRA's HE-side capacity onto a
   single very-long target. Side effect: HE/23 (previously easy)
   drifted because its pattern got crowded out.
2. **HE-poison ×1.0** (was ×0.5 for long rows in BD6.8D): restored
   the full poison signal on the HE side, which BD6.8D2 intended as a
   way to avoid the HE/27 collateral loss. But this also restored the
   full poison-vs-anchor tension on HE, which, combined with HE/85's
   ×4 demand, forces the LoRA into a tighter regime than it can
   represent at r=8.
3. **MBPP/53 ×2.0 without the global long-poison downweight**: the
   MBPP anchor's ×2.0 boost was less effective because the
   long-poison downweight no longer suppressed the competing
   HE-poison signal.

In short: BD6.8D's policy worked by making the LoRA *less ambitious
overall* (tighter cap on long-poison everywhere). BD6.8D2's policy
asks the LoRA to be *more ambitious* (full HE-poison signal) AND
*more focused* (×4 on HE/85) at once. r=8 capacity isn't enough for
both demands.

## Pattern across the BD6.x cycle (now decisive)

Every lever the cycle has tried follows the same shape: **one
moderate setting works, refining it breaks it.**

| lever            | peak setting          | over-tuned next step | cost |
|------------------|-----------------------|----------------------|------|
| replication      | 53 % stratified (BD6.5) | 63 % w/ holdout×50 (BD6.6) | 15 → 11 |
| KL distillation  | (none — λ=0 was best) | λ=0.05/0.10/0.20 (BD6.7) | 15 → 12,12,10 |
| token-weighted CE | {2,2,0.5} (BD6.8D)    | {2,4,0.5,1.0} (BD6.8D2) | 15 → 13 |

**The local maximum sits at exactly 15/19** under any single
training-only lever. The 4 holdouts are not all training-soluble:
- HE/34, HE/45 are decoder-noise (BD6.8F+ proven)
- MBPP/53 is training-soluble (BD6.8D proved with ×2.0)
- HE/85 has resisted both ×2.0 and ×4.0, and ×4.0 actually broke
  the rest of the system around it

We appear to be at the **r=8 LoRA capacity ceiling** for this organ at
this runtime configuration. Further training-only experiments at this
rank are not expected to move the needle.

## Per-row v8b table

```
KEPT (13):
  ✓ MBPP/17, MBPP/19, MBPP/20, MBPP/41, MBPP/51, MBPP/52,
  ✓ MBPP/64, MBPP/90, MBPP/93, MBPP/96, MBPP/99, MBPP/105
  ✓ HumanEval/53

LOST (6):
  ✗ MBPP/53        ← REGRESSED FROM BD6.8D recovery
  ✗ HumanEval/23   ← NEW REGRESSION (was stable in BD6.5/8D)
  ✗ HumanEval/27   ← still lost
  ✗ HumanEval/34   ← decoder-noise
  ✗ HumanEval/45   ← decoder-noise
  ✗ HumanEval/85   ← still lost (×4 insufficient)
```

## What stays open

If the user wants to pursue this further, only one orthogonal lever
remains:

### BD6.8D-rank — increase LoRA capacity
* r=16, alpha=32 (2× adapter param count, since LoRA parameters scale linearly with r)
* same {MBPP/53 ×2, HE/85 ×2} weight policy as BD6.8D (not D2)
* same dataset
* tests whether r=8 was the *capacity* ceiling

If r=16 doesn't move HE/85, the ceiling is truly architecture +
verifier + decoder, not training-side.
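
For sizing the rank experiment: LoRA adds two low-rank factors per adapted matrix, of shapes `r × d_in` and `d_out × r`, so the adapter parameter count is `r · (d_in + d_out)` and grows linearly with rank. A minimal sketch of that arithmetic follows. The 896-wide square matrix is a hypothetical placeholder, not this organ's actual shape, and the `alpha / r` scale assumes the standard LoRA scaling.

```python
def lora_params(r, d_in, d_out):
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


def lora_scale(alpha, r):
    """Standard LoRA scaling factor applied to the low-rank update."""
    return alpha / r


D = 896  # hypothetical hidden size, for illustration only

params_r8 = lora_params(8, D, D)
params_r16 = lora_params(16, D, D)

print(params_r16 / params_r8)                  # parameter ratio between ranks
print(lora_scale(16, 8), lora_scale(32, 16))   # effective scale at each setting
```

Keeping alpha at twice the rank leaves the effective scale unchanged, so the proposed run varies capacity without also changing the update magnitude.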

### Otherwise: ship

* Production stays at BD6 pass-1 (anchor 19/19 under current decoder).
* BD6.5 is documented as the best-known LoRA artifact (15/19; weights
  are cleaner than the gate shows under greedy decoding).
* MBPP B = 13/100, HE B = 6/164, LCB = 0/50 stand as the bench numbers.

## Production state (after BD6.8D2 revert)

* `PHYS05_PACK = physarum05b_code_skeleton.planck` (BD6 pass-1).
* phys05_code_skeleton spec: rep=1.15, ng=2, cuda_rep=1.05 (unchanged).
* `physarum05b_code_skeleton_v8b.planck` archived.
* `tools/surgery/output/code_skeleton_lora_v8b/` archived (final + 5 mid-checkpoints).
* `tools/surgery/output/Physarum05B-CodeSkeleton-v8b/` archived (rejected merged HF dir).

## Files this pass touched

* `tools/surgery/train_code_skeleton_lora_bd6_8d2.py` — new, refined trainer
* `physarum05b_code_skeleton_v8b.planck` — repacked (rejected)
* `tools/surgery/output/code_skeleton_lora_v8b/` — adapter + ckpts (rejected)
* `tools/surgery/output/Physarum05B-CodeSkeleton-v8b/` — merged HF dir (rejected)
* `src/organs/organ_manager.cpp::PHYS05_PACK` — flipped to v8b then back to v1
* `reports/BD6_8D2_OVER_TUNED.md` — this file

## What this proves

* **Token-weighted CE has a saturation point**, exactly like
  replication and (effectively) like KL. Pushing past it breaks the
  LoRA's anchor consistency more than it helps the targeted holdout.
* **r=8 LoRA capacity is the binding constraint.** Three different
  training-side levers all peak at 15/19. Either the rank needs to
  go up, or the runtime/verifier side has the remaining headroom.
* **The strict gate has held seven times in a row** (BD6.3, .4, .5,
  .6, .7, .8D, .8D2). Production has stayed at 19/19 throughout.
  The gate is doing exactly what it was designed for.
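
The gate's exact criterion is not spelled out in this report. One reading consistent with the outcomes is that a candidate must pass every anchor row that production currently passes. A minimal sketch under that assumption, with hypothetical names throughout:

```python
def strict_gate(production_pass, candidate_pass):
    """Accept only if the candidate regresses no row that production passes.

    Returns (accepted, regressions). This criterion is an assumption.
    """
    regressions = sorted(production_pass - candidate_pass)
    return (not regressions, regressions)


# Anchor rows from the per-row v8b table.
KEPT = {f"MBPP/{n}" for n in (17, 19, 20, 41, 51, 52, 64, 90, 93, 96, 99, 105)}
KEPT.add("HumanEval/53")
LOST_V8B = {"MBPP/53"} | {f"HumanEval/{n}" for n in (23, 27, 34, 45, 85)}

production = KEPT | LOST_V8B   # production passes all 19 anchors
candidate = KEPT               # v8b passes 13

accepted, regressions = strict_gate(production, candidate)
print(accepted, len(regressions))
```

Under this reading, any candidate below 19/19 is rejected regardless of which rows it recovers relative to an earlier LoRA pass.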

The cleanest read of the BD6.x cycle: BD6.5's stratified, lr=5e-5,
no-KL, unweighted recipe is the best **training-only** result
achievable at r=8/alpha=16, matched but not beaten by BD6.8D's
{2/2/0.5} weighting at 15/19. Anything beyond requires either a rank
bump or decoder/verifier surgery, neither of which is a
"finish BD6" task.