BD6.8F+ — decoder grid: no sweet spot exists (2026-05-02)
TL;DR — at the (cuda_rep_penalty, no_repeat_ngram) granularity, NO decoder spec keeps production at 19/19 AND lifts BD6.5 above 15/19. The two requirements are mutually exclusive: production needs cuda_rep ≥ 1.05 to pass anchor; BD6.5 only gains anchors when cuda_rep ≤ 1.00. The intervals don't overlap. Lever F+ is closed.
Next step is BD6.8D (token-weighted CE on BD6.5 dataset shape) — that's the path to recover MBPP/53 + HE/85 (the two real weight-gap holdouts) without changing decoder.
Grid tested
CPU repetition_penalty held at 1.15 (no effect on CUDA backend used by phys05_code_skeleton). Varied cuda_rep_penalty × no_repeat_ngram:
| cell | cuda_rep | ngram | prod pack | v5 pack | note |
|------|----------|-------|-----------|---------|------|
| C0 (current) | 1.05 | 2 | 19/19 | 15/19 | production decoder, also BD6.5's measurement |
| C1 | 1.05 | 0 | 19/19 | 15/19 | drop ngram only: prod safe, v5 unchanged |
| C2 | 1.02 | 2 | 4/19 | 15/19 | soften cuda_rep slightly: prod collapses |
| C4 | 1.00 | 2 | 1/19 | 16/19 | kill cuda_rep, keep ngram: prod cratered, v5 +1 |
| C5 (greedy) | 1.00 | 0 | 1/19 | 16/19 | pure greedy: prod cratered, v5 +1 |
(Cell C3 = (1.02, 0) was skipped: at cuda_rep=1.02 production already fell to 4/19 in C2, so dropping ngram on top wouldn't recover prod.)
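The grid can also be checked mechanically. The following is a minimal sketch, not one of the probe scripts: `Cell` and `sweet_spots` are hypothetical names, the numbers are copied from the table above, and the thresholds (prod 19/19, v5 at least 17/19) are the ones stated in the merge gate of this report.

```python
from typing import NamedTuple


class Cell(NamedTuple):
    name: str
    cuda_rep: float
    ngram: int
    prod_pass: int  # production pack anchors passed, out of 19
    v5_pass: int    # v5 (BD6.5) pack anchors passed, out of 19


# Results copied from the grid table (C3 was skipped, so it is absent).
GRID = [
    Cell("C0", 1.05, 2, 19, 15),
    Cell("C1", 1.05, 0, 19, 15),
    Cell("C2", 1.02, 2, 4, 15),
    Cell("C4", 1.00, 2, 1, 16),
    Cell("C5", 1.00, 0, 1, 16),
]


def sweet_spots(grid, prod_min=19, v5_min=17):
    """Return the cells that satisfy both constraints at once."""
    return [c for c in grid if c.prod_pass >= prod_min and c.v5_pass >= v5_min]


prod_safe = [c.name for c in GRID if c.prod_pass == 19]  # cells where prod holds
v5_gain = [c.name for c in GRID if c.v5_pass > 15]       # cells where v5 improves
print("prod safe:", prod_safe)
print("v5 gain:  ", v5_gain)
print("sweet spots:", sweet_spots(GRID))
```

The two name lists share no cell, and `sweet_spots` stays empty even if the v5 threshold is relaxed to 16.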
Two non-overlapping intervals
                     cuda_rep_penalty
          ╔══════ 1.05 ══════╗  ╔═ 1.02 ═╗  ╔══ 1.00 ══╗
    prod  ║  19/19 (safe)    ║  ║  4/19  ║  ║   1/19   ║
          ╚══════════════════╝  ╚════════╝  ╚══════════╝
          ╔══════ 1.05 ══════╗  ╔═ 1.02 ═╗  ╔══ 1.00 ══╗
    v5    ║  15/19 (no gain) ║  ║ 15/19  ║  ║  16/19   ║
          ╚══════════════════╝  ╚════════╝  ╚══════════╝
                        ↑ no overlap ↑
             where prod = 19/19 AND v5 ≥ 17/19
There is no value of cuda_rep_penalty among those we tested (1.00, 1.02, 1.05) that satisfies both constraints simultaneously.
Why this is decisive
- Production BD6 pass-1 is essentially a "hot mode" of the donor
weights: the model only behaves correctly when cuda_rep_penalty is biting hard enough to suppress repetition. Lower it even to 1.02 and the model loops or repeats, and its output fails to compile.
- BD6.5 weights have learned to be less repetition-prone: they
generalize the anchor patterns enough that pure greedy works for most of them. Of the residual 4 holdouts, HE/34 and HE/45 fail under the production decoder because cuda_rep_penalty=1.05 perturbs the long-anchor argmax just enough to lose the pattern; MBPP/53 and HE/85 fail under both decoders.
- These are two different operating regimes. Production lives in
the hot regime (rep-pen mandatory); BD6.5 lives in the cool regime (rep-pen unhelpful or harmful). One decoder cannot serve both packs.
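To make the "perturbs the argmax" mechanism concrete, here is a minimal sketch of a repetition penalty in the common CTRL-style formulation (positive logits of already-seen tokens are divided by the penalty, negative ones multiplied). Whether the CUDA backend used by phys05_code_skeleton applies exactly this formula is an assumption, and the logits are invented for illustration; they are not measurements from this probe.

```python
def apply_rep_penalty(logits, seen, penalty):
    """CTRL-style penalty: push the logits of already-seen tokens down."""
    out = list(logits)
    for t in set(seen):
        # Dividing a positive logit or multiplying a negative one both lower it.
        out[t] = out[t] / penalty if out[t] > 0 else out[t] * penalty
    return out


def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)


# Invented example: token 0 is the correct continuation and was already
# emitted earlier in the sequence; token 1 is a close runner-up.
logits = [10.0, 9.7, 2.0]
seen = [0]

for penalty in (1.00, 1.02, 1.05):
    picked = argmax(apply_rep_penalty(logits, seen, penalty))
    print(f"penalty={penalty:.2f} -> token {picked}")
```

With these numbers the pick stays on token 0 at 1.00 and 1.02 and flips to token 1 at 1.05. The same suppression is what keeps a loop-prone model from re-emitting a token it has just produced.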
Per-row verdict on the 4 BD6.5 holdouts (re-confirmed)
From combined BD6.8F + BD6.8F+ data:
| holdout | C0 prod | C0 v5 | C5 v5 (greedy) | verdict |
|---------|---------|-------|----------------|---------|
| MBPP/53 | ✓ | ✗ | ✗ | real weight gap (both decoders fail v5) |
| HumanEval/34 | ✓ | ✗ | ✓ | decoder-noise (greedy fixed v5) |
| HumanEval/45 | ✓ | ✗ | ✓ | decoder-noise (greedy fixed v5) |
| HumanEval/85 | ✓ | ✗ | ✗ | real weight gap (both decoders fail v5) |
- HE/34, HE/45 — the LoRA actually has the right weights; only the
decoder change recovers them. But that decoder change costs production.
- MBPP/53, HE/85 — these are the targets BD6.8D (token-weighted CE)
must address.
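The verdict column follows a simple rule, sketched below. The `verdict` helper and the dictionary layout are hypothetical; the pass/fail values are the v5 columns of the table above.

```python
# Pass/fail of each BD6.5 holdout on the v5 pack, copied from the table.
# c0 = production decoder (cuda_rep=1.05, ngram=2); c5 = pure greedy (1.00, 0).
HOLDOUTS = {
    "MBPP/53":      {"c0": False, "c5": False},
    "HumanEval/34": {"c0": False, "c5": True},
    "HumanEval/45": {"c0": False, "c5": True},
    "HumanEval/85": {"c0": False, "c5": False},
}


def verdict(c0_pass, c5_pass):
    """Classify a row by which decoder, if any, lets the v5 weights pass."""
    if c0_pass:
        return "not a holdout"
    # Fails under production decoding: greedy tells us whether the weights are fine.
    return "decoder-noise" if c5_pass else "real weight gap"


verdicts = {name: verdict(r["c0"], r["c5"]) for name, r in HOLDOUTS.items()}
bd6_8d_targets = sorted(n for n, v in verdicts.items() if v == "real weight gap")
print(bd6_8d_targets)
```

Only the rows that fail under both decoders end up in `bd6_8d_targets`.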
Decision per spec
Do not merge v5 unless:
- production config is safe ← requires cuda_rep ≥ 1.05
- v5 anchor = 19/19, OR ≥ 17/19 with explicit user approval
Best v5 result at any production-safe spec: 15/19. Best v5 result at any spec at all: 16/19 (greedy, prod cratered). Neither passes the gate. v5 is NOT merged. Production stays unchanged.
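The gate can be written as a small predicate. This is a sketch under one assumption: "production config is safe" is modeled as the production pack scoring 19/19 under the candidate decoder spec. `merge_allowed` is a hypothetical name, not an existing tool.

```python
def merge_allowed(prod_pass, v5_pass, user_approved=False, total=19):
    """Merge gate: prod must stay perfect; v5 must be perfect or >= 17 with approval."""
    prod_safe = prod_pass == total
    v5_ok = v5_pass == total or (v5_pass >= 17 and user_approved)
    return prod_safe and v5_ok


# Best production-safe spec (C0/C1) and best v5 spec overall (C4/C5, greedy).
print(merge_allowed(prod_pass=19, v5_pass=15))                      # production-safe cells
print(merge_allowed(prod_pass=1, v5_pass=16, user_approved=True))   # greedy cells
```

Both calls return `False`, matching the decision above: neither the best production-safe result nor the best v5 result passes the gate.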
Recommended next step
BD6.8D — token-weighted CE, no other changes.
Specifically, modify tools/surgery/train_code_skeleton_lora_bd6_5.py loss step:
# current
total_loss = out.loss
# new (BD6.8D)
seq_len = (labels != -100).sum().clamp(min=1).float()
total_loss = out.loss / seq_len.sqrt()
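This scales each step's loss by 1/sqrt(n), where n is the number of supervised label tokens (those not equal to -100), so long targets contribute less per gradient step. The intent is that they don't dominate early training and force the LoRA into a pattern-memorization mode that drifts under poison pressure.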
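The effect of the scaling is easiest to see with plain numbers. The sketch below mirrors the two-line change in pure Python, without torch; `weighted_loss` is a hypothetical helper, and it assumes `out.loss` is already a mean cross-entropy over the supervised tokens. Note that `(labels != -100).sum()` counts supervised tokens in whatever tensor `labels` holds, so with more than one row per batch it is a batch total rather than a per-sequence length.

```python
import math

IGNORE_INDEX = -100  # label value excluded from the loss, as in the snippet above


def weighted_loss(mean_ce, labels):
    """Divide a mean CE loss by sqrt(number of supervised tokens), floor of 1."""
    n = max(1, sum(1 for y in labels if y != IGNORE_INDEX))
    return mean_ce / math.sqrt(n)


# Two targets with the same mean CE but different supervised lengths.
short = [IGNORE_INDEX] * 8 + [1] * 16    # 16 supervised tokens
long = [IGNORE_INDEX] * 8 + [1] * 256    # 256 supervised tokens

print(weighted_loss(2.0, short))  # 2.0 / sqrt(16)
print(weighted_loss(2.0, long))   # 2.0 / sqrt(256)
```

At equal mean CE, a target 16 times longer is scaled down 4 times further, so the down-weighting grows with the square root of the length rather than linearly.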
Use BD6.5 dataset shape unchanged (bd6_5_mixed_train.jsonl, 525 rows, 53 % anchor share, stratified).
DO NOT add KL (BD6.7 ladder showed it's redundant). DO NOT increase replication (BD6.6 showed saturation at 53 %). DO NOT change decoder for production (BD6.8F+ proved no overlap).
Production state (after BD6.8F+ revert)
- PHYS05_PACK = physarum05b_code_skeleton.planck (BD6 pass-1).
- phys05_code_skeleton spec: rep=1.15, ng=2, cuda_rep=1.05 (restored).
- Anchor 19/19 verification scheduled post-revert (see verify line at
end of this report or grep [anchor] 19/19 pass in the run logs).
Files this probe touched
- src/organs/organ_manager.cpp: phys05_code_skeleton add05() spec edited 4× during grid, restored to (1.15, 2, 1.05) at end
- src/organs/organ_manager.cpp::PHYS05_PACK: flipped prod↔v5 8×, restored to prod
- /tmp/bd6_8f_plus_grid_summary.txt: raw per-cell results
- /tmp/bd6_8f_plus_*.log: per-cell anchor logs
- /tmp/bd6_8f_plus_grid.sh, /tmp/bd6_8f_plus_cell.sh: orchestration scripts
- reports/BD6_8F_PLUS_DECODER_GRID.md: this file
No data files written, no LoRA produced, no .planck repacked.
What this proves
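- **The (cuda_rep_penalty, no_repeat_ngram) lever is a binary cliff,
not a spectrum.** cuda_rep=1.05 holds production at 19/19; the next tested value, 1.02, drops it to 4/19, and 1.00 drops it to 1/19. Toggling no_repeat_ngram changed neither pack at cuda_rep 1.05 or 1.00. There is no graceful intermediate among the tested values.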
- BD6.5 has 4 anchor losses with two distinct causes: 2 are
decoder-noise (HE/34, HE/45: they pass under a softer decoder, which prod cannot tolerate), 2 are real weight gaps (MBPP/53, HE/85: they fail under both decoders). Lever F+ cannot fix the first pair without cratering production and cannot fix the second pair at all.
- PHYS05_DECODER_LOCKED: any future runtime change that touches
cuda_rep_penalty for phys05_code_skeleton risks cratering production. Worth flagging in docs/CURRENT_TRUTH_LEDGER.md.
The remaining lever for closing MBPP/53 and HE/85 is BD6.8D (token-weighted CE). Awaiting GO.