# BD6.8F — runtime determinism diagnostic, the surprise is in production (2026-05-02)

**TL;DR — the question was "does greedy decode fix the 4 BD6.5 holdouts?".
The answer is split (2 yes, 2 no). The bigger discovery: PRODUCTION
BD6 pass-1 collapses from 19/19 to 1/19 under pure greedy. anchor_positive.jsonl
was captured WITH `rep_penalty=1.15 + no_repeat_ngram=2 + cuda_rep_penalty=1.05`
in the loop, so the gate's "19/19" is a co-engineered weights+decoder
measurement, not a pure-weights one. BD6.5 weights are actually MORE
robust under greedy (16/19) than the production pack (1/19). This
reframes the entire BD6.x cycle: "weights vs decoder" is a tangled
diagnosis, and surgery alone cannot win without considering decoder
co-design.**

---

## What was tested

No training. No LoRA. No merge. No architecture change. Just four
runtime configurations, each running the same 19 anchor prompts via
the production `anchor_eval.py` harness:

```
                       PHYS05 PACK            sampler spec for phys05_code_skeleton
                  ┌─────────────────────┬─────────────────────────────────────┐
A) prod / current │ BD6 pass-1 (.planck)│ rep_pen=1.15 ngram=2 cuda_rep=1.05  │
B) prod / greedy  │ BD6 pass-1 (.planck)│ rep_pen=1.00 ngram=0 cuda_rep=1.00  │
C) v5   / current │ BD6.5      (.planck)│ rep_pen=1.15 ngram=2 cuda_rep=1.05  │
D) v5   / greedy  │ BD6.5      (.planck)│ rep_pen=1.00 ngram=0 cuda_rep=1.00  │
                  └─────────────────────┴─────────────────────────────────────┘
```

`temperature` is already 0 in production, so every cell decodes by
argmax and is deterministic. The "current" and "greedy" labels above
differ ONLY in whether the repetition penalty and the ngram blocker
reshape the logits before that argmax.
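
To make the difference between the two labels concrete, here is a minimal sketch of one argmax step with and without the two knobs. It assumes the common CTRL-style repetition penalty and a standard no-repeat-ngram blocker; the production sampler lives in `src/organs/organ_manager.cpp` and may differ in detail. The function names are hypothetical, and `cuda_rep` is not modelled because its exact semantics are not described in this report.

```python
from typing import List, Sequence, Set


def apply_repetition_penalty(logits: List[float], history: Sequence[int],
                             penalty: float) -> List[float]:
    """Lower the score of every token already emitted (CTRL-style, assumed)."""
    out = list(logits)
    for tok in set(history):
        # Dividing a positive logit and multiplying a negative one both reduce it.
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


def banned_by_ngram(history: Sequence[int], n: int) -> Set[int]:
    """Tokens that would complete an n-gram already present in history."""
    if n <= 0 or len(history) < n - 1:
        return set()
    prefix = tuple(history[len(history) - (n - 1):]) if n > 1 else ()
    banned = set()
    for i in range(len(history) - n + 1):
        if tuple(history[i:i + n - 1]) == prefix:
            banned.add(history[i + n - 1])
    return banned


def next_token(logits: List[float], history: Sequence[int],
               rep_penalty: float = 1.0, no_repeat_ngram: int = 0) -> int:
    """One deterministic decode step: reshape the logits, then take the argmax."""
    scores = apply_repetition_penalty(logits, history, rep_penalty)
    for tok in banned_by_ngram(history, no_repeat_ngram):
        scores[tok] = float("-inf")
    return max(range(len(scores)), key=scores.__getitem__)


# Toy logits: token 0 narrowly beats token 1 and has already been emitted once.
greedy_pick = next_token([2.0, 1.9, 0.5], history=[0])
penalized_pick = next_token([2.0, 1.9, 0.5], history=[0], rep_penalty=1.15)
# Bigram (0, 1) was already emitted, so token 1 is blocked after another 0.
blocked_pick = next_token([0.1, 3.0, 0.2], history=[0, 1, 0], no_repeat_ngram=2)
```

With `rep_penalty=1.00` and `no_repeat_ngram=0` the step reduces to a plain argmax, which is what the "greedy" cells run. Even at temperature 0, a penalty of 1.15 is enough to flip a close call between two tokens.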

## Headline results

| cell | pack | spec | anchor pass | rate |
|------|------|------|-------------|------|
| A | production BD6 pass-1 | rep=1.15, ng=2 (current) | **19/19** | 100.0 % |
| B | production BD6 pass-1 | rep=1.00, ng=0 (greedy)  | **1 /19** | 5.3 % |
| C | BD6.5 (rejected v5)   | rep=1.15, ng=2 (current) | **15/19** | 78.9 % |
| D | BD6.5 (rejected v5)   | rep=1.00, ng=0 (greedy)  | **16/19** | 84.2 % |

## Per-row truth table (only the 4 BD6.5 holdouts and the moves)

| task_id        | tgt len | A prod/cur | B prod/greedy | C v5/cur | D v5/greedy | verdict |
|----------------|---------|------------|----------------|----------|-------------|---------|
| MBPP/53        | 256     | ✓ | ✗ | **✗ holdout** | **✗ still fail** | weights — both decode modes can't recover |
| HumanEval/34   | 182     | ✓ | ✗ | **✗ holdout** | ✓ greedy fixed | DECODER noise — recoverable |
| HumanEval/45   | 249     | ✓ | ✗ | **✗ holdout** | ✓ greedy fixed | DECODER noise — recoverable |
| HumanEval/85   | 366     | ✓ | ✗ | **✗ holdout** | **✗ still fail** | weights — both decode modes can't recover |
| MBPP/20        | 66      | ✓ | ✗ | ✓        | **✗ greedy lost** | rep-penalty HELPS this short-target row |
| HumanEval/53   | 101     | ✓ | ✓ | ✓        | ✓             | invariant under both modes |

The 4 holdouts split evenly: **HE/34 and HE/45 are decoder-noise
holdouts**, recoverable by removing rep-penalty. **MBPP/53 and HE/85 are
real weight-gap holdouts**, robust to decoder change.
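
The verdict column follows mechanically from the C and D cells. The sketch below encodes that rule using the six rows of the table; the function name and verdict labels are illustrative, not part of `anchor_eval.py`.

```python
def classify(c_pass: bool, d_pass: bool) -> str:
    """Map (v5/current, v5/greedy) pass flags to a diagnosis."""
    if c_pass and d_pass:
        return "invariant"
    if not c_pass and d_pass:
        return "decoder-noise holdout"   # removing the penalty recovers the row
    if not c_pass and not d_pass:
        return "weight-gap holdout"      # fails under both decode modes
    return "penalty-dependent"           # passes only with the penalty active


# (C v5/cur, D v5/greedy) copied from the truth table above.
rows = {
    "MBPP/53": (False, False),
    "HumanEval/34": (False, True),
    "HumanEval/45": (False, True),
    "HumanEval/85": (False, False),
    "MBPP/20": (True, False),
    "HumanEval/53": (True, True),
}

verdicts = {task: classify(c, d) for task, (c, d) in rows.items()}
for task, verdict in verdicts.items():
    print(f"{task:14s} {verdict}")
```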

## The production-fragility surprise

**A → B drops 19/19 → 1/19** (only HE/53 survives pure greedy on the
production pack). This is much more than a quality dip — BD6 pass-1
is essentially **unable to produce usable Python** without the
repetition-penalty + ngram-blocker active.

That's because:
* `anchor_positive.jsonl` (the 19-row positive target set used by every
  BD6.x training pass) was captured by running the production pack
  **with the production sampler** (rep=1.15, ng=2) and recording the
  outputs. The "anchor 19/19" measurement that has gated every BD6.x
  pass is a function of WEIGHTS × DECODER, not weights alone.
* Without the penalty, the BD6 pass-1 base distribution falls into
  loops/repetition on Python keywords (`return`, `def`, ` `, etc.) and
  the verifier rejects almost everything.

**Implication:** the strict gate "anchor 19/19" was always testing the
joint weights+decoder system, never weights in isolation. Calling a
pack "production-ready" means "ready under THIS decoder config." Any
proposed runtime decoder change has to re-run the gate from scratch.
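
One way to make that dependency explicit is to store a fingerprint of the decoder spec next to every gate result, and to treat the result as stale when the live spec no longer matches. This is a sketch only: the record layout and both function names are hypothetical, and the spec keys are taken from the production spec listed in this report.

```python
import hashlib
import json


def decoder_fingerprint(spec: dict) -> str:
    """Stable short hash of a decoder spec (key order does not matter)."""
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def gate_is_valid(record: dict, live_spec: dict) -> bool:
    """A gate pass only counts under the decoder spec it was measured with."""
    full_pass = record["passed"] == record["total"]
    same_decoder = record["decoder"] == decoder_fingerprint(live_spec)
    return full_pass and same_decoder


prod_spec = {"rep_penalty": 1.15, "no_repeat_ngram": 2,
             "cuda_repetition_penalty": 1.05}
greedy_spec = {"rep_penalty": 1.00, "no_repeat_ngram": 0,
               "cuda_repetition_penalty": 1.00}

# Hypothetical gate record for cell A (19/19 under the production decoder).
record = {"passed": 19, "total": 19, "decoder": decoder_fingerprint(prod_spec)}

still_valid = gate_is_valid(record, prod_spec)      # same decoder: result holds
stale = not gate_is_valid(record, greedy_spec)      # decoder changed: re-run gate
```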

## What this means for the BD6.5 holdouts

Per the user's decision tree:
> If deterministic fixes holdouts: next = decode/extractor hardening
> If deterministic does not fix: next = BD6.8D token-weighted CE on BD6.5 dataset shape

The answer is **both**, partitioned:

* **HE/34, HE/45** — fixable by decoder change. Removing rep-penalty
  on phys05_code_skeleton recovers them. But that same change drops
  production to 1/19, so we cannot ship that decoder. The path to
  fixing these without breaking production is to find an
  **intermediate decoder spec** (e.g., rep=1.05 + ngram=0, or
  rep=1.15 + ngram=0, etc.) — a scan in the (rep_penalty, ngram)
  hyperparameter space that keeps prod at 19/19 while letting v5 reach
  17/19, i.e. both decoder-noise rows recovered without losing any other
  row. This is **Lever F+**: a runtime hyperparameter scan, not
  surgery.

* **MBPP/53 (256 chars), HE/85 (366 chars)** — *not* fixable by
  decoder. Both fail under greedy on v5 too. These are real weight
  gaps. **This is where Lever D (token-weighted CE) is still on the
  table.** The diagnostic from BD6.5/BD6.6 stands: long-target loss
  dominates gradient and the LoRA "tries hard" early then drifts under
  poison pressure. Token-weighted CE addresses exactly that.

* **MBPP/20**: surprise. The rep-penalty *helps* this short-target row
  (66 chars); removing it loses the row. So the (rep_penalty, ngram)
  choice can rescue some rows and break others depending on row
  content, which makes scan-only Lever F+ a delicate problem.

## The cleaner reading of the BD6.x trajectory

The history:

```
BD6 pass-1   anchor 19/19   under (rep=1.15, ng=2)        — production
BD6.3        anchor  0/19   under (rep=1.15, ng=2)        — REVERT
BD6.4        anchor  7/19   under (rep=1.15, ng=2)        — REVERT
BD6.5        anchor 15/19   under (rep=1.15, ng=2)        — REVERT (15/19 < 19/19)
BD6.5        anchor 16/19   under (rep=1.00, ng=0)        — JUST MEASURED, reframes BD6.5
BD6.6        anchor 11/19   under (rep=1.15, ng=2)        — REVERT
BD6.7a-c     anchor 12/19, 10/19, 12/19   under (rep=1.15, ng=2)  — REVERT (KL ladder fail)
```

**BD6.5 was 16/19 under greedy**, closer to the 19/19 ceiling than any
of the post-replication or KL attempts. The 15/19 we gated on was
measured through a decoder tuned for the production weights, so it
under-reports BD6.5 by one row. **BD6.5 is even better than we thought
as a weight artifact.** It failed the gate not because its weights are
"15/19 weights" but because the gate's "production decoder" config was
co-engineered with the production pack.

## Recommended order for BD6.8 (pending user GO)

### Step 1 — Lever F+ : (rep_penalty, ngram) scan on both packs

Test a small grid of decoder specs against both packs:
* (1.15, 2)  — production current (control)
* (1.10, 2)
* (1.05, 2)
* (1.15, 0)
* (1.10, 0)
* (1.05, 0)
* (1.00, 0) — greedy already done

Find any (rep, ng) where:
* production = 19/19 (no regression)
* v5 ≥ 17/19

If such a sweet spot exists, **shipping just becomes a runtime config
change** with NO surgery needed. v5 can be promoted with a new decoder
spec.

If no sweet spot exists (production is too brittle to small changes),
go to step 2.
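
The scan loop itself is small. The sketch below assumes a hypothetical `run_eval(pack, spec)` callable that wraps the `anchor_eval.py` harness and returns the number of anchors passed; no such wrapper is described in this report. The demo call uses only the four cells already measured, so it says nothing about the untested grid points.

```python
from typing import Callable, List, Sequence, Tuple

Spec = Tuple[float, int]  # (rep_penalty, no_repeat_ngram)

GRID: List[Spec] = [(1.15, 2), (1.10, 2), (1.05, 2),
                    (1.15, 0), (1.10, 0), (1.05, 0), (1.00, 0)]


def find_sweet_spots(run_eval: Callable[[str, Spec], int],
                     grid: Sequence[Spec] = GRID,
                     prod_min: int = 19,
                     v5_min: int = 17) -> List[Tuple[Spec, int, int]]:
    """Return every spec where production holds the gate and v5 clears its bar."""
    hits = []
    for spec in grid:
        prod = run_eval("prod", spec)
        if prod < prod_min:
            continue  # production regressed, so the v5 run can be skipped
        v5 = run_eval("v5", spec)
        if v5 >= v5_min:
            hits.append((spec, prod, v5))
    return hits


# The four cells measured in this report (A, B, C, D).
measured = {
    ("prod", (1.15, 2)): 19, ("prod", (1.00, 0)): 1,
    ("v5", (1.15, 2)): 15, ("v5", (1.00, 0)): 16,
}

hits = find_sweet_spots(lambda pack, spec: measured[(pack, spec)],
                        grid=[(1.15, 2), (1.00, 0)])
```

On the measured cells alone `hits` is empty: the production spec leaves v5 at 15/19, and pure greedy drops production to 1/19. The five intermediate specs are the open question.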

### Step 2 — Lever D : token-weighted CE on BD6.5 dataset shape

Now narrowly aimed at MBPP/53 and HE/85 (the two weight-gap holdouts).
Use BD6.5 dataset unchanged (the peak data shape). Modify trainer's
loss step to scale per-row CE by `1 / sqrt(target_token_count)`. ~10
lines. Train. Gate.
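
A framework-free sketch of that weighting, to show its effect on the balance between short and long rows. It assumes the per-row CE is a sum over target tokens (the report does not say whether it is a sum or a mean), and the function names are hypothetical rather than taken from the trainer.

```python
import math
from typing import Sequence


def row_weight(target_token_count: int) -> float:
    """Per-row scale factor 1 / sqrt(target_token_count)."""
    return 1.0 / math.sqrt(max(target_token_count, 1))


def token_weighted_loss(per_row_token_nll: Sequence[Sequence[float]]) -> float:
    """Batch loss where each row's summed CE is scaled by 1 / sqrt(length)."""
    total = 0.0
    for token_nll in per_row_token_nll:
        row_ce = sum(token_nll)  # assumption: row CE is summed over target tokens
        total += row_weight(len(token_nll)) * row_ce
    return total / len(per_row_token_nll)


# Two toy rows with identical per-token loss: 4 target tokens vs 16.
short_row = [1.0] * 4
long_row = [1.0] * 16

unweighted_ratio = sum(long_row) / sum(short_row)
weighted_ratio = (row_weight(16) * sum(long_row)) / (row_weight(4) * sum(short_row))
batch_loss = token_weighted_loss([short_row, long_row])
```

Under this assumption the long row's share of the loss grows with the square root of its length instead of linearly: a 4x longer row contributes 2x rather than 4x.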

DO NOT add KL again — BD6.7 ladder showed KL is a redundant lever
when teacher = student size.

## Production state (after BD6.8F revert)

* `PHYS05_PACK = physarum05b_code_skeleton.planck` (BD6 pass-1, restored).
* `phys05_code_skeleton` spec: `rep_penalty=1.15, no_repeat_ngram=2,
  cuda_repetition_penalty=1.05` (restored).
* Anchor 19/19 was re-verified post-revert, in the same run as this
  report. The revert+verify trail is in
  `reports/BD6_7_KL_ANCHOR_LADDER.md`; this BD6.8F probe ran after that
  and used the same restored production state.

## Files this probe touched

* `src/organs/organ_manager.cpp::PHYS05_PACK` — flipped to v5 then back to v1
* `src/organs/organ_manager.cpp::add05("phys05_code_skeleton", …)` —
  rep_penalty/ngram/cuda_rep edited 1.15/2/1.05 ↔ 1.00/0/1.00 then restored
* `reports/BD6_8F_RUNTIME_DETERMINISM_TEST.md` — this file

No data files written, no LoRA produced, no .planck repacked.

## What this proves

* **The strict gate is not a pure-weights gate.** It measures
  weights × decoder. Any future surgery decision should treat the
  decoder spec as a *tunable* variable, not a fixed constant.
* **Production BD6 pass-1 is decoder-brittle.** Drop rep-penalty and
  it falls to 1/19. This is hidden risk — if a future runtime change
  ever touches sampler defaults, production silently degrades.
  Worth flagging as "PHYS05_DECODER_LOCKED" in CURRENT_TRUTH_LEDGER.
* **BD6.5 is closer to 19/19 than the gate showed.** Under greedy
  it's 16/19; the 15/19 gate measurement under production decoder
  *under-reports* its weight quality.
* **The 4 BD6.5 holdouts are 2+2.** Two are decoder-noise (HE/34,
  HE/45 — fixable by tuning rep_penalty). Two are real weight gaps
  (MBPP/53, HE/85 — need Lever D / token-weighted CE).

The next move is the (rep_penalty, ngram) scan, not training. If
the scan finds a (production = 19/19, v5 ≥ 17/19) sweet spot, the
Step 1 bar, BD6.5 ships with NO new surgery. If not, only then does
BD6.8D start.
