
BD6.8F — runtime determinism diagnostic, the surprise is in production (2026-05-02)


TL;DR — the question was "does greedy decode fix the 4 BD6.5 holdouts?". The answer is split (2 yes, 2 no). The bigger discovery: PRODUCTION BD6 pass-1 collapses from 19/19 to 1/19 under pure greedy. anchor_positive.jsonl was captured WITH rep_penalty=1.15 + no_repeat_ngram=2 + cuda_rep_penalty=1.05 in the loop, so the gate's "19/19" is a co-engineered weights+decoder measurement, not a pure-weights one. BD6.5 weights are actually MORE robust under greedy (16/19) than the production pack (1/19). This reframes the entire BD6.x cycle: "weights vs decoder" is a tangled diagnosis, and surgery alone cannot win without considering decoder co-design.


What was tested

No training. No LoRA. No merge. No architecture change. Just four runtime configurations, each running the same 19 anchor prompts via the production anchor_eval.py harness:

                       PHYS05 PACK            sampler spec for phys05_code_skeleton
                  ┌─────────────────────┬─────────────────────────────────────┐
A) prod / current │ BD6 pass-1 (.planck)│ rep_pen=1.15 ngram=2 cuda_rep=1.05  │
B) prod / greedy  │ BD6 pass-1 (.planck)│ rep_pen=1.00 ngram=0 cuda_rep=1.00  │
C) v5   / current │ BD6.5      (.planck)│ rep_pen=1.15 ngram=2 cuda_rep=1.05  │
D) v5   / greedy  │ BD6.5      (.planck)│ rep_pen=1.00 ngram=0 cuda_rep=1.00  │
                  └─────────────────────┴─────────────────────────────────────┘

Temperature is already 0 in production, so every cell decodes by argmax (greedy in the strict sense). The "current" and "greedy" labels above differ ONLY in the repetition penalties and the ngram blocker.
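
The four cells are simply the cross product of two packs and two sampler specs. A minimal sketch of that matrix follows; the `SamplerSpec` class, its field names, and the pack keys are illustrative assumptions and are not taken from anchor_eval.py.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class SamplerSpec:
    # Field names are hypothetical; the real harness may name them differently.
    rep_penalty: float
    no_repeat_ngram: int
    cuda_rep_penalty: float
    temperature: float = 0.0  # already 0 in production, so decoding is argmax


SPECS = {
    "current": SamplerSpec(rep_penalty=1.15, no_repeat_ngram=2, cuda_rep_penalty=1.05),
    "greedy": SamplerSpec(rep_penalty=1.00, no_repeat_ngram=0, cuda_rep_penalty=1.00),
}
PACKS = {"prod": "BD6 pass-1", "v5": "BD6.5"}

# Cells A-D, in the same order as the table above.
CELLS = {
    label: (pack_key, spec_key)
    for label, (pack_key, spec_key) in zip("ABCD", product(PACKS, SPECS))
}

for label, (pack_key, spec_key) in CELLS.items():
    print(label, PACKS[pack_key], spec_key, SPECS[spec_key])
```

Keeping temperature inside the spec makes it explicit that the two labels share it and differ only in the three penalty fields.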

Headline results

| cell | pack | spec | anchor pass | rate |
|------|------|------|-------------|------|
| A | production BD6 pass-1 | rep=1.15, ng=2 (current) | 19/19 | 100.0 % |
| B | production BD6 pass-1 | rep=1.00, ng=0 (greedy) | 1/19 | 5.3 % |
| C | BD6.5 (rejected v5) | rep=1.15, ng=2 (current) | 15/19 | 78.9 % |
| D | BD6.5 (rejected v5) | rep=1.00, ng=0 (greedy) | 16/19 | 84.2 % |
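
The rates and the per-pack effect of dropping the penalties can be re-derived from the pass counts alone. A small sketch, using only the four counts reported in this section:

```python
ANCHOR_TOTAL = 19
passes = {"A": 19, "B": 1, "C": 15, "D": 16}

# Pass rate per cell, in percent, rounded to one decimal place.
rates = {cell: round(100.0 * n / ANCHOR_TOTAL, 1) for cell, n in passes.items()}
print(rates)

# Net change in passing rows when the penalties are removed, per pack.
delta_prod = passes["B"] - passes["A"]  # production pack
delta_v5 = passes["D"] - passes["C"]    # BD6.5 pack
print(delta_prod, delta_v5)
```

The production pack loses 18 rows while BD6.5 gains one net row, which is the asymmetry the rest of this report is about.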

Per-row truth table (only the 4 BD6.5 holdouts and the moves)

| task_id | tgt len | A prod/cur | B prod/greedy | C v5/cur | D v5/greedy | verdict |
|----------------|---------|------------|----------------|----------|-------------|---------|
| MBPP/53 | 256 | ✓ | ✗ | ✗ holdout | ✗ still fail | weights: both decode modes can't recover |
| HumanEval/34 | 182 | ✓ | ✗ | ✗ holdout | ✓ greedy fixed | DECODER noise: recoverable |
| HumanEval/45 | 249 | ✓ | ✗ | ✗ holdout | ✓ greedy fixed | DECODER noise: recoverable |
| HumanEval/85 | 366 | ✓ | ✗ | ✗ holdout | ✗ still fail | weights: both decode modes can't recover |
| MBPP/20 | 66 | ✓ | ✗ | ✓ | ✗ greedy lost | rep-penalty HELPS this short prompt |
| HumanEval/53 | 101 | ✓ | ✓ | ✓ | ✓ | invariant under both modes |

The 4 holdouts split evenly: HE/34 and HE/45 are decoder-noise holdouts, recoverable by removing rep-penalty. MBPP/53 and HE/85 are real weight-gap holdouts that fail under both decoder specs.
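
The verdict column follows mechanically from the two v5 columns (C and D). The rule can be sketched as below; the function name and verdict strings are illustrative, while the pass/fail values are copied from the table.

```python
def verdict(v5_current: bool, v5_greedy: bool) -> str:
    """Classify one anchor row from its BD6.5 outcomes under both decoder specs."""
    if v5_current and v5_greedy:
        return "invariant"
    if not v5_current and v5_greedy:
        return "decoder noise"   # removing the penalties recovers the row
    if not v5_current and not v5_greedy:
        return "weight gap"      # neither decoder spec recovers the row
    return "rep-penalty helps"   # the row passes only with the penalties on


# (A prod/cur, B prod/greedy, C v5/cur, D v5/greedy) per task.
ROWS = {
    "MBPP/53": (True, False, False, False),
    "HumanEval/34": (True, False, False, True),
    "HumanEval/45": (True, False, False, True),
    "HumanEval/85": (True, False, False, False),
    "MBPP/20": (True, False, True, False),
    "HumanEval/53": (True, True, True, True),
}

verdicts = {task: verdict(c, d) for task, (_a, _b, c, d) in ROWS.items()}
for task, label in verdicts.items():
    print(f"{task:14s} {label}")
```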

The production-fragility surprise

A → B drops 19/19 → 1/19 (only HE/53 survives pure greedy on the production pack). This is much more than a quality dip — BD6 pass-1 is essentially unable to produce usable Python without the repetition-penalty + ngram-blocker active.

That's because:

BD6.x training pass) was captured by running the production pack with the production sampler (rep=1.15, ng=2) and recording the outputs. The "anchor 19/19" measurement that has gated every BD6.x pass is a function of WEIGHTS × DECODER, not weights alone.

loops/repetition on Python keywords (return, def, , etc.) and the verifier rejects almost everything.

Implication: the strict gate "anchor 19/19" was always testing the joint weights+decoder system, never weights in isolation. Calling a pack "production-ready" means "ready under THIS decoder config." Any proposed runtime decoder change has to re-run the gate from scratch.
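
To make "weights × decoder" concrete: even at temperature 0, the weights only produce logits, and the decoder can still rewrite those logits before the argmax. The sketch below assumes the commonly used multiplicative repetition penalty (divide positive logits of already-emitted tokens, multiply negative ones) and plain n-gram blocking. How the production runtime actually implements these, and how cuda_rep_penalty composes with rep_penalty, is not described in this report, so treat this as an illustration rather than the production code path.

```python
import math
from typing import List, Sequence


def apply_repetition_penalty(logits: Sequence[float], history: Sequence[int],
                             penalty: float) -> List[float]:
    """Make every already-emitted token less likely (penalty 1.0 is a no-op)."""
    out = list(logits)
    for tok in set(history):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


def block_repeated_ngrams(logits: Sequence[float], history: Sequence[int],
                          n: int) -> List[float]:
    """Ban any token that would complete an n-gram already present in history."""
    out = list(logits)
    if n < 1 or len(history) < n:
        return out  # blocker disabled, or no complete n-gram emitted yet
    prefix = list(history[len(history) - (n - 1):])
    for i in range(len(history) - n + 1):
        if list(history[i:i + n - 1]) == prefix:
            out[history[i + n - 1]] = -math.inf
    return out


def greedy_next(logits: Sequence[float], history: Sequence[int],
                rep_penalty: float = 1.0, no_repeat_ngram: int = 0) -> int:
    """Argmax over the decoder-adjusted logits."""
    adjusted = apply_repetition_penalty(logits, history, rep_penalty)
    adjusted = block_repeated_ngrams(adjusted, history, no_repeat_ngram)
    return max(range(len(adjusted)), key=adjusted.__getitem__)


logits = [2.0, 1.9, 0.5]
print(greedy_next(logits, history=[0]))                    # 0: pure argmax
print(greedy_next(logits, history=[0], rep_penalty=1.15))  # 1: 2.0 / 1.15 < 1.9
```

Identical logits yield different tokens under the two specs, which is why a pass count measured under one spec says little about the other.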

What this means for the BD6.5 holdouts

Per the user's decision tree:

If deterministic fixes holdouts: next = decode/extractor hardening
If deterministic does not fix: next = BD6.8D token-weighted CE on BD6.5 dataset shape

The answer is both, partitioned:

on phys05_code_skeleton recovers them. But that same change drops production to 1/19, so we cannot ship that decoder. The path to fixing these without breaking production is to find an intermediate decoder spec (e.g., rep=1.05 + ngram=0, or rep=1.15 + ngram=0, etc.) — a scan in the (rep_penalty, ngram) hyperparameter space that keeps prod at 19/19 while letting v5 reach 18-19/19. This is Lever F+ — runtime hyperparameter scan, not surgery.

decoder. Both fail under greedy on v5 too. These are real weight gaps. This is where Lever D (token-weighted CE) is still on the table. The diagnostic from BD6.5/BD6.6 stands: long-target loss dominates gradient and the LoRA "tries hard" early then drifts under poison pressure. Token-weighted CE addresses exactly that.

chars). Removing it loses the row. This means the (rep_penalty, ngram) choice has both anchoring and de-anchoring effects depending on row content, which makes scan-only Lever F+ a delicate problem.

The cleaner reading of the BD6.x trajectory

The history:

BD6 pass-1   anchor 19/19   under (rep=1.15, ng=2)        — production
BD6.3        anchor  0/19   under (rep=1.15, ng=2)        — REVERT
BD6.4        anchor  7/19   under (rep=1.15, ng=2)        — REVERT
BD6.5        anchor 15/19   under (rep=1.15, ng=2)        — REVERT (15/19 < 19/19)
BD6.5        anchor 16/19   under (rep=1.00, ng=0)        — JUST MEASURED, reframes BD6.5
BD6.6        anchor 11/19   under (rep=1.15, ng=2)        — REVERT
BD6.7a-c     anchor 12/19, 10/19, 12/19   under (rep=1.15, ng=2)  — REVERT (KL ladder fail)

BD6.5 was 16/19 under greedy, closer to the 19/19 ceiling than any of the post-replication or KL attempts. The 15/19 measurement we gated on was depressed by decoder co-engineering, so BD6.5 is better than we thought as a weight artifact. It failed the gate not because its weights are worth only 15/19, but because the gate's "production decoder" config was tuned for production weights.

Recommended order for BD6.8 (pending user GO)

Step 1 — Lever F+ : (rep_penalty, ngram) scan on both packs

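Test a small grid of decoder specs against both packs (for example rep=1.05 + ngram=0, or rep=1.15 + ngram=0), and look for any (rep, ng) where production stays at 19/19 and v5 reaches at least 18/19.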

If such a sweet spot exists, shipping just becomes a runtime config change with NO surgery needed. v5 can be promoted with a new decoder spec.

If no sweet spot exists (production is too brittle to small changes), go to step 2.
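
A minimal sketch of the Step 1 scan, assuming a hypothetical `evaluate(pack, rep, ng)` callable that runs the 19 anchors and returns the pass count (no such function is documented for anchor_eval.py). The thresholds are the criteria stated in this report: production at 19/19 and v5 at 18/19 or better. The lookup table holds only the four cells measured here; no other grid point has been run.

```python
from typing import Callable, Iterable, List, Tuple

Spec = Tuple[float, int]  # (rep_penalty, no_repeat_ngram)


def scan_decoder_grid(
    evaluate: Callable[[str, float, int], int],
    specs: Iterable[Spec],
    prod_min: int = 19,
    v5_min: int = 18,
) -> List[Tuple[Spec, int, int]]:
    """Return every spec that keeps production at the gate while v5 clears its bar."""
    hits = []
    for rep, ng in specs:
        prod_pass = evaluate("prod", rep, ng)
        if prod_pass < prod_min:
            continue  # production regressed, so v5 does not need scoring
        v5_pass = evaluate("v5", rep, ng)
        if v5_pass >= v5_min:
            hits.append(((rep, ng), prod_pass, v5_pass))
    return hits


# Only the four cells measured in this report; anything else is still unknown.
MEASURED = {
    ("prod", 1.15, 2): 19,
    ("prod", 1.00, 0): 1,
    ("v5", 1.15, 2): 15,
    ("v5", 1.00, 0): 16,
}


def lookup(pack: str, rep: float, ng: int) -> int:
    return MEASURED[(pack, rep, ng)]


print(scan_decoder_grid(lookup, [(1.15, 2), (1.00, 0)]))
```

Scoring production first and skipping v5 on a regression roughly halves the cost of a scan when production turns out to be brittle.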

Step 2 — Lever D : token-weighted CE on BD6.5 dataset shape

Now narrowly aimed at MBPP/53 and HE/85 (the two weight-gap holdouts). Use BD6.5 dataset unchanged (the peak data shape). Modify trainer's loss step to scale per-row CE by 1 / sqrt(target_token_count). ~10 lines. Train. Gate.
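
The loss change can be sketched framework-free as below. It assumes "per-row CE" means the cross-entropy summed over a row's target tokens and that the batch loss is the mean over rows; neither detail is specified here, and the function name is hypothetical. The two example rows use made-up lengths (64 and 256 tokens) with the same per-token loss of 0.5.

```python
import math
from typing import Sequence


def token_weighted_ce(row_ce_sums: Sequence[float],
                      target_token_counts: Sequence[int]) -> float:
    """Mean over rows of (summed row CE) / sqrt(target token count)."""
    if len(row_ce_sums) != len(target_token_counts):
        raise ValueError("one token count is required per row")
    if not row_ce_sums:
        raise ValueError("empty batch")
    if any(n <= 0 for n in target_token_counts):
        raise ValueError("target token counts must be positive")
    weighted = [ce / math.sqrt(n) for ce, n in zip(row_ce_sums, target_token_counts)]
    return sum(weighted) / len(weighted)


# Same per-token loss (0.5), different target lengths.
short_row, long_row = 0.5 * 64, 0.5 * 256        # summed CE: 32.0 and 128.0
print(long_row / short_row)                       # unweighted, the long row counts 4x
print(token_weighted_ce([short_row, long_row], [64, 256]))
```

Under this weighting the long row contributes 8.0 against 4.0 for the short row, so a 4x length gap becomes a 2x gap in loss contribution instead of disappearing entirely, as a full per-token mean would make it.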

DO NOT add KL again — BD6.7 ladder showed KL is a redundant lever when teacher = student size.

Production state (after BD6.8F revert)

cuda_repetition_penalty=1.05` (restored).

this report — see reports/BD6_7_KL_ANCHOR_LADDER.md for the revert+verify trail; this BD6.8F probe was run after that and uses the same restored production state).

Files this probe touched

rep_penalty/ngram/cuda_rep edited 1.15/2/1.05 ↔ 1.00/0/1.00 then restored

No data files written, no LoRA produced, no .planck repacked.

What this proves

weights × decoder. Any future surgery decision should treat the decoder spec as a tunable variable, not a fixed constant.

it falls to 1/19. This is hidden risk — if a future runtime change ever touches sampler defaults, production silently degrades. Worth flagging as "PHYS05_DECODER_LOCKED" in CURRENT_TRUTH_LEDGER.

it's 16/19; the 15/19 gate measurement under production decoder under-reports its weight quality.

HE/45 — fixable by tuning rep_penalty). Two are real weight gaps (MBPP/53, HE/85 — need Lever D / token-weighted CE).
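
One way the "PHYS05_DECODER_LOCKED" flag could be enforced at runtime is a drift check against the locked values. This is only a sketch: the dictionary keys and the function are hypothetical, and only the three numeric values come from this report.

```python
from typing import Dict, Tuple

# Values under which the 19/19 anchor gate was measured.
LOCKED_PHYS05_SPEC = {
    "rep_penalty": 1.15,
    "no_repeat_ngram": 2,
    "cuda_rep_penalty": 1.05,
}


def decoder_drift(active: Dict[str, float]) -> Dict[str, Tuple[float, object]]:
    """Return {key: (locked, active)} for every sampler field that deviates."""
    return {
        key: (locked, active.get(key))
        for key, locked in LOCKED_PHYS05_SPEC.items()
        if active.get(key) != locked
    }


greedy = {"rep_penalty": 1.00, "no_repeat_ngram": 0, "cuda_rep_penalty": 1.00}
print(decoder_drift(LOCKED_PHYS05_SPEC))  # {} means the gate result still applies
print(decoder_drift(greedy))
```

Any non-empty result would mean the anchor gate has to be re-run before the pack can still be called production-ready.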

The next move is the (rep_penalty, ngram) scan, not training. If the scan finds a (production ≥ 19/19, v5 ≥ 18/19) sweet spot, BD6.5 ships with NO new surgery. If not, only then does BD6.8D start.