BD6.5 — stratified anchor curriculum, 15/19 anchor, gated short of strict 19/19 (2026-05-02)

**TL;DR: best surgery pass yet. Stratified anchor/poison interleaving, a 53 % anchor share, and bench-aware replication (HE 20×, MBPP-long 20×, MBPP-short 10×) drove the anchor pass rate from 7/19 (BD6.4) to 15/19 (78.9 %), with only 4 regressions: HumanEval/34, /45, /85 and MBPP/53, all long-target anchors. The strict 19/19 gate still rejected the pack, but the trajectory is now 0 → 7 → 15. Production reverted, v5 archived; the 4 holdouts inform BD6.6.**

Pipeline (PYTHON_QUARANTINE-compliant)

production: physarum05b_code_skeleton.planck   (BD6 pass-1, anchor 19/19)
anchor_positive.jsonl: 19 captured pass-1 outputs

bench-aware replication (BD6.5 lever A):
  HE anchors        × 20 = 120 rows   (longest targets, most fragile)
  MBPP long anchors × 20 =  60 rows   (target ≥150 chars)
  MBPP short anchors× 10 = 100 rows   (simple defs survive easier)
  → 280 anchor rows total
+ poison_train.jsonl 245 with refs
= bd6_5_mixed_train.jsonl, 525 rows, 53.3 % anchor share

stratified trainer (BD6.5 lever B): tools/surgery/train_code_skeleton_lora_bd6_5.py
  --rank 8 --alpha 16 --lr 5e-5 --epochs 1 --checkpoint-steps 100
  gradient steps alternate anchor / poison so no batch is
  pure-poison; checkpoint every 100 steps for retrospective pick

trainable params = 1.08 M / 495 M = 0.22 %
loss curve:
  step  25  loss 1.79   (cold start)
  step  50  loss 0.92
  step 100  loss 0.12   ckpt_step100
  step 200  loss 0.29   ckpt_step200
  step 300  loss 1.08   ckpt_step300  (poison spike)
  step 400  loss 1.15   ckpt_step400  (poison spike)
  step 500  loss 0.05   ckpt_step500
  step 525  loss 0.07   final
  epoch avg 0.5354

merge → physarum05b_code_skeleton_v5.planck
flip → rebuild → anchor_eval

anchor gate (final adapter):
  ▼
  15 / 19 PASS  (rate 78.9 %)
  ▼
  threshold 85 % (≥ 17/19); strict spec 19/19 → REJECT
  ▼
also tried ckpt100 = 0/19 (mid-flight noise, far from convergence)
  ▼
REVERT: PHYS05_PACK = physarum05b_code_skeleton.planck
REBUILD
VERIFY: anchor 19/19 ✅ production safe
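
The bench-aware replication step (lever A) can be sketched as below. This is a minimal illustration, not the actual dataset builder: the row schema (`task_id`, `target`) and the helper names are hypothetical, and the bucket sizes (6 HE, 3 MBPP-long, 10 MBPP-short) are inferred from the row counts in the pipeline (120/20, 60/20, 100/10).

```python
# Replication factors and the 150-char cutoff come from the pipeline above;
# the row schema ("task_id", "target") is a hypothetical stand-in.
LONG_TARGET_CHARS = 150

def replication_factor(row):
    """HE and long MBPP anchors get 20 copies, short MBPP anchors get 10."""
    if row["task_id"].startswith("HumanEval/"):
        return 20
    if len(row["target"]) >= LONG_TARGET_CHARS:
        return 20
    return 10

def build_mixed_set(anchors, poison):
    """Return replicated anchors followed by poison rows, plus the anchor share."""
    replicated = [row for row in anchors for _ in range(replication_factor(row))]
    mixed = replicated + list(poison)
    return mixed, len(replicated) / len(mixed)

# Synthetic stand-ins with the same bucket sizes as the report:
# 6 HE anchors, 3 long MBPP anchors, 10 short MBPP anchors, 245 poison rows.
anchors = (
    [{"task_id": f"HumanEval/{i}", "target": "x" * 200} for i in range(6)]
    + [{"task_id": f"MBPP/{i}", "target": "x" * 200} for i in range(3)]
    + [{"task_id": f"MBPP/{100 + i}", "target": "x" * 40} for i in range(10)]
)
poison = [{"task_id": f"poison/{i}", "target": "y"} for i in range(245)]

mixed, share = build_mixed_set(anchors, poison)
print(len(mixed), f"{share:.1%}")  # 525 53.3%
```

The arithmetic reproduces the 280 + 245 = 525 rows and the 53.3 % anchor share quoted above.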

Numbers across all four BD6.x passes

| pass | dataset shape | r | lr | ep | anchor share | anchor post-merge | gate |
|-------------|----------------------------------|----|------|----|-------------|--------------------|------------|
| BD6 pass-1 | poison v1 (256 w/ refs) | 16 | 2e-4 | 3 | 0 % | 19/19 (defines) | KEEP |
| BD6.2 | union v1∪v2 (260) | 16 | 2e-4 | 4 | 0 % | not run | REVERT (post-bench MBPP regress) |
| BD6.3 | fresh-only (245) | 8 | 1e-4 | 1 | 0 % | 0 / 19 | REVERT |
| BD6.4 | fresh + 5× anchor (340) | 8 | 1e-4 | 1 | 28 % | 7 / 19 | REVERT |
| BD6.5 | fresh + bench-aware-anchor (525) | 8 | 5e-5 | 1 | 53 % | 15 / 19 | REVERT (still < strict 19/19) |

The trajectory is monotone: 0 → 7 → 15. Each lever moved the needle:

anchor introduced (BD6.4) ⇒ +7
anchor-share 28 % → 53 % + stratified batches + lower lr (BD6.5) ⇒ +8

Per-row anchor result on v5 final

KEPT (15):
  MBPP/17, MBPP/19, MBPP/20, MBPP/41, MBPP/51, MBPP/52,
  MBPP/64, MBPP/90, MBPP/93, MBPP/96, MBPP/99, MBPP/105
  HumanEval/23, HumanEval/27, HumanEval/53

LOST (4):
  MBPP/53        (target len ~256 chars, MBPP-long)
  HumanEval/34   (target len ~182 chars)
  HumanEval/45   (target len ~249 chars)
  HumanEval/85   (target len ~366 chars, longest anchor)

The pattern is now crisp: all 4 regressions are on long anchor targets. Of the 6 MBPP anchors that BD6.4 lost (MBPP/17, 20, 53, 90, 93, 96), 5 are now recovered; the exception, MBPP/53, is an MBPP-long anchor with a ~256-char target. The remaining holdouts are the longest, most content-rich anchors: MBPP/53 has a multi-clause target, and HumanEval/34, /45 and /85 are docstring-completion patterns where the LoRA still drifts under one epoch of poison co-training.

What's needed for BD6.6 (clear, narrow lever)

The pattern says the residual 4 losses are about target length, not about prompt difficulty. Two sharp leverage points:

Lever D — token-weighted loss on long anchors

Currently every anchor row enters the cross-entropy with the same weight, with no length normalization. Long targets have 2-3× more tokens, so their per-row loss looks bigger to the optimizer and the LoRA "tries hard" to memorize them in the first 50 steps, then drifts under continued poison gradient.

Fix: scale loss inversely to target length only on anchor rows (or upsample long-anchor batches further so they get more revisits). 10-line change.
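
A minimal sketch of the length scaling, assuming the current trainer sums per-token cross-entropy within each row (the report does not confirm this). `row_loss` and its arguments are hypothetical names; a real trainer would apply the same idea to tensors rather than Python lists.

```python
def row_loss(token_losses, is_anchor, length_normalize_anchors=True):
    """Combine per-token cross-entropy values into one per-row loss.

    Poison rows keep the plain token sum; anchor rows are divided by their
    token count, so a 3x longer anchor no longer yields a 3x larger row loss.
    """
    total = sum(token_losses)
    if is_anchor and length_normalize_anchors:
        return total / max(len(token_losses), 1)
    return total

short_anchor = [0.5] * 40    # 40 target tokens at 0.5 nats each
long_anchor = [0.5] * 120    # 3x longer target, same per-token loss

# Without normalization the long anchor dominates; with it, both rows match.
print(row_loss(short_anchor, True, False), row_loss(long_anchor, True, False))  # 20.0 60.0
print(row_loss(short_anchor, True), row_loss(long_anchor, True))                # 0.5 0.5
```

Leaving poison rows on the plain sum keeps the comparison with BD6.5 clean: only the anchor side of the objective changes.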

Lever E — KL-anchor on the 4 holdouts only

Run inference on [MBPP/53, HumanEval/34, /45, /85] with the frozen pass-1 base, capture top-k logits per token, and add a KL term to the LoRA training step that pulls the student toward those reference logits on those four prompts. This is a standard anti-forgetting technique, and it is practical here precisely because we have identified the four prompts that need it. ~30 lines.
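
The KL term could look roughly like the sketch below. It is written in plain Python for readability and makes several assumptions the report does not state: the reference logits are stored as a `{token_id: logit}` dict per target token, both distributions are renormalized over the same top-k ids, and the KL is averaged over tokens and added to the cross-entropy with a hypothetical `kl_weight`.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a short list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def topk_kl(ref_topk, student_logits):
    """KL(ref || student) restricted to the reference top-k token ids.

    ref_topk: {token_id: reference logit} captured from the frozen base.
    student_logits: full list of student logits, indexed by token id.
    """
    ids = sorted(ref_topk)
    p = softmax([ref_topk[i] for i in ids])
    q = softmax([student_logits[i] for i in ids])
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def holdout_loss(ce_loss, ref_topk_per_token, student_logits_per_token, kl_weight=1.0):
    """Cross-entropy plus the mean per-token KL pull toward the reference."""
    kls = [topk_kl(r, s) for r, s in zip(ref_topk_per_token, student_logits_per_token)]
    return ce_loss + kl_weight * sum(kls) / max(len(kls), 1)

ref = {0: 2.0, 3: 1.0}               # top-2 reference logits for one token
matching = [2.0, 0.0, 0.0, 1.0]      # student agrees with the reference
drifted = [0.0, 0.0, 0.0, 3.0]       # student now prefers token 3

print(topk_kl(ref, matching), topk_kl(ref, drifted) > 0)
```

The penalty is zero while the student matches the reference on the captured ids and grows as it drifts, which is the behaviour wanted on the four holdout prompts.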

Cheap path is D first, then E if D leaves any holdout.

Production state (after BD6.5 revert)

PHYS05_PACK = physarum05b_code_skeleton.planck (BD6 pass-1, unchanged).
MBPP B = 13/100, HumanEval B = 6/164, LCB B = 0/50, anchor 19/19.
physarum05b_code_skeleton_v5.planck archived.
physarum05b_code_skeleton_v5_ckpt100.planck archived (the 0/19 mid-flight snapshot, also a useful negative).
tools/surgery/output/code_skeleton_lora_v5/ includes: final adapter + 5 mid-checkpoints (ckpt_step100 … ckpt_step500).

What this proves

Anchor weight is a continuous lever, not a binary. Three data points trace a line: 0 % → 0/19, 28 % → 7/19, 53 % → 15/19. A 4th point at ~70 % share + length-weighted loss should hit 17–19/19.
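
As a rough check on that extrapolation, a least-squares line through the three points can be evaluated at a 70 % share. This is only a sketch: BD6.5 also changed the learning rate and the batching, so share is not the only variable, and three points say little about the true shape of the curve.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for a handful of points."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

shares = [0, 28, 53]   # anchor share in percent (BD6.3, BD6.4, BD6.5)
passes = [0, 7, 15]    # anchors kept out of 19

slope, intercept = fit_line(shares, passes)
raw = slope * 70 + intercept
projected = min(19.0, raw)  # clamp at the 19-anchor ceiling
print(f"slope {slope:.3f} anchors per share point, projected at 70 %: {projected:.1f}")
```

The unclamped fit lands slightly above the 19-anchor ceiling, which is consistent with the 17–19/19 expectation but does not guarantee it.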

Stratified minibatches matter. BD6.4's pure shuffled batches let many consecutive poison-only batches drift the LoRA between anchor revisits. BD6.5's anchor-poison-anchor-poison rhythm fixed that.
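
One way to produce that rhythm is a simple round-robin over the two pools, sketched below. How the real trainer places the 35 surplus anchor rows is not stated in this report; appending them at the end is an assumption, and the function names are hypothetical.

```python
from itertools import zip_longest

def stratified_order(anchor_rows, poison_rows):
    """Alternate anchor, poison, anchor, poison; leftovers go at the end."""
    order = []
    for a, p in zip_longest(anchor_rows, poison_rows):
        if a is not None:
            order.append(a)
        if p is not None:
            order.append(p)
    return order

def longest_run(order, kind):
    """Length of the longest streak of consecutive rows of the given kind."""
    best = run = 0
    for row in order:
        run = run + 1 if row[0] == kind else 0
        best = max(best, run)
    return best

anchors = [("anchor", i) for i in range(280)]
poison = [("poison", i) for i in range(245)]
order = stratified_order(anchors, poison)
print(len(order), longest_run(order, "poison"))  # 525 1
```

With this ordering a poison row is always preceded by an anchor row, so the longest poison-only streak is a single step, whereas a plain shuffle gives no such bound.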

Mid-checkpoints are not always safer. ckpt100 here was 0/19 even though its loss (0.12) was lower than ckpt300's (1.08). The LoRA is in a non-monotonic regime during the first epoch; the final adapter usually wins when you run a single epoch through a stratified curriculum.

The strict gate is correct. 15/19 looks tantalizing, but shipping a pack that drops 4 known wins is exactly the production pollution the user banned. Reject means reject.
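
The gate decision itself is small enough to sketch. `anchor_gate` is a hypothetical helper, not the code in `anchor_eval`; it only illustrates that 15/19 fails both the 85 % threshold and the strict 19/19 spec.

```python
import math

def anchor_gate(passed, total, soft_threshold=0.85, strict=True):
    """Return (decision, minimum passes needed) for an anchor eval."""
    needed = total if strict else math.ceil(soft_threshold * total)
    return ("KEEP" if passed >= needed else "REJECT"), needed

print(anchor_gate(15, 19))                 # ('REJECT', 19)
print(anchor_gate(15, 19, strict=False))   # ('REJECT', 17)
print(anchor_gate(19, 19))                 # ('KEEP', 19)
```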

Files this pass touched

tools/surgery/train_code_skeleton_lora_bd6_5.py — new, stratified trainer with checkpointing
data/organ_surgery/phys05_code_skeleton/bd6_5_mixed_train.jsonl — 525-row weighted set
tools/surgery/output/code_skeleton_lora_v5/ — final adapter + 5 mid-checkpoints
tools/surgery/output/Physarum05B-CodeSkeleton-v5/ — merged HF dir (rejected)
physarum05b_code_skeleton_v5.planck — repacked (rejected, archived)
physarum05b_code_skeleton_v5_ckpt100.planck — mid-flight snapshot (rejected, archived)
src/organs/organ_manager.cpp::PHYS05_PACK — flipped to v5/ckpt100 then back to v1
reports/BD6_5_STRATIFIED_15_OF_19.md — this file

The runtime is on the production pack. The gate did its job. The trajectory 0→7→15 with the pattern locked to long-target anchors gives BD6.6 a narrow, well-posed problem to solve.