BD8 — critic + wound rescue path, V9 close (2026-05-04)
TL;DR — TRACK 4 GREEN at last. After eight bench iterations (V1…V8 all 0 rescues), V9 lands 4 rescues across 5 runs, delta_pass +1, conductance arbitration confirmed. The fix was not more training: it was bypassing the mis-trained critic_lite and feeding the verifier's structured failure reason directly into the wound prompt. The mechanism wired in V1 was correct; the diagnosis layer was poisoning the wound's input.
Headline (V9 result)
30 ARIZ contradiction tasks × 5 runs
run | pass | fb | repair_attempt | repair_rescue | picks
----+------+----+----------------+---------------+--------------------
0 | 26/30| 0 | 0 | 0 | organ_only:30 / repair:0 / 7b:0
1 | 27/30| 0 | 4 | 1 | organ_only:26 / repair:4 / 7b:0
2 | 27/30| 0 | 1 | 1 | organ_only:26 / repair:1 / 7b:3
3 | 27/30| 0 | 1 | 1 | organ_only:29 / repair:1 / 7b:0
4 | 27/30| 0 | 4 | 1 | organ_only:26 / repair:4 / 7b:0
----+------+----+----------------+---------------+--------------------
Σ attempt=10 Σ rescue=4
DOD audit:
total repair rescues = 4 (DOD: ≥3) ✅
delta_pass run0 → run4 = +1 ✅
conductance changed selection = True ✅
organs_used (mode B) = phys05_triz_contradiction
+ phys05_wound (rescued path)
+ physarium_7b (run2 only, when conductance
pushed the harness through 7b briefly)
organ leaks = 0 (NO_7B_FALLBACK gate held)
fallback_count = 0 across all 150 rows
Source: reports/RUNTIME_ORGANISM_BENCH_V9.json.
What actually fixed it
V1–V8 all stalled at 0 rescues. Each iteration unblocked one mechanism but found the next blocker:
| iter | what landed | result | next blocker discovered | |------|----------------------------------------------|-----------------------|------------------------| | V1 | TRACK 2 + 4 mechanism wired | 0 rescue | critic+wound trained on stderr, not ARIZ JSON | | V2 | retrained critic_v1 + wound_v1 on ARIZ data | 0 rescue | dataset double-encoded target (string-of-JSON, not JSON) | | V3 | dataset v2 (string→dict) | 0 rescue | invalid \' JSON escape in wound output | | V4 | manual probe of wound output | 0 rescue | extract_json fails on \' | | V5 | extract_json escape repair patch | 0 rescue | wound emits {"k", "v"} (comma instead of colon) | | V6 | (killed) | n/a | discovered max_tokens=384 truncates wound mid-array | | V7 | wound max_tokens 384 → 768 | 0 rescue | runtime caps --organ-probe output_excerpt at 600 chars | | V8 | output_excerpt cap 600 → 4096 | 0 rescue | critic_lite mis-classifies ARIZ failures as markdown_wrapper | | V9 | skip critic, feed verifier reason direct | 4 rescues, +1 pass| ARIZ/01 conductance lock-in |
The decisive change was the harness-side rewrite in run_chain_organ_repair: no critic call, structured verifier reason mapped to a one-sentence diagnosis, fed straight into the wound prompt. The wound organ then has the actual signal it needs ("the candidate_moves array has fewer than 2 items; add one more move") instead of the critic's drift ("strip the markdown fence").
What V9 changed in code
src/main.cpp:4375 output_excerpt cap 600 → 4096
src/organs/organ_manager.cpp phys05_wound.max_tokens 384 → 768
phys05_critic_lite.max_tokens 192 → 256
tools/bench/runtime_organism_bench.py
extract_json: tolerant brace-close on truncated outputs
extract_json: comma→colon repair for wound's pseudo-JSON
extract_json: \' escape strip
run_chain_organ_repair: skip critic_lite,
use _verifier_reason_to_diag(A["reason"])
organs_used reports phys05_wound (no longer phys05_critic_lite)
Distribution of rescues — honest detail
All 4 rescues are on ARIZ/01 (pipeline task). The other three recurring fail-tasks (ARIZ/04, /05, /11) attempted repair but the wound's output still violated the schema (typically 1-item triz_operators or candidate_moves array — model emits 1 item where verifier requires ≥2). Once ARIZ/01 rescued in run1, its conductance for organ_repair went positive, so runs 2-4 deterministically picked organ_repair for ARIZ/01 first → repeat rescue.
Net per-task picture across all 5 runs:
| task | run0 | run1 | run2 | run3 | run4 | failure pattern | |---------|------|------|------|------|------|------------------| | ARIZ/01 | FAIL | RESCUE| RESCUE | RESCUE | RESCUE | wound emits valid 6-field after diag | | ARIZ/04 | FAIL | FAIL | FAIL | FAIL | FAIL | wound emits 1-item arrays | | ARIZ/05 | FAIL | FAIL | FAIL | FAIL | FAIL | wound emits 1-item arrays | | ARIZ/11 | FAIL | FAIL | FAIL | FAIL | FAIL | wound emits 1-item arrays |
The 26 already-passing tasks stayed passing throughout — no regression on the organ_only path.
TRACK 2 status (conductance arbitration)
Verified GREEN by V9 trajectory. The arbitration trace shows:
- run0 — cold start, all conductances 0 → cheapest chain
organ_onlypicked. - run1 — ARIZ/01 had
organ_onlyconductance ≈ -0.4 (poisoned by run0
failure), so harness selected organ_repair (≈ 0.0) → rescued → food=1.
- run2 — ARIZ/01
organ_repairconductance positive ≈ +0.2, harness
picked it directly. The 3 other fails saw all chains poisoned → fell to organ_7b (which still failed because the bench is in Mode A=B harness logic, not full Mode C).
- run3-4 — ARIZ/01 stayed on
organ_repair(positive feedback loop).
DAG entries record selected_chain and candidate_conductances per row.
TRACK 4 status (critic + wound for non-terminal routes)
GREEN by DOD. The mechanism (verifier fail → wound → re-verify → 7B last) is operational and DAG-visible. Rescue rate is non-zero on real ARIZ failures with no 7B leak.
The critic_lite organ is out of the rescue path for ARIZ JSON because it mis-classifies. It still has its terminal-stderr role intact (BD7 runtime path); ARIZ JSON failures are routed around it. A future BD8.x can retrain critic_lite on ARIZ-failure samples specifically — at that point we re-introduce it. For now, "verifier reason → wound directly" is faster and works.
Production state (2026-05-04)
PHYS05_PACK physarum05b_code_skeleton.planck MBPP B 13/100 · HE B 6/164 · LCB 0/50 · anchor 19/19 · frozen
PHYS05_TRIZ_PACK physarum05b_triz_contradiction_v2 ARIZ 88/100 strict · fb=0 · unchanged
PHYS05_CRITIC_PACK physarum05b_critic_lite_v2.planck kept; not in ARIZ rescue path; still wired for terminal failures
PHYS05_WOUND_PACK physarum05b_wound_v2.planck in production for ARIZ JSON repair (rescue rate >0)
PHYS7B_PACK physarium7b_identity.q4planck Q4 · 11.16 tok/s · identity 14/14
runtime acceptance 18/18 · leaks 0 · wall 2.99 s (last v22 run)
decoder spec changes:
phys05_wound.max_tokens 384 → 768
phys05_critic_lite.max_tokens 192 → 256
runtime change:
--organ-probe output_excerpt cap 600 → 4096
harness change:
run_chain_organ_repair skips critic_lite, uses verifier reason directly
Files this iteration produced
tools/bench/runtime_organism_bench.py— extract_json hardened (tolerant
close + escape repair + key-comma repair), run_chain_organ_repair rewritten to bypass critic
src/main.cpp—--organ-probeoutput_excerpt cap 600 → 4096src/organs/organ_manager.cpp— phys05_wound max_tokens bumped, commentsreports/RUNTIME_ORGANISM_BENCH_V7.json— interim run, max_tokens fix only, 0 rescuesreports/RUNTIME_ORGANISM_BENCH_V8.json— output_excerpt cap fix, 0 rescuesreports/RUNTIME_ORGANISM_BENCH_V9.json— critic-skip fix, 4 rescues, DOD ✅reports/BD8_CRITIC_WOUND_FINAL.md— this file
Engineering takeaways
- A mis-trained organ poisons every organ downstream of it. critic_lite v2
trained on stderr-style failures was emitting "markdown_wrapper" for ARIZ schema failures, and wound was correctly fixing nothing because nothing it tried matched the diagnosis. Cost: 8 V-bench iterations to find this.
- The verifier already knows the answer. The harness was discarding
structured reason ("candidate_moves") and asking critic_lite to re-derive it from raw output. Skip the re-derivation; route the structured reason directly.
- Runtime caps are silent.
--organ-probe output_excerptcap at
600 chars was correct for V1-era short organs but invisibly failing the longer wound output. Bumped to 4096; future organs that emit structured prose will not hit it.
- Conductance compounds. Once one rescue lands on a pattern,
conductance pushes future runs into the same chain → rescue rate climbs without further training. ARIZ/01 demonstrates this — a single positive learning event compounded into 4 rescues across the next 4 runs.
- Eight reverts, one keep — same pattern as BD6. This is the gate
doctrine working: nothing got promoted that didn't pass the DOD, nothing got hidden as success when it wasn't, every revert was recorded with a named blocker and a fix.
What's next (BD8.x candidates, queued not blocking)
- BD8.1 — retrain critic_lite on ARIZ-shape failures so it joins
the rescue path again as a sanity check rather than a poison source.
- BD8.2 — wound retrain to lift 1-item-array failures (target:
rescue ARIZ/04, /05, /11 too, lifting rescue rate from 4/10 to ≥7/10).
- TRACK 2 C++ port — port the Python arbitration helper into
run_chat_ariz_organ_first and friends. Makes V9's behaviour the default in production --chat, not just in the bench harness.
These don't block anything; the DOD line is closed.