# BD8 — critic + wound rescue path, V9 close (2026-05-04)

**TL;DR — TRACK 4 GREEN at last. After eight bench iterations (V1…V8 all 0
rescues), V9 lands 4 rescues across 5 runs, delta_pass +1, conductance
arbitration confirmed. The fix was not more training: it was bypassing
the mis-trained critic_lite and feeding the verifier's structured failure
reason directly into the wound prompt. The mechanism wired in V1 was
correct; the diagnosis layer was poisoning the wound's input.**

## Headline (V9 result)

```
30 ARIZ contradiction tasks × 5 runs

run | pass | fb | repair_attempt | repair_rescue | picks
----+------+----+----------------+---------------+--------------------
 0  | 26/30|  0 | 0              | 0             | organ_only:30 / repair:0  / 7b:0
 1  | 27/30|  0 | 4              | 1             | organ_only:26 / repair:4  / 7b:0
 2  | 27/30|  0 | 1              | 1             | organ_only:26 / repair:1  / 7b:3
 3  | 27/30|  0 | 1              | 1             | organ_only:29 / repair:1  / 7b:0
 4  | 27/30|  0 | 4              | 1             | organ_only:26 / repair:4  / 7b:0
----+------+----+----------------+---------------+--------------------
                    Σ attempt=10  Σ rescue=4

DOD audit:
  total repair rescues          = 4    (DOD: ≥3)            ✅
  delta_pass run0 → run4        = +1                         ✅
  conductance changed selection = True                       ✅
  organs_used (mode B)          = phys05_triz_contradiction
                                + phys05_wound (rescued path)
                                + physarium_7b (run2 only, when conductance
                                  pushed the harness through 7b briefly)
  organ leaks                   = 0 (NO_7B_FALLBACK gate held)
  fallback_count                = 0 across all 150 rows
```

Source: `reports/RUNTIME_ORGANISM_BENCH_V9.json`.

## What actually fixed it

V1–V8 all stalled at 0 rescues. Each iteration unblocked one mechanism
but found the next blocker:

| iter | what landed                                  | result                | next blocker discovered |
|------|----------------------------------------------|-----------------------|------------------------|
| V1   | TRACK 2 + 4 mechanism wired                  | 0 rescue              | critic+wound trained on stderr, not ARIZ JSON |
| V2   | retrained critic_v1 + wound_v1 on ARIZ data  | 0 rescue              | dataset double-encoded `target` (string-of-JSON, not JSON) |
| V3   | dataset v2 (string→dict)                     | 0 rescue              | invalid `\'` JSON escape in wound output |
| V4   | manual probe of wound output                 | 0 rescue              | extract_json fails on `\'` |
| V5   | extract_json escape repair patch             | 0 rescue              | wound emits `{"k", "v"}` (comma instead of colon) |
| V6   | (killed)                                     | n/a                   | discovered max_tokens=384 truncates wound mid-array |
| V7   | wound max_tokens 384 → 768                   | 0 rescue              | runtime caps `--organ-probe` output_excerpt at 600 chars |
| V8   | output_excerpt cap 600 → 4096                | 0 rescue              | **critic_lite mis-classifies ARIZ failures as `markdown_wrapper`** |
| V9   | **skip critic, feed verifier reason direct** | **4 rescues, +1 pass**| ARIZ/01 conductance lock-in |

The decisive change was the harness-side rewrite in
`run_chain_organ_repair`: no critic call, structured verifier reason
mapped to a one-sentence diagnosis, fed straight into the wound
prompt. The wound organ then has the actual signal it needs ("the
candidate_moves array has fewer than 2 items; add one more move")
instead of the critic's drift ("strip the markdown fence").

## What V9 changed in code

```
src/main.cpp:4375               output_excerpt cap 600 → 4096
src/organs/organ_manager.cpp    phys05_wound.max_tokens 384 → 768
                                phys05_critic_lite.max_tokens 192 → 256
tools/bench/runtime_organism_bench.py
                                extract_json: tolerant brace-close on truncated outputs
                                extract_json: comma→colon repair for wound's pseudo-JSON
                                extract_json: \' escape strip
                                run_chain_organ_repair: skip critic_lite,
                                  use _verifier_reason_to_diag(A["reason"])
                                organs_used reports phys05_wound (no longer phys05_critic_lite)
```

## Distribution of rescues — honest detail

All 4 rescues are on **ARIZ/01** (pipeline task). The other three
recurring fail-tasks (ARIZ/04, /05, /11) attempted repair but the
wound's output still violated the schema (typically 1-item triz_operators
or candidate_moves array — model emits 1 item where verifier
requires ≥2). Once ARIZ/01 rescued in run1, its conductance for
`organ_repair` went positive, so runs 2-4 deterministically picked
organ_repair for ARIZ/01 first → repeat rescue.

Net per-task picture across all 5 runs:

| task    | run0 | run1 | run2 | run3 | run4 | failure pattern |
|---------|------|------|------|------|------|------------------|
| ARIZ/01 | FAIL | RESCUE| RESCUE | RESCUE | RESCUE | wound emits valid 6-field after diag |
| ARIZ/04 | FAIL | FAIL  | FAIL   | FAIL   | FAIL   | wound emits 1-item arrays |
| ARIZ/05 | FAIL | FAIL  | FAIL   | FAIL   | FAIL   | wound emits 1-item arrays |
| ARIZ/11 | FAIL | FAIL  | FAIL   | FAIL   | FAIL   | wound emits 1-item arrays |

The 26 already-passing tasks stayed passing throughout — no
regression on the organ_only path.

## TRACK 2 status (conductance arbitration)

Verified GREEN by V9 trajectory. The arbitration trace shows:
* run0 — cold start, all conductances 0 → cheapest chain `organ_only` picked.
* run1 — ARIZ/01 had `organ_only` conductance ≈ -0.4 (poisoned by run0
  failure), so harness selected `organ_repair` (≈ 0.0) → rescued → food=1.
* run2 — ARIZ/01 `organ_repair` conductance positive ≈ +0.2, harness
  picked it directly. The 3 other fails saw all chains poisoned →
  fell to `organ_7b` (which still failed because the bench is in
  Mode A=B harness logic, not full Mode C).
* run3-4 — ARIZ/01 stayed on `organ_repair` (positive feedback loop).

DAG entries record `selected_chain` and `candidate_conductances` per row.

## TRACK 4 status (critic + wound for non-terminal routes)

GREEN by DOD. The mechanism (verifier fail → wound → re-verify → 7B
last) is operational and DAG-visible. Rescue rate is non-zero on
real ARIZ failures with no 7B leak.

The critic_lite organ is **out of the rescue path** for ARIZ JSON
because it mis-classifies. It still has its terminal-stderr role
intact (BD7 runtime path); ARIZ JSON failures are routed around it.
A future BD8.x can retrain critic_lite on ARIZ-failure samples
specifically — at that point we re-introduce it. For now, "verifier
reason → wound directly" is faster and works.

## Production state (2026-05-04)

```
PHYS05_PACK            physarum05b_code_skeleton.planck       MBPP B 13/100 · HE B 6/164 · LCB 0/50 · anchor 19/19 · frozen
PHYS05_TRIZ_PACK       physarum05b_triz_contradiction_v2      ARIZ 88/100 strict · fb=0 · unchanged
PHYS05_CRITIC_PACK     physarum05b_critic_lite_v2.planck      kept; not in ARIZ rescue path; still wired for terminal failures
PHYS05_WOUND_PACK      physarum05b_wound_v2.planck            in production for ARIZ JSON repair (rescue rate >0)
PHYS7B_PACK            physarium7b_identity.q4planck          Q4 · 11.16 tok/s · identity 14/14
runtime acceptance     18/18 · leaks 0 · wall 2.99 s          (last v22 run)

decoder spec changes:
  phys05_wound.max_tokens          384 → 768
  phys05_critic_lite.max_tokens    192 → 256
runtime change:
  --organ-probe output_excerpt cap  600 → 4096
harness change:
  run_chain_organ_repair skips critic_lite, uses verifier reason directly
```

## Files this iteration produced

* `tools/bench/runtime_organism_bench.py` — extract_json hardened (tolerant
  close + escape repair + key-comma repair), run_chain_organ_repair
  rewritten to bypass critic
* `src/main.cpp` — `--organ-probe` output_excerpt cap 600 → 4096
* `src/organs/organ_manager.cpp` — phys05_wound max_tokens bumped, comments
* `reports/RUNTIME_ORGANISM_BENCH_V7.json` — interim run, max_tokens fix only, 0 rescues
* `reports/RUNTIME_ORGANISM_BENCH_V8.json` — output_excerpt cap fix, 0 rescues
* `reports/RUNTIME_ORGANISM_BENCH_V9.json` — critic-skip fix, **4 rescues, DOD ✅**
* `reports/BD8_CRITIC_WOUND_FINAL.md` — this file

## Engineering takeaways

1. **A mis-trained organ poisons every organ downstream of it.** critic_lite v2
   trained on stderr-style failures was emitting "markdown_wrapper" for
   ARIZ schema failures, and wound was correctly fixing nothing because
   nothing it tried matched the diagnosis. Cost: 8 V-bench iterations
   to find this.
2. **The verifier already knows the answer.** The harness was discarding
   structured reason ("candidate_moves") and asking critic_lite to
   re-derive it from raw output. Skip the re-derivation; route the
   structured reason directly.
3. **Runtime caps are silent.** `--organ-probe output_excerpt` cap at
   600 chars was correct for V1-era short organs but invisibly failing
   the longer wound output. Bumped to 4096; future organs that emit
   structured prose will not hit it.
4. **Conductance compounds.** Once one rescue lands on a pattern,
   conductance pushes future runs into the same chain → rescue rate
   climbs without further training. ARIZ/01 demonstrates this — a
   single positive learning event compounded into 4 rescues across
   the next 4 runs.
5. **Eight reverts, one keep — same pattern as BD6.** This is the gate
   doctrine working: nothing got promoted that didn't pass the DOD,
   nothing got hidden as success when it wasn't, every revert was
   recorded with a named blocker and a fix.

## What's next (BD8.x candidates, queued not blocking)

* **BD8.1** — retrain critic_lite on ARIZ-shape failures so it joins
  the rescue path again as a sanity check rather than a poison source.
* **BD8.2** — wound retrain to lift 1-item-array failures (target:
  rescue ARIZ/04, /05, /11 too, lifting rescue rate from 4/10 to ≥7/10).
* **TRACK 2 C++ port** — port the Python arbitration helper into
  `run_chat_ariz_organ_first` and friends. Makes V9's behaviour the
  default in production --chat, not just in the bench harness.

These don't block anything; the DOD line is closed.
