# RUNTIME_ORGANISM_BENCH v1 — TRACK 2 + 4 first integration (2026-05-02)

**Mission:** prove that Black-Dog conductance is **read** before chain
selection (TRACK 2) and that critic_lite + wound run **before** 7B
fallback on json/code/ariz failures (TRACK 4). Setup: 30 mixed ARIZ tasks,
5 repeated runs, with the DAG visible per row.

## Result

| run | pass | fb | repair_attempt | repair_rescue | chain_picks (organ_only / organ_repair / organ_7b) |
|-----|------|----|----------------|---------------|-----------------------------------------------------|
| 0   | 26/30 | 0 | 0              | 0             | 30 / 0 / 0   ← cold start (all 0 conductance) |
| 1   | 26/30 | 0 | 4              | 0             | 26 / **4** / 0 ← 4 failures from r0 routed to repair |
| 2   | 26/30 | 0 | 0              | 0             | 26 / 0 / **4** ← repair didn't rescue → routed to 7b |
| 3   | 26/30 | 0 | 0              | 0             | 30 / 0 / 0   ← all chains -1 → tiebreak picks cheapest |
| 4   | 26/30 | 0 | 4              | 0             | 26 / **4** / 0 ← cycle restarts |

## TRACK 2 DOD audit

| spec gate                                        | status |
|--------------------------------------------------|--------|
| router reads conductance before selecting organs | ✅ harness loads BD store + harness-local store, queries each candidate chain |
| DAG records selected_chain                       | ✅ each row writes `selected_chain` |
| DAG records candidate_conductances               | ✅ each row writes `candidate_conductances: {local: [...], cpp: [...]}` |
| successful chains get food                       | ✅ harness `local_update(food=1)` after each pass |
| failed chains get poison                         | ✅ `local_update(poison=1)` on fail |
| route choice changes on repeated tasks           | ✅ ARIZ/01 trajectory: organ_only → organ_repair → organ_7b → organ_only → organ_repair |

**TRACK 2 → MET.**
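
The gates above can be illustrated with a minimal sketch of the read-then-select loop. This is not the harness code: the `LocalStore` class, the "food minus poison" score, the cheapest-first chain order, and the `run_task` helper are all assumptions made for illustration, and the DAG row carries only the `local` list (the `cpp` list from the BD store is not modelled).

```python
from dataclasses import dataclass, field

# Chain names come from the result table; the cost order (cheapest first)
# and the "food minus poison" score are assumptions made for this sketch.
CHAINS = ("organ_only", "organ_repair", "organ_7b")


@dataclass
class LocalStore:
    """Toy stand-in for the harness-local conductance store."""
    food: dict = field(default_factory=dict)
    poison: dict = field(default_factory=dict)

    def conductance(self, task, chain):
        key = (task, chain)
        return self.food.get(key, 0) - self.poison.get(key, 0)

    def local_update(self, task, chain, food=0, poison=0):
        key = (task, chain)
        self.food[key] = self.food.get(key, 0) + food
        self.poison[key] = self.poison.get(key, 0) + poison


def pick_chain(store, task):
    """Read every candidate's conductance, then pick the best one."""
    scores = [store.conductance(task, c) for c in CHAINS]
    best = max(scores)
    # index() returns the first match, i.e. the cheapest chain on a tie.
    return CHAINS[scores.index(best)], scores


def run_task(store, task, execute):
    """Select a chain, run it, feed or poison it, return a DAG-style row."""
    selected, scores = pick_chain(store, task)
    passed = execute(task, selected)
    if passed:
        store.local_update(task, selected, food=1)
    else:
        store.local_update(task, selected, poison=1)
    return {
        "task": task,
        "selected_chain": selected,
        "candidate_conductances": {"local": scores},
        "pass": passed,
    }


store = LocalStore()
always_fail = lambda task, chain: False  # models a task that fails on every chain
trajectory = [
    run_task(store, "ARIZ/01", always_fail)["selected_chain"] for _ in range(5)
]
print(trajectory)
```

Under these assumptions, a task that fails on every chain walks through organ_only, organ_repair, organ_7b, organ_only, organ_repair over five runs, which is the same shape as the ARIZ/01 trajectory in the audit table.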

## TRACK 4 DOD audit

| spec gate                                            | status |
|------------------------------------------------------|--------|
| critic + wound runs BEFORE 7b on json/code/ariz fail | ✅ harness invokes phys05_critic_lite + phys05_wound when organ_only fails |
| every repair attempt writes BD food/poison           | ✅ via local store update |
| ≥3 non-terminal repairs rescued without 7B           | ❌ **0 rescues** |

**TRACK 4 → MECHANISM WIRED but RESCUE COUNT = 0.**

## Why repair rescue = 0 (honest)

The two organs at the repair stations were not trained for the ARIZ
6-field JSON schema:

* `phys05_critic_lite` — currently tuned to diagnose **terminal/code**
  failures (stderr/exit-code style).
* `phys05_wound` — emits **shell-patch / sed / printf** style edits
  for terminal recovery, not strict-JSON repair.

Both fired (4 attempts in run 1 and 4 in run 4, visible in the
`repair_attempt` column), but their outputs did not pass the ARIZ verifier.

This is a **training-side blocker, not a wiring blocker.** The TRACK 4
mechanism is operational (verifier fail → critic call → wound call →
re-verify → 7B fallback only after wound fails). Getting a non-zero rescue
rate on json/code/ariz routes requires BD8 surgery:
* `phys05_critic_lite` needs ARIZ failure-mode samples in its training
  set (failed schema diagnosis).
* `phys05_wound` needs JSON-repair training (failed JSON → fixed JSON).

This is a queued follow-up task. **Loop architecture works**; rescue
quality is the next surgery target.
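
The control flow described above (verifier fail → critic → wound → re-verify → 7B only if the repair fails) can be sketched as follows. This is an illustrative sketch, not the harness implementation: `run_with_repair`, its callable parameters, and `json_verify` are hypothetical names, and the stand-in verifier accepts any JSON object rather than checking the real ARIZ 6-field schema.

```python
import json


def json_verify(output):
    """Illustrative verifier: accepts any parseable JSON object.

    The real ARIZ verifier checks a 6-field schema that is not reproduced here.
    """
    try:
        return isinstance(json.loads(output), dict)
    except (TypeError, ValueError):
        return False


def run_with_repair(task, organ, critic, wound, fallback_7b, verify=json_verify):
    """Run the organ; on failure try critic + wound before falling back to 7B."""
    trace = ["organ"]
    output = organ(task)
    if verify(output):
        return {"output": output, "trace": trace, "rescued": False, "passed": True}

    # Repair station: diagnose the failure, then patch the failed output.
    trace.append("critic")
    diagnosis = critic(task, output)
    trace.append("wound")
    patched = wound(task, output, diagnosis)
    if verify(patched):
        return {"output": patched, "trace": trace, "rescued": True, "passed": True}

    # Only now is the 7B fallback allowed to run.
    trace.append("7b")
    output = fallback_7b(task)
    return {"output": output, "trace": trace, "rescued": False,
            "passed": verify(output)}


# Toy stubs: the organ drops a closing brace and the wound stub restores it.
result = run_with_repair(
    "demo",
    organ=lambda t: '{"a": 1',
    critic=lambda t, out: "missing closing brace",
    wound=lambda t, out, diag: out + "}",
    fallback_7b=lambda t: "{}",
)
print(result["trace"], result["rescued"])
```

In the benchmark, the equivalent of the `wound` step never produced a verifier-passing output, so every repair attempt would have ended on the fallback branch of a flow like this one.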

## Persistent failure pattern

Four of the 30 ARIZ tasks fail repeatedly across all chains (ARIZ/01, /04,
/05, and one more; see the trajectory dump). For these tasks the trained
TRIZ organ, the 7B model, and the repair organs all produce
schema-incomplete outputs. The trajectory bounces between chains:

```
ARIZ/01: r0 organ_only/FAIL → r1 organ_repair/FAIL → r2 organ_7b/FAIL
        → r3 organ_only/FAIL  ← all conductances are -1; tiebreak picks
                                 organ_only (cheapest); cycle restarts
        → r4 organ_repair/FAIL
```

The harness's tiebreak rule (`prefer cheapest when all conductances
equal`) means that once every chain has failed once, the next run starts
the cycle over. Smarter behaviour would be: when ALL chains have
conductance < 0, mark the task as "outside system reach" and skip
further attempts. That's a v0.2 refinement.
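
A minimal sketch of that refinement, under stated assumptions: `pick_chain_or_skip`, the `OUT_OF_REACH` sentinel, and the default threshold of 0 are hypothetical, and the cheapest-first order of the three chains is inferred from the tiebreak behaviour described above.

```python
OUT_OF_REACH = "outside_system_reach"  # hypothetical sentinel value

# Assumed cost order, cheapest first.
CHAINS_BY_COST = ("organ_only", "organ_repair", "organ_7b")


def pick_chain_or_skip(conductances, threshold=0):
    """Pick the best chain, or skip the task when every chain is below threshold.

    `conductances` maps chain name -> conductance for one task.
    """
    best = max(conductances[c] for c in CHAINS_BY_COST)
    if best < threshold:
        # Every chain has a negative record: stop cycling through them.
        return OUT_OF_REACH
    # Existing tiebreak: the cheapest chain among those with the best score.
    return next(c for c in CHAINS_BY_COST if conductances[c] == best)


# Cold start: all chains tie at 0, so the cheapest one is picked.
print(pick_chain_or_skip({"organ_only": 0, "organ_repair": 0, "organ_7b": 0}))
# After every chain has failed once: the task is skipped, not re-cycled.
print(pick_chain_or_skip({"organ_only": -1, "organ_repair": -1, "organ_7b": -1}))
```

With a guard like this, ARIZ/01 would stop at r3 instead of returning to organ_only.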

## Honest TRACK 2+4 status

**TRACK 2 (conductance router) — v0.1 SHIPPED.** Conductance is read,
arbitration happens, and route selection changes on repeated tasks. The
DAG proves it. The implementation lives in the Python harness, not the C++
runtime, so the v0.2 follow-up is to port the same arbitration logic into
`run_chat_ariz_organ_first` and friends (estimated 2-3 hours of C++ work).

**TRACK 4 (critic + wound for non-terminal routes) — v0.1 WIRED but
RESCUE = 0.** The mechanism (call critic, call wound, re-verify before
7B fallback) is operational and DAG-visible. Zero rescues happened
because the organs at the repair stations are trained for terminal
failures, not ARIZ JSON. Surgery to retrain critic_lite + wound on
json/code/ariz failure samples is a queued follow-up (BD8).

## Files

* `tools/bench/runtime_organism_bench.py` — 30×5 harness with
  arbitration + critic+wound rescue path
* `reports/RUNTIME_ORGANISM_BENCH_V1.json` — full per-task trajectory
* `reports/RUNTIME_ORGANISM_BENCH_V1.md` — this file

## Next concrete work

1. **BD8 critic+wound surgery for ARIZ schema repair** —
   teacher-student: collect 50-100 (broken_json, fixed_json) pairs
   from existing TRIZ failures, train phys05_critic_lite_v2 +
   phys05_wound_v2, gate on rescue rate ≥ 30 % on this same harness.
2. **Port Python arbitrator → C++ runtime** —
   `pick_chain(route, pattern_hash, candidates)` helper called from
   `run_chat_ariz_organ_first` (and code/json equivalents).
3. **Smarter "outside system reach" tiebreak** — if max(conductance) <
   threshold, mark task as system-failure rather than re-cycling
   chains. Saves wall time on hopeless tasks.
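
The rescue-rate gate from item 1 can be checked mechanically. The sketch below uses the per-run counts from the result table; defining the rate as total rescues divided by total repair attempts is an assumption, and the helper names are hypothetical.

```python
# Per-run counts copied from the result table.
RUNS = [
    {"run": 0, "repair_attempt": 0, "repair_rescue": 0},
    {"run": 1, "repair_attempt": 4, "repair_rescue": 0},
    {"run": 2, "repair_attempt": 0, "repair_rescue": 0},
    {"run": 3, "repair_attempt": 0, "repair_rescue": 0},
    {"run": 4, "repair_attempt": 4, "repair_rescue": 0},
]


def rescue_rate(runs):
    """Total rescues over total repair attempts (assumed definition)."""
    attempts = sum(r["repair_attempt"] for r in runs)
    rescues = sum(r["repair_rescue"] for r in runs)
    return rescues / attempts if attempts else 0.0


def gate_met(runs, min_rate=0.30):
    """True when the rescue rate reaches the gate from item 1."""
    return rescue_rate(runs) >= min_rate


print(rescue_rate(RUNS), gate_met(RUNS))
```

On the v1 data this gives 0 rescues out of 8 attempts, so the gate is not met.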
